ChoJin’s Quarter
Computer Sciences, Cooking, Photography and Filmmaking…
Ruby and Web Scraping/Automation

Posted on Sunday 6 September 2009

Sometimes you encounter a web task that is soooo boring you wish you had a robot doing it for you. I encountered a few of those this summer, so I figured it was time to acquire some Web Scraping and Automation Ninja skillz.

It turned out to be fairly easy, thanks to some very useful libraries made for exactly this purpose. I'm therefore going to give you a quick overview of how simple it is, show you how much you can achieve with only a few lines of code, and conclude this blog post with a real-world script example proving that med school students should learn programming :P

Ruby and Mechanize to the rescue

Nowadays, when it comes to scripting, I would suggest two languages: Ruby or Python. They are so powerful and so convenient that it's very hard to argue for anything else unless you have strong constraints.

Ruby or Python?
It's mostly a matter of taste, and my heart leans toward Ruby for various reasons (which I won't go into here, to avoid an epic flame war), hence the use of Ruby here. In Ruby land, I found two libraries to be quite valuable for web scraping, automation and pentesting: Mechanize and WWMD.

In this post, I'm going to illustrate Mechanize; WWMD will be used in a later post. With Mechanize it's fairly easy to connect to an HTTP or HTTPS web page, retrieve its content, parse it, fill in some forms, click a button and so on.

A basic introduction to Mechanize

For example, let's say you have an HTTPS web page with a form named "connexion", text fields for the user and pass information, and a button to log in. You could use the following code to authenticate:

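  # Note: Mechanize 1.0 and later drop the WWW:: namespace (use Mechanize.new).
  agent = WWW::Mechanize.new { |a|
    a.user_agent_alias = 'Mac FireFox'
  }

  # Fetch the login page and fill in the form named "connexion"
  page = agent.get('https://www.example.com/')
  connform = page.form('connexion')
  connform.user = "user1"
  connform.pass = "pass1"
  # Submit the form by "pressing" its first button
  page = agent.submit(connform, connform.buttons.first)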

Easy enough, isn't it?

Mechanize has a lot of syntactic sugar. For example, if the form has multiple buttons with assigned names, you could do:

  connform.button_with(:name => 'login')

The "_with" sugar applies to pretty much anything the page object contains, such as links, forms, buttons, etc.
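
If you're curious how this kind of lookup can work, here is a toy sketch in plain Ruby. It is only an illustration of the idea, not Mechanize's actual implementation: `Link` and `find_with` are hypothetical names I made up for the example, and the links are invented data.

```ruby
# Illustration only: a toy version of the "_with" lookup idea.
# Link and find_with are hypothetical names, not part of Mechanize.
Link = Struct.new(:text, :href)

# Return the first item whose attributes match every criterion.
def find_with(items, criteria)
  items.find do |item|
    # === lets a criterion be an exact String or a Regexp
    criteria.all? { |attr, expected| expected === item.public_send(attr) }
  end
end

# Invented sample data standing in for the links of a page
links = [
  Link.new('Home', '/'),
  Link.new('News', '/news'),
  Link.new('News archive', '/news/archive')
]

puts find_with(links, :text => 'News').href
puts find_with(links, :text => /archive/).href
```

The first call matches the exact text 'News', the second one matches with a regular expression.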

With Mechanize it's just as easy to check a radio button or select an option from a select list:

  form.radiobuttons_with(:name => 'box').first.check
  form.field_with(:name => 'list').options[0].select

Or 'clicking' on a link:

  agent.click page.link_with(:text => 'News')

Mechanize also supports standard XPath scraping, so let's say you want to find the bold text in a paragraph inside a div with id='foo':

  value = page.search("//div[@id='foo']/p/b").text
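
To play with that XPath expression without hitting a real website, here is a small offline sketch. It uses REXML (which ships with Ruby) instead of Mechanize, so the API differs from `page.search`, and REXML needs well-formed markup. The HTML snippet is made up for the example.

```ruby
require 'rexml/document'

# Hypothetical, well-formed snippet standing in for a fetched page
html = <<~HTML
  <html><body>
    <div id="foo"><p>Current choice: <b>SP8</b></p></div>
    <div id="bar"><p><b>ignored</b></p></div>
  </body></html>
HTML

doc = REXML::Document.new(html)

# Same XPath as above: the bold text in a paragraph inside div#foo
node = REXML::XPath.first(doc, "//div[@id='foo']/p/b")
value = node ? node.text : nil
puts value
```

Only the bold element under the div with id 'foo' is picked up; the one under 'bar' is not matched by the expression.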

You get the idea: everything is fairly straightforward, and I strongly urge you to read Mechanize's documentation and a general XPath guide.

A real world example

In a later post, I'll show you how to apply these principles to implement a game bot/trainer, but for this small introduction, let's take a simple example.

At my school, first-year students have to choose a schedule (out of 8) at the beginning of the year. Some schedules are more popular than others, and only a few places are released for each one, at random hours every day. Because of this, at any given time you have approx. 100 students frantically pressing the 'F5' key to try to catch a spot in the popular schedules (yeah, first-year med school students are under high pressure and become robots ;) ). When places are added, they are all gone in approx. 2s or less, so you need one of the following to secure a spot:

  • Being lucky
  • Being patient
  • Being a robot in front of your computer 24/7
  • Having Ninja speed and skills
  • All of the above

Or... you can prove yet again that brain 0wn$ muscle and learn Ruby programming ;)

The following script is fairly simple and should be a good example of multiple features of Mechanize. It basically authenticates to the website with the given user credentials, then tries to select a given schedule. If it fails, it sleeps 1s (remember, places are all gone in less than 2s for the popular ones...) and tries again.
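
Stripped of the web part, that try/sleep/retry logic boils down to a small polling loop. Here is a minimal sketch in plain Ruby: `poll_until` is a hypothetical helper of mine (it is not part of Mechanize), and the block stands in for the real "try to grab the schedule" request. I added an optional attempt limit, which the script further down does not have.

```ruby
# poll_until is a hypothetical helper, not part of Mechanize.
# It calls the block until it returns a truthy value, sleeping
# `interval` seconds between attempts. It returns the number of
# attempts made, or nil if max_attempts is reached without success.
def poll_until(interval: 1, max_attempts: nil)
  attempts = 0
  loop do
    attempts += 1
    return attempts if yield(attempts)
    return nil if max_attempts && attempts >= max_attempts
    sleep(interval)
  end
end

# Stand-in for the real request: pretend a spot only opens
# on the third attempt.
tries = poll_until(interval: 0.01, max_attempts: 10) { |n| n == 3 }
puts "Succeeded after #{tries} attempts"
```

With the 1s interval described above you would simply call it with `interval: 1` and no `max_attempts`.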

When a schedule is available, the script still has to choose a sub-group. So it enters the Ninja auto-fire mode and quickly tries all the groups available in the form's select list. As soon as it succeeds, it validates the choice and stops. There are also a few sanity checks in the script to avoid selecting the wrong schedule/group in case the server attempts to fool automation...
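
The auto-fire idea can be sketched without any HTTP at all: try each group in turn and only accept one when the reply matches exactly what we expect. In this sketch, `first_confirmed_group`, the group names 'A', 'B', 'C' and the fake server replies are all assumptions made up for the illustration.

```ruby
# first_confirmed_group is a hypothetical helper for illustration.
# It yields each group to the block (a stand-in for submitting the
# form) and returns the first group whose reply passes the sanity
# check, or nil if none does.
def first_confirmed_group(sp, groups)
  groups.each do |grp|
    expected = "SP#{sp}#{grp}"
    reply = yield(grp)
    # Sanity check: only trust an exact match
    return grp if reply == expected
    puts "Ooops... expected #{expected.inspect}, got #{reply.inspect}"
  end
  nil
end

# Fake server: groups A and B are full, C is accepted.
fake_server = { 'A' => '', 'B' => '', 'C' => 'SP8C' }
winner = first_confirmed_group(8, %w[A B C]) { |grp| fake_server[grp] }
puts "Registered in group #{winner}"
```

Comparing against the full expected string, rather than just checking for a non-empty reply, is what keeps the loop from accepting a wrong schedule/group.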

I introduced a few bugs on purpose in this script to prevent script kiddies from my school from reusing it as is, since it is given here solely to illustrate the subject matter. I'm making a wild guess that if one of them is smart enough to figure out how to install the Ruby environment and dependencies, fix the bugs I introduced and run it successfully, they would probably be skilled enough to write it themselves. So no real harm is done :)

Here is the script (with anti-script kiddies bugs). I hope it is self-documenting enough to be understood. Drop me a comment otherwise.

  #! /usr/bin/env ruby

  require 'rubygems'
  require 'pp'
  require 'mechanize'

  def print_curchoice(agent, link)
    page = agent.click link
    curchoice = page.search("//div[@id='page']/p/b").text
    puts "Current choice: #{curchoice}"
  end

  def select_sp(agent, link, sp)
    page = agent.click link
    form = page.forms.first
    form.section = "SP#{sp}"
    puts "Selecting SP#{sp}..."
    page = agent.submit(form, form.buttons.first)
    error = page.search("//div[@id='page']/p[@class='erreur']").first
    return false unless error
    form = page.forms.first
    # Iterate through the group until we can find a successful one
    for option in form.field_with(:name => 'group').options do
      option.select
      grp = form.groupe
      puts "Selecting group #{grp}..."
      result = agent.submit(form, form.buttons.first)
      section = result.search("//div[@id='egap']/p/b").text
      # Sanity check, just in case...
      unless section && section == "SP#{sp}#{grp}"
        puts "Ooops..."
        next
      end
      puts "Validation..."
      result = agent.submit(result.forms.first, result.forms.first.button_with('bdeclare'))
      msg = result.search("//div[@id='page']/p[@class='msg']").text
      unless msg && msg == "Votre inscription au groupe SP#{sp}#{grp} a été enregistrée."
        puts "Ooops..."
        next
      end
      puts "Epic Win! ;) "
      return true
    end
    return false
  end

  if ARGV.size != 3
    puts "========================="
    puts "SPSelect by ChoJin © 2009"
    puts "========================="
    puts "./spselect.rb <login> <password> <SP number>"
    puts "example: to register to the section SP1 with username 'user1' and password 'pass1':"
    puts "./spselect.rb user1 pass1 1"
    exit 1
  end

  agent = WWW::Mechanize.new { |a|
    a.user_agent_alias = 'Mac FireFox'
  }

  page = agent.get('https://www.biomedicale.univ-paris5.fr/scola/scola/')
  connform = page.form('connexion')
  connform.user = ARGV[0]
  connform.pass = ARGV[1]
  page = agent.submit(connform, connform.buttons.first)

  link_choice = page.link_with(:text => 'Choix semestre 1')
  link_cur_choice = page.link_with(:text => 'Mon groupe semestre 1')

  print_curchoice(agent, link_cur_choice)

  while true
    ret = select_sp(agent, link_choice, 8)
    break if ret == true
    sleep(1)
  end

That's all folks!
