Sometimes you run into a web task that is soooo boring you wish you had a robot doing it for you. I ran into a few of those this summer, so I figured it was time to acquire some Web Scraping and Automation Ninja skillz.
It turned out to be fairly easy, thanks to some very useful libraries made for exactly this purpose. So I'm going to give you a quick overview of how simple it is, show you how much you can achieve with only a few lines of code, and conclude this post with a real-world script proving that med school students should learn programming.
Ruby and Mechanize to the rescue
Nowadays, when it comes to scripts I would suggest two languages: Ruby or Python. They are so powerful and so convenient, it's very hard to argue for anything else unless you have strong constraints.
Ruby or Python?
It's mostly a matter of taste, and my heart leans toward Ruby for various reasons (which I won't go into here, to avoid an epic flame war), hence the use of Ruby in this post. In Ruby land, I found two libraries to be quite valuable for web scraping, automation and pentesting: Mechanize and WWMD.
In this post I'm going to illustrate Mechanize; WWMD will be covered in a later post. With Mechanize it's fairly easy to connect to an HTTP or HTTPS web page, retrieve its content, parse it, fill in forms, click a button and so on.
A basic introduction to Mechanize
For example, let's say you have an HTTPS web page with a form named "connexion", text fields named "user" and "pass" for the credentials, and a button to log in. You could use the following code to authenticate:
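
    require 'mechanize'

    # Releases before Mechanize 1.0 exposed this class as WWW::Mechanize.
    agent = Mechanize.new { |a|
      # Pretend to be a desktop browser
      a.user_agent_alias = 'Mac Firefox'
    }

    # Fetch the login page and grab the form named "connexion"
    page = agent.get('https://www.example.com/')
    connform = page.form('connexion')
    # Fill in the "user" and "pass" fields, then submit with the first button
    connform.user = "user1"
    connform.pass = "pass1"
    page = agent.submit(connform, connform.buttons.first)
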
Easy enough, isn't it?
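To get a feel for what that submit call puts on the wire, here is a rough sketch using only Ruby's standard library. It builds (but never sends) a URL-encoded POST carrying the two fields. The "/login" action path and the User-Agent string are made up for illustration, and I'm assuming the form uses method="post"; Mechanize reads all of that from the page for you and also takes care of cookies and redirects.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) the kind of request a login form submission produces.
# Assumptions: the form uses method="post", its action is "/login" (hypothetical),
# and its fields are named "user" and "pass" as in the example above.
def build_login_request(base_url, user, pass)
  uri = URI.join(base_url, '/login')
  request = Net::HTTP::Post.new(uri)
  request['User-Agent'] = 'Mozilla/5.0 (example)'  # placeholder value
  # Encodes the fields as application/x-www-form-urlencoded
  request.set_form_data('user' => user, 'pass' => pass)
  request
end

request = build_login_request('https://www.example.com/', 'user1', 'pass1')
puts "#{request.method} #{request.path}"
puts request['Content-Type']
puts request.body
```

Nothing here touches the network, so it is safe to run as is. The point is only that a form submission is an ordinary HTTP request, which is why it is so easy to automate.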
Mechanize has a lot of syntactic sugar. For example, if the form has multiple buttons with assigned names, you could do:

    connform.button_with(:name => 'login')

The "_with" sugar applies to pretty much anything the page object contains, such as links, forms, buttons, etc.
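If you wonder how such finders behave, here is a simplified model of the idea. It is not Mechanize's actual source: the Link struct, the select_with helper and the sample data are all made up for illustration. The useful trick is matching with ===, so that a String criterion has to match exactly while a Regexp criterion matches by pattern.

```ruby
# A simplified model of the "_with" finders, not Mechanize's actual source.
# Link and the sample data below are hypothetical.
Link = Struct.new(:text, :href)

# Keep the items whose attributes match every criterion.
# Using === lets a String match exactly and a Regexp match by pattern.
def select_with(items, criteria)
  items.select do |item|
    criteria.all? { |attribute, expected| expected === item.public_send(attribute) }
  end
end

links = [
  Link.new('News', '/news'),
  Link.new('Old news', '/archive'),
  Link.new('Contact', '/contact')
]

puts select_with(links, text: 'News').map(&:href).inspect   # exact match
puts select_with(links, text: /news/i).map(&:href).inspect  # pattern match
```

The first call keeps only the link whose text is exactly 'News'; the second keeps both links whose text contains "news" in any case.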
With Mechanize it's just as easy to check a radio button or select an option from a selection list:

    form.radiobuttons_with(:name => 'box').first.check
    form.field_with(:name => 'list').options[0].select

Or 'clicking' on a link:

    agent.click page.link_with(:text => 'News')

Mechanize also supports standard XPath scraping, so let's say you want to find the bold text in a paragraph inside a div with id='foo':

    value = page.search("//div[@id='foo']/p/b").text

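If you want to play with that XPath expression without hitting a real website, here is a small offline sketch. It uses REXML from Ruby's standard distribution rather than Mechanize, so it assumes the markup is well-formed XML (Mechanize hands real-world HTML to a much more forgiving parser). The bold_text_in_div helper and the sample markup are made up for the example.

```ruby
require 'rexml/document'

# Return the text of the first <b> inside a <p> directly under <div id="...">,
# or nil when nothing matches.
# Assumption: the markup is well-formed XML, which real-world HTML often is not.
def bold_text_in_div(markup, div_id)
  doc = REXML::Document.new(markup)
  node = REXML::XPath.first(doc, "//div[@id='#{div_id}']/p/b")
  node && node.text
end

sample = <<~HTML
  <html><body>
    <div id="foo"><p>Current choice: <b>SP8</b></p></div>
    <div id="bar"><p><b>ignored</b></p></div>
  </body></html>
HTML

puts bold_text_in_div(sample, 'foo')
```

Only the bold text under the div with id "foo" is returned; the one under "bar" is skipped because the predicate on @id filters it out.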
You get the idea: everything is fairly straightforward. I strongly urge you to read Mechanize's documentation and a general XPath guide.
A real world example
In a later post, I'll show you how to apply these principles to implement a game bot/trainer, but for this small introduction, let's take a simple example.
At my school, first-year students have to choose a schedule (among 8 of them) at the beginning of the year. Some are more popular than others, and only a few places are released for each schedule at random hours every day. Because of this, at any given time you have approx. 100 students frenetically pressing the 'F5' key to try to catch a spot in the popular schedules (yeah, first-year med school students are under high pressure and become robots). When places are added, they are all gone in approx. 2s or less, so you need one of the following to secure a spot:
- Being Lucky
- Being patient
- Being a robot in front of your computer 24/7
- Having Ninja speed and skills
- All of the above
Or... you can prove yet again that brain 0wn$ muscle and learn Ruby programming.
The following script is fairly simple and should be a good example of several Mechanize features. It basically authenticates to the website with the given user credentials, then tries to select a given schedule. If it fails, it sleeps 1s (remember, places are all gone in less than 2s for the popular ones...) and tries again.
When a schedule is available, the script still has to choose a sub-group. So it enters Ninja auto-fire mode and quickly tries every group available in the form's select list. As soon as one succeeds, it validates the choice and stops. There are also a few sanity checks in the script to avoid selecting the wrong schedule/group in case an attempt fails or the server tries to fool automation...
I introduced a few bugs on purpose in this script to prevent script kiddies from my school from reusing it as is; it is given here solely to illustrate the subject matter. I'm making a wild guess that anyone smart enough to figure out how to install the Ruby environment and dependencies, fix the bugs I introduced and run it successfully would probably be skilled enough to write it themselves. So no real harm is done.
Here is the script (with anti-script-kiddie bugs). I hope it is self-documenting enough to be understood; drop me a comment otherwise.

    #! /usr/bin/env ruby

    require 'rubygems'
    require 'pp'
    require 'mechanize'

    # Print the schedule currently registered for the logged-in user
    def print_curchoice(agent, link)
      page = agent.click link
      curchoice = page.search("//div[@id='page']/p/b").text
      puts "Current choice: #{curchoice}"
    end

    # Try to register for schedule SP<sp>; returns true on success
    def select_sp(agent, link, sp)
      page = agent.click link
      form = page.forms.first
      form.section = "SP#{sp}"
      puts "Selecting SP#{sp}..."
      page = agent.submit(form, form.buttons.first)
      error = page.search("//div[@id='page']/p[@class='erreur']").first
      return false unless error
      form = page.forms.first
      # Iterate through the group until we can find a successful one
      for option in form.field_with(:name => 'group').options do
        option.select
        grp = form.groupe
        puts "Selecting group #{grp}..."
        result = agent.submit(form, form.buttons.first)
        section = result.search("//div[@id='egap']/p/b").text
        # Sanity check, just in case...
        unless section && section == "SP#{sp}#{grp}"
          puts "Ooops..."
          next
        end
        puts "Validation..."
        result = agent.submit(result.forms.first, result.forms.first.button_with('bdeclare'))
        msg = result.search("//div[@id='page']/p[@class='msg']").text
        # The site answers in French: "Your registration in group ... has been recorded."
        unless msg && msg == "Votre inscription au groupe SP#{sp}#{grp} a été enregistrée."
          puts "Ooops..."
          next
        end
        puts "Epic Win!"
        return true
      end
      return false
    end

    if ARGV.size != 3
      puts "========================="
      puts "SPSelect by ChoJin © 2009"
      puts "========================="
      puts "./spselect.rb <login> <password> <SP number>"
      puts "example: to register to the section SP1 with username 'user1' and password 'pass1':"
      puts "./spselect.rb user1 pass1 1"
      exit 1
    end

    # Releases before Mechanize 1.0 exposed this class as WWW::Mechanize.
    agent = Mechanize.new { |a|
      a.user_agent_alias = 'Mac Firefox'
    }

    page = agent.get('https://www.biomedicale.univ-paris5.fr/scola/scola/')
    connform = page.form('connexion')
    connform.user = ARGV[0]
    connform.pass = ARGV[1]
    page = agent.submit(connform, connform.buttons.first)

    link_choice = page.link_with(:text => 'Choix semestre 1')
    link_cur_choice = page.link_with(:text => 'Mon groupe semestre 1')

    print_curchoice(agent, link_cur_choice)

    # Keep trying once per second until a registration goes through
    while true
      ret = select_sp(agent, link_choice, 8)
      break if ret == true
      sleep(1)
    end

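The retry loop at the end of the script is a pattern worth keeping in your toolbox, so here is a stand-alone sketch of it with no Mechanize involved. The poll_until helper and its max_attempts cap are my own additions for illustration (the script above simply loops forever), and the block stands in for a call such as select_sp.

```ruby
# Call the block until it returns a truthy value, pausing between attempts.
# Returns the number of attempts it took, or nil if the cap was reached.
# max_attempts is a hypothetical safety cap that the original script does not have.
def poll_until(interval: 1, max_attempts: nil)
  attempts = 0
  loop do
    attempts += 1
    return attempts if yield(attempts)
    return nil if max_attempts && attempts >= max_attempts
    sleep(interval)
  end
end

# Stand-in for select_sp: pretend a place opens up on the third try.
tries = poll_until(interval: 0.01, max_attempts: 10) { |n| n >= 3 }
puts "Succeeded after #{tries} attempts"
```

A cap like this is also the polite thing to do: hammering a server once per second forever is a good way to get noticed by its admins.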
That's all folks!