Wednesday, 20 June 2012

Scraping with Groovy (I): XmlSlurper

There are times when a man has to scrape web pages, and when those times come... you'd better have Groovy.

Here I'm gonna write about some things I've learned about scraping with Groovy. I will use core functionalities like XmlSlurper and GPath, and some more advanced tools like Geb.

Right now I'm following the UEFA EURO 2012 (Go Spain!!) and I want to get the final list of the teams playing the quarter-finals. I already know the URL, so I just have to parse the page and take the list from the first table. In this post I'm gonna explain how to use XmlSlurper to do so.

Time to analyze:

The first thing I have to do is locate the HTML where the results are. Well, open Firebug (Firefox) or "Inspect Element" in Safari, and inspect the HTML. In this case we have the following structure:

  • (1) All phases are grouped inside div elements with class "box boxSimple", so I need the first one
  • (2) Inside that div we have to look for a list (ul element)
  • (3) Every element in the list corresponds to a team. I need the name of each team, which is stored in the @title attribute of the first anchor inside each list element.

Before coding:

We are gonna use XmlSlurper to parse an HTML page, but there's one problem: XmlSlurper only parses XML, and real-world HTML is rarely well-formed XML. Don't worry, to fix that you need the NekoHTML parser. It cleans up an HTML page so that it can be read as well-formed XML. Then you'll be able to parse the page and retrieve the information we are looking for.
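
To make that problem concrete, here is a minimal sketch of what Neko does for us. The HTML fragment is made up for illustration (it is not the UEFA page), and I'm assuming a Groovy version where groovy.util.XmlSlurper still exists (in Groovy 4 the class lives in groovy.xml instead).

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper

// Hypothetical fragment: the <li> and <p> tags are never closed
def sloppyHtml = '<ul><li>Spain<li>Italy</ul><p>Quarter finals'

// A plain XmlSlurper refuses this input because it is not well-formed XML
def rejected = false
try {
    new XmlSlurper().parseText(sloppyHtml)
} catch (org.xml.sax.SAXException e) {
    rejected = true
}

// With the Neko SAX parser the tags get balanced before XmlSlurper sees them
def page = new XmlSlurper(new SAXParser()).parseText(sloppyHtml)
def items = page.'**'.findAll { it.name() == 'LI' }.collect { it.text() }

println "plain XmlSlurper rejected the input: ${rejected}"
println items
```

The same constructor call works with parse(...) when the input is a URL or a file instead of a String.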

Let's code:

I've used groovyConsole to do it. Pay attention to the dependency.

    @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
    import org.cyberneko.html.parsers.SAXParser
    import groovy.util.XmlSlurper

    /* Getting the xhtml page thanks to Neko SAX parser */
    /* url is the address of the page to scrape */
    def imdbPageXml = new XmlSlurper(new SAXParser()).parse(url)
    def teams =
        /* (1) Go to the first <div class='box boxSimple'> */
        imdbPageXml.'**'.find{ it.name() == 'DIV' && it.@class=='box boxSimple' }.
            /* (2) Follow the path */
                /* (3) For each element in the list get the title of the first anchor */
                li.'**'.find{ it.name() == 'A' }*.@title

    println teams

Once XmlSlurper returns a navigable instance of GPathResult, the object can be traversed using GPath:
  • Go to the first div with class 'box boxSimple'
imdbPageXml.'**'.find{ it.name() == 'DIV' && it.@class=='box boxSimple' }.

To avoid spelling out the whole path to the element I'm looking for, I started with '**' and then a query. That line means "Look everywhere in the document and take the first element with name div and having the attribute class with value 'box boxSimple'". Remember, '**' can be really useful.

I use find instead of findAll to get only the first element. Also notice that all HTML elements are referred to in upper case, because NekoHTML upper-cases element names by default.
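
Here is a small sketch of the difference between find and findAll after a '**' search. The markup is invented for the example (only the class name comes from the real page), and the isBox closure is just a helper name I made up.

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper

// Hypothetical markup with two matching divs and one that does not match
def html = '''
<html><body>
  <div class="box boxSimple"><h3>Quarter-finals</h3></div>
  <div class="box boxSimple"><h3>Group stage</h3></div>
  <div class="footer">footer</div>
</body></html>
'''

def page = new XmlSlurper(new SAXParser()).parseText(html)

// Element names are upper case because Neko upper-cases them by default
def isBox = { it.name() == 'DIV' && it.@class == 'box boxSimple' }

def first = page.'**'.find(isBox)      // only the first match, in document order
def all   = page.'**'.findAll(isBox)   // every match

println first.H3.text()
println all.size()
```

'**' walks the document depth-first, so "first" here means first in document order.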
  • Follow the path to the elements of the list

This path is very short and I didn't know how to make it any shorter, so I wrote the path explicitly.
  • For each element get the title of the first anchor
li.'**'.find{ it.name() == 'A' }*.@title

"Everywhere inside this 'li' element look for an element with name 'a'. Then get its @title attribute value."

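Putting the three steps together, this is a sketch that runs against an inline page instead of the real one. The markup and the team links are made up, and the UL.LI path is my assumption of what "a list inside the div" looks like; the real page may nest things differently.

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper

// Hypothetical page: two boxes, only the first one holds the teams we want
def html = '''
<html><body>
  <div class="box boxSimple">
    <ul>
      <li><a href="/spain" title="Spain">ESP</a> <a href="/spain/squad" title="Squad">squad</a></li>
      <li><a href="/france" title="France">FRA</a></li>
    </ul>
  </div>
  <div class="box boxSimple">
    <ul><li><a href="/groups" title="Group A">A</a></li></ul>
  </div>
</body></html>
'''

def page = new XmlSlurper(new SAXParser()).parseText(html)

// (1) First div with class 'box boxSimple'
def firstBox = page.'**'.find { it.name() == 'DIV' && it.@class == 'box boxSimple' }

// (2) Follow the (assumed) path to the list items,
// (3) and for each one take the title of its first anchor
def teams = firstBox.UL.LI.collect { li ->
    li.'**'.find { it.name() == 'A' }.@title.text()
}

println teams
```

Calling text() on the attribute gives plain Strings, which are easier to compare or store than the GPathResult objects.
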
Bottom line:

  • If you're scraping web services returning well-formed XML, use XmlSlurper on its own
  • If you're scraping simple pages and you're not going to interact with them, use XmlSlurper with the Neko parser to make sure the page is converted to XML
  • If you're going to interact with the page and still want to do it with Groovy, then I'd recommend Geb
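
For the first case no extra dependency is needed. A minimal sketch, assuming a service that returns something like the XML below (the document and its attribute names are invented for the example):

```groovy
import groovy.util.XmlSlurper

// Hypothetical well-formed response from a web service
def xml = '''<teams>
  <team name="Spain" group="C"/>
  <team name="Italy" group="C"/>
  <team name="Germany" group="B"/>
</teams>'''

def root = new XmlSlurper().parseText(xml)

// Without Neko, element names keep the case they have in the document
def groupC = root.team.findAll { it.@group == 'C' }.collect { it.@name.text() }

println groupC
```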
