Wednesday, 20 June 2012

Scraping with Groovy (II): Geb

Geb's web site defines it as "very groovy browser automation… web testing, screen scraping and more".

I've been using Geb mostly for functional testing (or more accurately, for acceptance testing), but it can also be a very powerful weapon for scraping web pages. We saw how XmlSlurper is enough for scraping simple pages when no interaction is needed. But things normally get complicated.

Geb can use your favorite browser engine underneath to scrape really complicated pages with JavaScript, CSS, etc. But if you like, you can also use HtmlUnit as your browser engine. This way you can execute your script in any environment (no need to install a browser everywhere). That said, it's also true that HtmlUnit has some limitations, especially if the page you are scraping uses JavaScript in a very intensive way.
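If you run Geb from a project rather than a single script, the driver is normally picked in a GebConfig.groovy file on the classpath. A minimal sketch (assuming the same selenium-htmlunit-driver artifact used below):

```groovy
// GebConfig.groovy - Geb picks this file up automatically from the classpath
import org.openqa.selenium.htmlunit.HtmlUnitDriver

driver = {
    def htmlUnit = new HtmlUnitDriver()
    htmlUnit.javascriptEnabled = true   // turn JS on only if the page needs it
    htmlUnit
}
```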

Geb uses a really simple DSL and a jQuery-like syntax (with the $ character) for traversing HTML pages. This can be very handy when looking for a given DOM node: if you know how to get it with jQuery, the solution with Geb will be pretty much the same (read the documentation anyway).
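For instance, these jQuery-style lookups (the selectors themselves are invented for illustration) all work inside a Browser.drive block:

```groovy
// inside a Browser.drive { ... } block; the selectors are made up
$("div.message")              // CSS selector, just like jQuery's $("div.message")
$("#main-content")            // lookup by id
$("input", name: "q")         // tag name plus attribute predicate
$("a", text: "Next page")     // match on the element's text
$("div.results").find("li")   // chainable traversal, again like jQuery
```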

As this is a really short example there's no need for a higher abstraction: just get a page, traverse it and extract the relevant elements. But when trying to do the same across several pages it can become a mess. Geb solves those situations with the "Page Object Pattern".
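As a taste of that pattern (the page class and content names here are invented), a Geb page object encapsulates the URL and the interesting elements so the drive block stays readable:

```groovy
import geb.Browser
import geb.Page

// hypothetical page object for the plugin portal's search form
class PluginSearchPage extends Page {
    static url = "http://grails.org/plugins/"
    static content = {
        searchField { $("input", name: "q") }
        searchButton { $("input", class: "searchButton") }
    }
}

Browser.drive {
    to PluginSearchPage          // navigates to the page's url
    searchField.value("geb")     // content definitions read like properties
    searchButton.click()
}
```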

The following code goes to the Grails plugin portal and looks for plugins related to the term "Geb". For this example I've used the latest Geb release (0.7.0). It's just a groovyConsole script: you can copy & paste it and run it (dependencies are declared with Grapes).

@Grapes([
    @Grab("org.codehaus.geb:geb-core:0.7.0"),
    @Grab("org.seleniumhq.selenium:selenium-htmlunit-driver:2.23.1"),
    @Grab("org.seleniumhq.selenium:selenium-support:2.23.1")
])
import geb.Browser

Browser.drive {
    // driver.webClient.javaScriptEnabled = true
    go "http://grails.org/plugins/"

    // fill in the search form and submit it
    $("input", name: "q").value("geb")
    $("input", class: "searchButton").click()

    // wait up to 10 seconds for the results to show up
    waitFor(10) { $("div", class: "currentPlugin") }

    // collect the name of every plugin found
    def pluginNames = $("div", class: "currentPlugin").collect { div ->
        div.find("h4").find("a").text()
    }

    println pluginNames
}
    

  • Start the script inside the DSL and go to the page where we want to start our scraping work. This time I've left JavaScript disabled (uncommenting the first line would enable it) because the form submission works without it.

Browser.drive {
    // driver.webClient.javaScriptEnabled = true
    go "http://grails.org/plugins/"
    //...
}

  • Look for the relevant elements of the form. Then put some value into the text input and click the form's button.

$("input", name: "q").value("geb")
$("input", class: "searchButton").click()

  • At this point the "browser" is sending the request and getting the response. We want to be sure the response has been received before going on. In this case I wait up to 10 seconds for the given div to show up (the default is 5 seconds).

waitFor(10) { $("div", class: "currentPlugin") }
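waitFor accepts a few more forms; the variants below (with the same condition) are all part of Geb's waiting support:

```groovy
// default timeout (5 seconds), polling the condition every 0.1 seconds
waitFor { $("div", class: "currentPlugin") }

// custom timeout of 10 seconds
waitFor(10) { $("div", class: "currentPlugin") }

// 10 second timeout, checking the condition every 0.5 seconds
waitFor(10, 0.5) { $("div", class: "currentPlugin") }
```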

  • Once the page is ready we want to get the list of the retrieved plugins' names.

def pluginNames = $("div", class: "currentPlugin").collect { div ->
    div.find("h4").find("a").text()
}

println pluginNames
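The same extraction can be written more tersely with a single CSS descendant selector and Groovy's spread operator, which Geb's navigator supports:

```groovy
// equivalent one-liner: select the anchors directly, then spread text()
def pluginNames = $("div.currentPlugin h4 a")*.text()
```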

Bottom line, I would use Geb:
  • For testing
  • For scraping really complicated pages in an elegant way thanks to its DSL
1 comment:

  1. The grails plugins site has changed.

    So:

    import geb.Browser

    Browser.drive {
        // driver.webClient.javaScriptEnabled = true
        go "http://grails.org/plugins/"

        $("input", name: "q").value("geb")
        $("i", class: "icon-search").click()

        waitFor(10) { $("section", class: "plugins") }

        def pluginNames = $("article", class: "plugin").collect { article ->
            article.find("header").find("h3").find("a").text()
        }

        println pluginNames
    }
