desmontandojava: xml

Wednesday, 10 April 2013

Groovy Xml Series: JAXB

Although the Groovy way of dealing with XML is simply great, sometimes it's nice to use JAXB for things such as schema validation, or for stating explicitly which fields take part in the process and which don't.

This is not going to be an in-depth entry about JAXB, especially when there are hundreds of blog entries and even books covering the topic widely.

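I'm going to create a very simple example that unmarshals an XML document into object instances. So here we have the XML source:

    <response>
        <value>
            <books>
                <book id="1">
                    <title>Don Xijote</title>
                    <author id="1">Manuel De Cervantes</author>
                </book>
                <book id="2">
                    <title>Catcher in the Rye</title>
                    <author id="2">JD Salinger</author>
                </book>
                <book id="3">
                    <title>Alice in Wonderland</title>
                    <author id="3">Lewis Carroll</author>
                </book>
                <book id="4">
                    <title>Don Xijote</title>
                    <author id="4">Manuel De Cervantes</author>
                </book>
            </books>
        </value>
    </response>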

Before doing anything else, let me remind you that the code is available at Github.

Now we need to create the classes that map the XML fragments I'm interested in.

Author

    package github.groovy.xml.jaxb

    import javax.xml.bind.annotation.*

    @XmlRootElement(name="author")
    @XmlAccessorType(XmlAccessType.FIELD)
    class Author{

        @XmlAttribute Long id
        @XmlElement String name

    }

Book

    package github.groovy.xml.jaxb

    import javax.xml.bind.annotation.*

    @XmlRootElement(name="book")
    @XmlAccessorType(XmlAccessType.FIELD)
    class Book{

        @XmlAttribute Long id
        @XmlElement String title
        @XmlElement Author author

    }

Basically I'm using the following annotations:

@XmlRootElement: Lets the object be marshalled directly, without having to wrap it in another object carrying this annotation. Only objects annotated with it can be marshalled directly.
@XmlAccessorType(XmlAccessType.FIELD): Because we're using Groovy, and there is "a lot of magic" at runtime, you should tell JAXB to stick to the fields; otherwise it will try to map crazy things you never declared (for instance the metaClass property every Groovy object exposes).
@XmlElement: Tells JAXB which fields map to child elements.
@XmlAttribute: Tells JAXB which fields map to attributes.

Unmarshalling (XML --> Object)

Let's say we have an XML document full of books and authors and we want to unmarshal them into objects. If we want to unmarshal the XML above and get only books and authors, we first have to pre-process it to get rid of the wrapping response/value tags, and only then unmarshal the remaining XML.

import javax.xml.bind.*
import groovy.xml.XmlUtil // needed for XmlUtil.serialize(...) below

//...Omitted code

      def "Unmarshalling the first book with boilerplate code"(){
          setup: "Building the unmarshaller"
              def jaxbContext= JAXBContext.newInstance(Book)
              def unmarshaller = jaxbContext.createUnmarshaller()
          and: "Filtering the xml"
              def response = new XmlSlurper().parse(xmlFile) 
          and: "Getting only the first book"
              def firstBook = response.'**'.find{it.name() =='book'}
              def jaxbSource= XmlUtil.serialize(firstBook)
              def newXmlSource = new StringReader(jaxbSource)
          when: "Unmarshalling the source" 
              def jaxbBook = unmarshaller.unmarshal(newXmlSource)
          then: "We make sure the conversion took place"
              jaxbBook instanceof Book        
          and: "Checking Book properties"
              jaxbBook.title 
              jaxbBook.author
              jaxbBook.author.id
      }

I've created a utility class to avoid most of the repeated code when creating a Marshaller/Unmarshaller:

package github.groovy.xml.jaxb

import javax.xml.bind.*
import github.groovy.xml.util.ResourcesUtil

/**
 * This class helps us to handle marshalling and unmarshalling of JAXB objects
**/
class JaxbUtils {

 // Keep the object itself: the marshaller needs the instance, not only its class
 def object2Marshal
 def source2Unmarshall 
 
 def marshal(object){
  object2Marshal = object
  this
 }

 def to(File file){
  throwIfNull(object2Marshal)
  throwIfNull(file)  

  buildMarshaller(object2Marshal.getClass()).marshal(object2Marshal,file)
 }

 def unmarshal(source){
  source2Unmarshall = source 
  this
 }

 def to(Class anyType){
  throwIfNull(source2Unmarshall)
  throwIfNull(anyType)

  buildUnmarshaller(anyType).unmarshal(source2Unmarshall)
 }

 def buildMarshaller(Class type){
  def jaxbContext= JAXBContext.newInstance(type)
  def marshaller = jaxbContext.createMarshaller()

  marshaller
 }

 def buildUnmarshaller(Class type){
  def jaxbContext= JAXBContext.newInstance(type)
  def unmarshaller = jaxbContext.createUnmarshaller()

  unmarshaller
 }


 def throwIfNull(value,message="You should have provided any value"){
  if (!value){
   throw new Exception(message)
  }
 }

}

So the previous code becomes:

      def "Unmarshalling the first book the good way"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          and: "Getting only the first book"
              def firstBook = response.'**'.find{it.name() =='book'}
              def jaxbSource= XmlUtil.serialize(firstBook)
          when: "Convert to Jaxb object"
              def newXmlSource = new StringReader(jaxbSource)
              def jaxbBook = unmarshal(newXmlSource).to(Book)
          then: "We make sure the conversion took place"
              jaxbBook instanceof Book
          and: "Checking Book properties"
              jaxbBook.title
              jaxbBook.author
              jaxbBook.author.id
      }

Marshalling (Object --> XML)

I was a little bit lazy and I didn't build any example for marshalling an object to XML. Maybe I will update this entry later on with a working example.

Resources

JAXB Official Page: http://jaxb.java.net
I found this entry, which uses Groovy, JAX-RS and JAXB together, highly advisable: http://weblogs.java.net/blog/lamineba/archive/2012/05/20/building-restful-web-services-jax-rs-jaxb-and-groovy

Tuesday, 9 April 2013

Groovy Xml Series: Index

So far I've written a couple of entries about Groovy and XML. Although I've taken care of tagging them so that they can be found easily, I thought it would be better to have an index with all the entries together.

This way, whenever I add a new related entry, there will be a common place to look for this topic.

Groovy Xml Series: Templating

I guess in many web applications there are times when you return the same XML structure with only small changes each time. So you would be looking for a general template with some placeholders to substitute at runtime.

How to do this with Groovy: groovy.text.XmlTemplateEngine to the rescue :-)

The following example was inspired by an entry by MrHaki on the topic. The first thing we have to build is the template.

Template

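     def tpl = '''
          <response xmlns:gsp='http://groovy.codehaus.org/2005/gsp'>
              <value>
                  <books>
                      <gsp:scriptlet>books.eachWithIndex{book,index-></gsp:scriptlet>
                          <book id="${index}">
                              <title>${book.title}</title>
                              <author id="${index}">
                                  <gsp:expression>book.author</gsp:expression>
                              </author>
                          </book>
                      <gsp:scriptlet>}</gsp:scriptlet>
                  </books>
              </value>
          </response>
      '''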

Several things to take into account when creating the template:

Notice that I've used single quotes. If I had used double quotes I'd have been in trouble at runtime, because Groovy would have tried to substitute the placeholders beforehand.
Adding the "gsp" namespace: With this namespace declared we'll be able to use the scriptlet and expression tags. I'll explain both later.
Groovy expressions: Just as with GString instances, we can use expressions like ${index}. Notice, though, that they are evaluated by the template engine when the template is made, not by Groovy when the string is created.
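The point about quotes can be checked in isolation. This is a minimal sketch, not part of the original project; the title variable is made up for the illustration. The template itself uses triple single quotes, which behave like the single-quoted string below.

```groovy
def title = 'Don Xijote'   // hypothetical variable, only for this illustration

// Single quotes: a plain java.lang.String, the placeholder is kept verbatim
def single = 'Title: ${title}'

// Double quotes: a GString, the placeholder is resolved right away
def interpolated = "Title: ${title}"

assert single instanceof String
assert interpolated instanceof GString
assert single == 'Title: ${title}'
assert interpolated == 'Title: Don Xijote'
```

Because the single-quoted version still contains the literal ${...} text, the template engine gets the chance to resolve it later with its own bindings.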

Scriptlets

Basically we're going to use the <gsp:scriptlet> tag when trying to build complex structures or looping through a set of values. In the example:

    <gsp:scriptlet>books.eachWithIndex{book,index-></gsp:scriptlet>
        // Nested code
    <gsp:scriptlet>}</gsp:scriptlet>

Here we were looping through a list of books. Notice how the variables created in that snippet (book and index) are accessible inside the scope of the tag.

Expressions

According to the documentation, the <gsp:expression> tag should be used for code fragments which produce output. In this example we want to add the book's author as the text value of the author's node.

       <author id="${index}">
           <gsp:expression>book.author</gsp:expression>
       </author>

Groovy Expressions

We can use ${expression} and $variable to add expressions to the document.

   <title>${book.title}</title>

Invoking XmlTemplateEngine

Finally we should put all pieces together and set all variables in the template. For that we will be invoking the XmlTemplateEngine instance and passing a map with the variables we want to substitute in the template.

This is a method taken from a Spock specification created for testing the XmlTemplateEngine functionality.

     def "Creating a new xml using the template and some bindings"(){
          setup: "Building some data"
              def books = (1..10).collect{
                  [title:"Book${it}",author:"Author${it}"]
              }
          when: "Creating the engine instance and compiling the template"
              def engine = new XmlTemplateEngine()
              def templ = engine.createTemplate(tpl)
          and: "Binding the values with the template"
              def bindings = [books:books]
          and: "Parsing the template with the included bindings"
              def writable = templ.make(bindings)
           /* I want to see the output in the test report */
              println writable
          and: "Parsing the result to check the outcoming xml"
              def response = new XmlSlurper().parseText(writable.toString())
          then: "We should have a document with 10 books"
              response.value.books.book.size() == 10
      }

See how first we create a template instance from the XmlTemplateEngine instance.

      def engine = new XmlTemplateEngine()
      def templ = engine.createTemplate(tpl)

Then we pass the books as a parameter to the template. That will give us a writable instance.

      def bindings = [books:books]
      def writable = templ.make(bindings)

At the end we only parse the document again for testing purposes.

      def response = new XmlSlurper().parseText(writable.toString())
      response.value.books.book.size() == 10
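For anyone who wants to try the steps above outside a Spock specification, here is a reduced, self-contained sketch that can be pasted into the GroovyConsole. It is not the project's code: the template is shortened (no index, no value wrapper) and the two books are made up.

```groovy
import groovy.text.XmlTemplateEngine
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

// Shortened, hypothetical template: single quotes keep ${...} away from Groovy itself
def tpl = '''
<response xmlns:gsp="http://groovy.codehaus.org/2005/gsp">
  <books>
    <gsp:scriptlet>books.each { book -></gsp:scriptlet>
      <book>
        <title>${book.title}</title>
        <author><gsp:expression>book.author</gsp:expression></author>
      </book>
    <gsp:scriptlet>}</gsp:scriptlet>
  </books>
</response>
'''

def books = [[title: 'Book1', author: 'Author1'],
             [title: 'Book2', author: 'Author2']]

// createTemplate compiles the template, make binds the variables
def writable = new XmlTemplateEngine().createTemplate(tpl).make(books: books)
def output = writable.toString()
println output

// Parse the result back only to check it; the engine pretty-prints, hence trim()
def parsed = new XmlSlurper().parseText(output)
assert parsed.books.book.size() == 2
assert parsed.books.book[1].title.text().trim() == 'Book2'
assert parsed.books.book[1].author.text().trim() == 'Author2'
```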

Resources

Entry Github location: https://github.com/mariogarcia/xmlgroovy
Groovy Templating: http://groovy.codehaus.org/Groovy+Templates
XmlTemplateEngine: http://groovy.codehaus.org/api/groovy/text/XmlTemplateEngine.html
Mr Haki Sample: http://mrhaki.blogspot.ie/2009/10/groovy-goodness-using-template-engines.html

Monday, 8 April 2013

Groovy Xml Series: Manipulating Xml

The Xml

In this entry I would like to review the different ways of adding / modifying / removing nodes using XmlSlurper or XmlParser. The xml we are going to be handling is the following:

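      def xml = """
          <response>
              <value>
                  <books>
                      <book id="1">
                          <title>Don Xijote</title>
                          <author id="1">Manuel De Cervantes</author>
                      </book>
                  </books>
              </value>
          </response>
      """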

Adding nodes

The main difference between XmlSlurper and XmlParser is that when the former creates nodes, they won't be available until the document has been evaluated again, so you have to parse the transformed document again in order to see the new nodes. Keep that in mind when choosing between the two approaches.

If you need to see a node right after creating it, then XmlParser should be your choice; but if you're planning to make many changes to the XML and send the result to another process, XmlSlurper may be more efficient.

You can't create a new node directly using the XmlSlurper instance, but you can with XmlParser. The way of creating a new node from XmlParser is through its method createNode(..)

        def "Adding a new tag to a node"(){
          setup: "Building an instance of XmlParser"
              def parser = new XmlParser()
          and: "Parsing the xml"
              def response = parser.parseText(xml)
          when: "Adding a tag to response"
              def numberOfResults = parser.createNode(
                  response,
                  new QName("numberOfResults"),
                  [:]
              )   
          and: "Setting the node's value"
              numberOfResults.value = "1"
          then: "We should be able to find the new node"
              response.numberOfResults.text() == "1"
      }

The createNode() method receives the following parameters:

parent node (could be null)
The qualified name for the tag (In this case we only use the local part without any namespace)
A map with the tag's attributes (None in this particular case)
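The three parameters can be seen at work in a small standalone script. This is only a sketch: it passes a plain String as the name instead of a QName (createNode accepts any Object there), and the unit attribute is invented to show the attribute map in use.

```groovy
import groovy.xml.XmlParser   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

def parser = new XmlParser()
def response = parser.parseText('<response><value/></response>')

// createNode(parent, name, attributes)
//   parent     -> the node that will own the new tag
//   name       -> a plain String here instead of a QName
//   attributes -> 'unit' is a hypothetical attribute, only for this illustration
def created = parser.createNode(response, 'numberOfResults', [unit: 'books'])
created.value = '1'

// With XmlParser the new node is visible right away
assert response.numberOfResults.text() == '1'
assert response.numberOfResults[0].@unit == 'books'
assert response.children().size() == 2
```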

Anyway you won't normally be creating a node from the parser instance but from the parsed Xml instance. That is from a Node or a GPathResult instance.

Take a look at the next example. We are parsing the xml with XmlParser and then creating a new node from the parsed document's instance (Notice the method here is slightly different in the way it receives the parameters):

      def "Adding a new tag to a node with the node instance"(){
          setup: "Building an instance of XmlParser"
              def parser = new XmlParser()
          and: "Parsing the xml"
              def response = parser.parseText(xml)
          when: "Appending the tag to the current node"
              response.appendNode(
                  new QName("numberOfResults"),
                  [:],
                  "1"
              )
          then: "We should be able to find it"    
              response.numberOfResults.text() == "1"
      }

When using XmlSlurper, GPathResult instances don't have a createNode() method.
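What a GPathResult does offer is appendNode(), which takes a builder-style closure. The sketch below is my own illustration, not code from the project; it assumes the appended node stays invisible to queries until the document is rebuilt and parsed again, which is the lazy behaviour described above.

```groovy
import groovy.xml.StreamingMarkupBuilder
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

def response = new XmlSlurper().parseText('<response><value/></response>')

// The closure uses the same syntax as the markup builders
response.appendNode { numberOfResults('1') }

// Lazy: the new node is recorded but cannot be queried yet
assert response.numberOfResults.size() == 0

// Rebuild the document and parse it again to see the change
def rebuilt = new StreamingMarkupBuilder().bind { mkp.yield response }.toString()
def reparsed = new XmlSlurper().parseText(rebuilt)
assert reparsed.numberOfResults.text() == '1'
```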

Modifying / Removing nodes

We know how to parse the document and add new nodes; now I want to change a given node's content. Let's start with XmlParser and Node. This example replaces the first book with a different book.

     
     def "Replacing a node"(){
          setup: "Building the parser and parsing the xml"
              def response = new XmlParser().parseText(xml)
          when: "Replacing the book 'Don Xijote' with 'To Kill a Mockingbird'"
           /* Use the same syntax as groovy.xml.MarkupBuilder */
              response.value.books.book[0].replaceNode{
                  book(id:"3"){
                      title("To Kill a Mockingbird")  
                      author(id:"3","Harper Lee")     
                  }
              }
          and: "Looking for the new node"
              def newNode = response.value.books.book[0]
          then: "Checking the result"
              newNode.name() == "book"        
              newNode.@id == "3"
              newNode.title.text() == "To Kill a Mockingbird"
              newNode.author.text() == "Harper Lee"
           /* newNode.author is a NodeList, so .@id yields a list of ids: take the first */
              newNode.author.@id.first() == "3"
      }

When using replaceNode() the closure we pass as parameter should follow the same rules as if we were using groovy.xml.MarkupBuilder (See resources section for more information):

tagName(attribute:attributeValue){
    nestedTag("stringcontent")
  /// etc
}
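To see what that syntax produces on its own, here is a small sketch that feeds the same closure shape to groovy.xml.MarkupBuilder and parses the output back. The writer and the final checks are only there for the illustration.

```groovy
import groovy.xml.MarkupBuilder
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

def writer = new StringWriter()

// tagName(attributes){ nested tags }; a trailing String becomes the text content
new MarkupBuilder(writer).book(id: '3') {
    title('To Kill a Mockingbird')
    author(id: '3', 'Harper Lee')
}

println writer

def book = new XmlSlurper().parseText(writer.toString())
assert book.name() == 'book'
assert book.@id.text() == '3'
assert book.title.text() == 'To Kill a Mockingbird'
assert book.author.text() == 'Harper Lee'
assert book.author.@id.text() == '3'
```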

Here is the same example with XmlSlurper:

      def "Replacing a node"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parseText(xml) 
          when: "Replacing the book 'Don Xijote' with 'To Kill a Mockingbird'"
           /* Use the same syntax as groovy.xml.MarkupBuilder */
              response.value.books.book[0].replaceNode{
                  book(id:"3"){
                      title("To Kill a Mockingbird")  
                      author(id:"3","Harper Lee")     
                  }          
              }              
          and: "Asserting the laziness"
              assert response.value.books.book[0].title.text() == "Don Xijote"
          and: "Rebuild the document"
           /* That mkp is a special namespace used to escape away from the normal building mode 
              of the builder and get access to helper markup methods 
              'yield', 'pi', 'comment', 'out', 'namespaces', 'xmlDeclaration' and 
              'yieldUnescaped' */
              def result = new StreamingMarkupBuilder().bind{mkp.yield response}.toString()
              def changedResponse = new XmlSlurper().parseText(result)
          then: "Looking for the new node"
              assert changedResponse.value.books.book[0].title.text() == "To Kill a Mockingbird"
      }

Notice how with XmlSlurper we have to parse the transformed document again in order to find the created nodes. In this particular example that could be a little bit annoying, couldn't it?

Finally, both parsers also use the same approach for adding a new attribute to a given node. Once more, what we want to find out is whether the new attribute is available right away or not. First XmlParser:

     def "Adding a new attribute to a node"(){
          setup: "Building an instance of XmlParser"
              def parser = new XmlParser()    
          and: "Parsing the xml"
              def response = parser.parseText(xml)
          when: "Adding an attribute to response"
              response.@numberOfResults = "1" 
          then: "We should be able to see the new attribute"
              response.@numberOfResults == "1"
      }

And XmlSlurper:

      def "Adding a new attribute to a node"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parseText(xml)
          when: "adding a new attribute to response"
              response.@numberOfResults = "2"
          then: "In attributes the node is accessible right away"
              response.@numberOfResults == "2"
      }

But, hold on a second! The XmlSlurper example didn't need to evaluate the transformation again, did it? You're right. When adding a new attribute, a new evaluation is not necessary with either parser.

Resources

Code at Github: https://github.com/mariogarcia/xmlgroovy
Using MarkupBuilder: http://groovy.codehaus.org/Creating+XML+using+Groovy's+MarkupBuilder
Groovy processing XML: http://groovy.codehaus.org/Processing+XML

Tuesday, 2 April 2013

Groovy Xml Series: Printing Xml

I started reviewing my knowledge of Groovy's xml capabilities due to an entry in the Groovy user's mailing list.

Somebody asked about how to print a GPath query result. Because I knew about the GPath API, I suggested building the output from the available methods. I mean, if you had a Node / GPathResult (depending on the parser you used to parse the xml: XmlParser / XmlSlurper respectively) representing

<node>value</node>

You could print that node with the following statement:

println "<${node.name()}>${node.text()}</${node.name()}>"

But luckily somebody mentioned we could do the same thing with the groovy.xml.XmlUtil class. It has several static methods to serialize an xml fragment from several types of sources (Node, GPathResult, String...)

Thus I've added a sample to my xmlgroovy project at Github to remember how I can for example print out the result of a GPath result. This is a slightly changed version in order to execute it on the GroovyConsole:

       

import groovy.xml.XmlUtil

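def xml = """
<response>
    <value>
        <books>
            <book id="1">
                <title>Don Xijote</title>
                <author id="1">Manuel De Cervantes</author>
            </book>
            <book id="2">
                <title>Catcher in the Rye</title>
                <author id="2">JD Salinger</author>
            </book>
            <book id="3">
                <title>Alice in Wonderland</title>
                <author id="3">Lewis Carroll</author>
            </book>
            <book id="4">
                <title>Don Xijote</title>
                <author id="4">Manuel De Cervantes</author>
            </book>
        </books>
    </value>
</response>
"""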
 
   def response = new XmlParser().parseText(xml)
   def nodeToSerialize = response.'**'.find{it.name() == 'author'}
   def nodeAsText = XmlUtil.serialize(nodeToSerialize)

   println nodeAsText

In this example I'm parsing the xml and printing out only the author's node I was interested in. And it looks like:

<?xml version="1.0" encoding="UTF-8"?><author id="1">Manuel De Cervantes</author>

Missing

XmlNodePrinter : If you are using Node instances because you've parsed the document with XmlParser you can use XmlNodePrinter. You can check out the entry of MrHaki's blog where he covered the topic.
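Until that gets covered here, this is a minimal sketch of the XmlNodePrinter route. It is my own illustration, with a cut-down document; unlike XmlUtil.serialize it prints the node indented and without an XML declaration.

```groovy
import groovy.xml.XmlNodePrinter   // Groovy 3+; on Groovy 2.x drop both imports (the classes live in groovy.util)
import groovy.xml.XmlParser

// Cut-down version of the sample document, only for this illustration
def response = new XmlParser().parseText('''
<response>
  <book id="1">
    <title>Don Xijote</title>
    <author id="1">Manuel De Cervantes</author>
  </book>
</response>
''')

def author = response.'**'.find { it.name() == 'author' }

// XmlNodePrinter writes to a PrintWriter; a StringWriter lets us capture the result
def writer = new StringWriter()
new XmlNodePrinter(new PrintWriter(writer)).print(author)
def printed = writer.toString()

println printed
assert printed.contains('<author id="1">')
assert printed.contains('Manuel De Cervantes')
```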

Monday, 1 April 2013

Groovy Xml Series: Querying Xml with GPath

The most common way of querying XML in Groovy is using GPath. The entry from the official page:

"GPath is a path expression language integrated into Groovy which allows parts of nested structured data to be identified. In this sense, it has similar aims and scope as XPath does for XML. The two main places where you use GPath expressions is when dealing with nested POJOs or when dealing with XML"

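So it's similar to XPath expressions and you can use it not only with XML but also with POJO classes. Ok, so let's begin.

Given the following xml:

    <response>
        <value>
            <books>
                <book id="1">
                    <title>Don Xijote</title>
                    <author id="1">Manuel De Cervantes</author>
                </book>
                <book id="2">
                    <title>Catcher in the Rye</title>
                    <author id="2">JD Salinger</author>
                </book>
                <book id="3">
                    <title>Alice in Wonderland</title>
                    <author id="3">Lewis Carroll</author>
                </book>
                <book id="4">
                    <title>Don Xijote</title>
                    <author id="4">Manuel De Cervantes</author>
                </book>
            </books>
        </value>
    </response>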

Node's text content

The first thing we are going to do is to get a value using POJO notation. Let's get the first book's author's name (Code is available at Github.).

      def "Using POJO notation: Getting a node using POJOs notation a.b.c"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parse(xmlFile) 
          when: "Trying to get a given node using the a.b.c notation"
              def authorNode = response.value.books.book[0].author
          then: "We can check the author's value"
              authorNode.text() == 'Manuel De Cervantes'
      }

So first we parse the document with XmlSlurper (xmlFile is a variable of type java.io.File), and then we have to treat the returned value as the root of the XML document, which in this case is "response".

That's why we start traversing the document from response and then follow value.books.book[0].author. Note that in XPath node indexes start at [1] instead of [0], but because GPath is Java-based it starts at index [0].

GPathResult (XmlSlurper) and Node (XmlParser)

In the end we'll have the instance of the "author" node and because we wanted the text inside that node we are going to call the text() method. The "author" node is an instance of GPathResult type and text() a method giving us the content of that node as a String.

When using GPath with an xml parsed with XmlSlurper we'll have as a result a GPathResult object. GPathResult has many other convenient methods to convert the text inside a node to any other type such as:

toInteger()
toFloat()
toBigInteger()
...

All these methods try to convert a String to a certain type.
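A quick sketch of those conversions, using a made-up fragment (the pages and price elements are not part of the sample document):

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

// Hypothetical fragment, only for this illustration
def book = new XmlSlurper().parseText(
    '<book id="3"><pages>352</pages><price>9.5</price></book>')

// text() always gives a String...
assert book.pages.text() == '352'

// ...while the toXXX() methods convert that String for us
assert book.@id.toInteger() == 3
assert book.pages.toBigInteger() == 352G
assert book.price.toFloat() == 9.5f
```

The conversions work the same way on attributes (book.@id) and on element content (book.pages).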

If we were using an XML parsed with XmlParser we would be dealing with instances of type Node. But still all the actions applied to GPathResult in these examples could be applied to a Node as well. The creators of both parsers took GPath compatibility into account.

Attribute's content

The next step is to get some values from a given node's attribute. In the following sample we want to get the first book's author's id. We'll be using two different approaches. Let's see the code first:

         def "Using POJO notation: Getting an attribute's value using POJOs notation a.b.c"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parse(xmlFile) 
          when: "Trying to get a given node using the a.b.c notation"
              def firstBook = response.value.books.book[0]
              def firstAuthorIdNode1 = firstBook.author.@id
              def firstAuthorIdNode2 = firstBook.author['@id']
          then: "Getting the id's value"
              firstAuthorIdNode1.toInteger() == 1
              firstAuthorIdNode2.toInteger() == 1
      }

Again we first parse the document and then, using the POJO notation, we get the first book node. Now take a look at these two expressions:

firstBook.author.@id
firstBook.author['@id']

I especially like the former notation because it is more straightforward and meaningful. The latter is more like using an instance of a map (which I guess it should be eventually).
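Both the text and the attribute lookups can be tried in the GroovyConsole with parseText instead of a file. A minimal sketch, with the sample document cut down to two books:

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

// Cut-down version of the sample document, only for this illustration
def response = new XmlSlurper().parseText('''
<response><value><books>
  <book id="1"><title>Don Xijote</title><author id="1">Manuel De Cervantes</author></book>
  <book id="2"><title>Catcher in the Rye</title><author id="2">JD Salinger</author></book>
</books></value></response>
''')

// Indexes are zero-based, unlike XPath's book[1]
def firstBook = response.value.books.book[0]
assert firstBook.author.text() == 'Manuel De Cervantes'

// Two equivalent ways of reaching an attribute
assert firstBook.author.@id.toInteger() == 1
assert firstBook.author['@id'].toInteger() == 1

assert response.value.books.book[1].title.text() == 'Catcher in the Rye'
```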

Speeding things up: "breadthFirst()" and "depthFirst()"

If you have ever used XPath, you have been using expressions like

"//" : Look everywhere
"/following-sibling::othernode" : Look for a node "othernode" in the same level

More or less we have their counterparts in GPath with the methods breadthFirst() and depthFirst(). The first example uses the '*' shorthand. It is often presented as the breadthFirst() shortcut, although strictly speaking it selects the direct children of a node (the same as calling children()), which is all we need here.

        def "Using '*': Getting a node using breadthFirst operator '*'"(){                                                              
          setup: "Parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          when: "Looking for the node having the name 'book'"
          and: "with attribute id equals to 2"
           /* You can use the breadthFirst operator to look among a group 
              of nodes at the same level */
              def catcherInTheRye = response.value.books.'*'.find{node-> 
               /* node.@id == 2 could be expressed as node['@id'] == 2 */
                  node.name() == 'book' && node.@id == '2'
              }
          then: "Getting the author's value"
              catcherInTheRye.title.text() == 'Catcher in the Rye'
      }

This Spock specification looks at the nodes one level below the "books" node, that is, its direct children, and returns the first one matching the expression inside the closure. A full breadthFirst() traversal would go on to the deeper levels only after exhausting the current one.

That expression says "Look for any node with a tag name equals 'book' and having an id with a value of '2'".
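As far as I can tell from the GPathResult sources, '*' maps to children() and '**' to depthFirst(), while breadthFirst() has no shorthand. The sketch below, on a tiny made-up tree, shows which nodes each one visits. I only check the set of nodes for breadthFirst(), since I'm not sure the exact visiting order is the same across Groovy versions.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

// Hypothetical tree: a has children b and c, and b has a child d
def doc = new XmlSlurper().parseText('<a><b><d/></b><c/></a>')

// '*' -> direct children only
assert doc.'*'.collect { it.name() } == ['b', 'c']

// '**' -> the node itself and every descendant, going deep before going wide
assert doc.'**'.collect { it.name() } == ['a', 'b', 'd', 'c']

// breadthFirst() -> same nodes, level by level (order not asserted here)
def visited = doc.breadthFirst().collect { it.name() }
assert (visited as Set) == (['a', 'b', 'c', 'd'] as Set)
```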

Today I woke up very lazy and I'd like to look for a given value without caring where it might be. The only thing I know is that I need the id of the author "Lewis Carroll" . How do I do that? using depthFirst()

        def "Using '**': Getting a node using depthFirst operator '**'"(){
          setup: "parsing the document"
              def response = new XmlSlurper().parse(xmlFile)                                                                              
          when: "Using the depthFirst operator we can look for something"
          and: "it doesn't matter how deep the node is"
          and: "Let's say we want to look for the book's id of the book written by Lewis Carroll"
           /* Beware of the name I used for the closure's parameter. It may look like 
              the ** is too smart, but it isn't. It's just that I'm sure only books will 
              match the query. To avoid any confusion I'd rather use 'node' */
              def bookId = response.'**'.find{book->
                  book.author.text() == 'Lewis Carroll'
              }.@id
          then: "The bookId should be 3"
              bookId == "3"
      }

It's definitely shorter than using the POJO notation, isn't it? depthFirst() is the same as looking for something "everywhere in the tree from this point down". In this case we've used the method find(Closure cl) to find just the first occurrence.

What if we want to collect all the books' titles?

      
     def "Using '**': Collecting all titles"(){
          setup: "parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          when: "Looking for all titles within the document"
              def titles = response.'**'.findAll{node-> node.name() == 'title'}*.text()
          then: "There should be only four"
              titles.size() == 4
      }

I've mentioned there are some useful methods that convert a node's value to an integer,float...etc. Those methods could be convenient when doing comparisons like this:

      def "Using findAll: Collecting all titles"(){
          setup: "parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          when: "Looking for all titles with an id greater than 2"
              def titles = response.value.books.book.findAll{book->
               /* You can use toInteger() over the GPathResult object */
                  book.@id.toInteger() > 2
              }*.title
          then: "There should be only two"
              titles.size() == 2
      }

In this case the number 2 has been hardcoded but imagine that value could have come from any other source (Gorm id's...etc)
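A sketch of that idea: the threshold becomes a parameter of a small closure. The closure name is made up for the illustration and the document is cut down to ids and titles.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (the class lives in groovy.util)

// Cut-down version of the sample document, only for this illustration
def response = new XmlSlurper().parseText('''
<response><value><books>
  <book id="1"><title>Don Xijote</title></book>
  <book id="2"><title>Catcher in the Rye</title></book>
  <book id="3"><title>Alice in Wonderland</title></book>
  <book id="4"><title>Don Xijote</title></book>
</books></value></response>
''')

// Hypothetical helper: minId could come from a database id, a request parameter, etc.
def titlesWithIdGreaterThan = { root, int minId ->
    def matches = root.value.books.book.findAll { it.@id.toInteger() > minId }
    matches.collect { it.title.text() }
}

assert titlesWithIdGreaterThan(response, 2) == ['Alice in Wonderland', 'Don Xijote']
assert titlesWithIdGreaterThan(response, 4) == []
```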

Resources

Groovy GPath page: http://groovy.codehaus.org/GPath
Github samples: https://github.com/mariogarcia/xmlgroovy

Thursday, 21 March 2013

Groovy Xml Series: Parsing Xml

This week I've been reviewing the way we deal with XML in Groovy. My curiosity was triggered by an entry in the Groovy mailing list about the use of XmlUtil. Then I decided I needed to review my knowledge of the Groovy API for handling XML.

First things first. The first thing we need to know to start playing with xml is to parse the xml. The two ways I've been practicing this week have been XmlSlurper and XmlParser both located in groovy.util package (Which means we don't have to import them).

Both have the same approach to parse an xml, create the instance and then use one of the parse(...) or parseText(String) methods available in both:

 
   def parsedByXmlSlurper = new XmlSlurper().parseText(xmlText)
   def parsedByXmlParser = new XmlParser().parseText(xmlText)

So what is the difference between them?

Well let's see the similarities first:

Both are based on SAX, so both have a low memory footprint
Both can update/transform the XML

Differences then:

XmlSlurper evaluates the structure lazily. So if you update the xml you'll have to evaluate the whole tree again.
XmlSlurper returns GPathResult when parsing an xml
XmlParser returns Node objects when parsing an xml
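The last two points are easy to check in the GroovyConsole. A minimal sketch with a one-book document; note that from Groovy 3 on both parsers are also available in the groovy.xml package, which is what the imports below assume.

```groovy
import groovy.xml.XmlParser    // Groovy 3+; on Groovy 2.x drop both imports (the classes live in groovy.util)
import groovy.xml.XmlSlurper

def xmlText = '<response><value><books><book><title>Don Xijote</title></book></books></value></response>'

def slurped = new XmlSlurper().parseText(xmlText)
def parsed  = new XmlParser().parseText(xmlText)

// getClass() is needed here: '.class' would be read as a lookup for a <class> child
assert slurped.getClass().simpleName == 'NodeChild'   // a GPathResult subtype
assert parsed instanceof Node                         // groovy.util.Node

// The same GPath expression works on both results
assert slurped.value.books.book[0].title.text() == 'Don Xijote'
assert parsed.value.books.book[0].title.text() == 'Don Xijote'
```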

When to use one or the other? Well, reading an entry at StackOverflow, some ideas came up:

If you want to transform an existing document into another one then XmlSlurper will be the choice, but if you want to update and read at the same time then XmlParser is the choice. The rationale behind this is that every time you create a node with XmlSlurper it won't be available until you parse the document again with another XmlSlurper instance.

Need to read just a few nodes? XmlSlurper is for you: "...I would say that the slurper is faster if you just have to read a few nodes, since it will not have to create a complete structure in memory"

So far my experience is that both classes work pretty much the same way. Even the way of using GPath expressions with them is the same (both use breadthFirst() and depthFirst() expressions). So I guess it depends on the write/read frequency.

So let's say we have the following document:

 
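    def xml = """
        <response>
            <value>
                <books>
                    <book>
                        <title>Don Xijote</title>
                        <author>Manuel De Cervantes</author>
                    </book>
                </books>
            </value>
        </response>
    """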

I've created a couple of Spock specs that parse the document and run a simple query against it. First using XmlParser:

def "Parsing an xml from a String"(){
          setup: "The parser"
              def parser = new XmlParser()    
          when: "Parsing the xml"
              def response = parser.parseText(xml)
          then: "Checking the xml's content"
              response.value.books.book[0].title.text() == "Don Xijote"
      }

And then using XmlSlurper:

def "Parsing xml from a String"(){
          setup: "Creating an instance of XmlSlurper"
              def parser = new XmlSlurper()   
          when: "Parsing the xml as text"
              def responseNode = parser.parseText(xml)
          then: "You can check the xml's content"
              responseNode.value.books.book[0].title.text() == "Don Xijote"
      }

Can you see the difference? There is none apart from the name of the parser engine. In the next entry I will write about inserting, updating and deleting nodes with both XmlSlurper and XmlParser.

I almost forgot: all the code is at Github if you want to check it out!

Wednesday, 20 June 2012

Scraping with Groovy (II): Geb

The definition of Geb from its web site is "very groovy browser automation… web testing, screen scraping and more"

I've been using Geb mostly for functional testing (or, more accurately, for acceptance testing), but it can also be a very powerful weapon for scraping web pages. We saw how XmlSlurper is enough for scraping simple pages when no interaction is needed. But things normally get complicated.

Geb can use your favorite browser engine underneath to scrape really complicated pages with JavaScript, CSS, etc. But if you like, you can also use HtmlUnit as your browser engine. This way you can execute your script in any environment (no need to install a browser in each one). That said, HtmlUnit has some limitations, especially if the page you are scraping uses JavaScript intensively.

Geb uses a really simple DSL and a "jQuery-like" syntax (with the $ character) for traversing HTML pages. This can be very handy when looking for a given DOM node. If you know how to get it with jQuery, the solution with Geb will be pretty much the same (read the documentation anyway).

As this is a really short example, there's no need for a higher abstraction: we just get a page, traverse it, and pick the relevant elements. But doing the same across several pages quickly becomes a mess. Geb solves those situations with the "Page Object Pattern".

The following code goes to the Grails plugin portal and looks for plugins related to the term "Geb". For this example I've used the latest Geb release (0.7.0). It's just a groovyConsole script: you can copy and paste it and then run it (dependencies are declared with Grapes).

@Grapes([  
    @Grab("org.codehaus.geb:geb-core:0.7.0"),
    @Grab("org.seleniumhq.selenium:selenium-htmlunit-driver:2.23.1"),
    @Grab("org.seleniumhq.selenium:selenium-support:2.23.1")
])
import geb.Browser

Browser.drive{
    // driver.webClient.javaScriptEnabled = true
    go "http://grails.org/plugins/"      
                
    $("input", name: "q").value("geb")
    $("input", class: "searchButton").click()   
             
    waitFor(10){ $("div", class:'currentPlugin') }
        
    def pluginNames = $("div", class:'currentPlugin').collect{div->
        div.find("h4").find("a").text()    
    }
    
    println pluginNames
            
}

Start the script inside the DSL and go to the page where we want to start our scraping work. This time I've left JavaScript disabled (the line that enables it is commented out) because the form submission works without it.

Browser.drive{  
 // driver.webClient.javaScriptEnabled = true
go "http://grails.org/plugins/" 
//...
}

Look for the relevant elements of the form. Then put a value inside the text input and click the form button.

$("input", name: "q").value("geb")
    $("input", class: "searchButton").click()

At this moment the "browser" is sending the request and getting the response. We want to be sure the response has been received before querying the page again. In this case I wait up to 10 seconds for the given div to show up (the default is 5 seconds).

waitFor(10){ $("div", class:'currentPlugin') }
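
Conceptually, a wait like this is a polling loop: evaluate a condition, and if it is not truthy yet, sleep a little and try again until a timeout expires. The sketch below illustrates that idea in plain Groovy. It is not Geb's implementation, and pollUntil with its parameters is a hypothetical helper written only for this explanation.

```groovy
// Hypothetical helper, NOT part of Geb: polls a condition until it is truthy
def pollUntil(Number timeoutSeconds, Number intervalSeconds = 0.1, Closure condition) {
    long deadline = System.currentTimeMillis() + (long) (timeoutSeconds * 1000)
    while (true) {
        def result = condition()
        if (result) {
            return result // condition met, hand back whatever it produced
        }
        if (System.currentTimeMillis() >= deadline) {
            throw new IllegalStateException("condition not met within ${timeoutSeconds}s")
        }
        sleep((long) (intervalSeconds * 1000)) // wait before the next attempt
    }
}

// Usage: the condition only becomes true on the third attempt
def attempts = 0
def result = pollUntil(2, 0.01) { ++attempts >= 3 }
assert result
assert attempts == 3
```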

Once the page is ready we want to get the list of the retrieved plugins' names

def pluginNames = $("div", class:'currentPlugin').collect{div->
        div.find("h4").find("a").text()    
    }
    
    println pluginNames
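
For reference, the "Page Object Pattern" mentioned earlier could look roughly like this for the same search. Treat it as a sketch under assumptions: PluginsPage, its content names and the searchPlugins helper are hypothetical names of mine, the selectors are simply copied from the script above, and it depends on the plugin portal still serving that markup.

```groovy
@Grapes([
    @Grab("org.codehaus.geb:geb-core:0.7.0"),
    @Grab("org.seleniumhq.selenium:selenium-htmlunit-driver:2.23.1"),
    @Grab("org.seleniumhq.selenium:selenium-support:2.23.1")
])
import geb.Browser
import geb.Page

// Hypothetical page object: the selectors live in one place instead of in the script
class PluginsPage extends Page {
    static url = "http://grails.org/plugins/"
    static content = {
        searchField { $("input", name: "q") }
        searchButton { $("input", class: "searchButton") }
        // required: false, so an empty result is not an error while we wait
        pluginBoxes(required: false) { $("div", class: "currentPlugin") }
    }
}

// Hypothetical helper wrapping the whole search
List<String> searchPlugins(String term) {
    def names = []
    Browser.drive {
        to PluginsPage
        searchField.value(term)
        searchButton.click()
        waitFor(10) { pluginBoxes }
        names = pluginBoxes.collect { box -> box.find("h4").find("a").text() }
    }
    names
}
```

Calling searchPlugins("geb") would then run the same steps as the script above, and any other script visiting that page can reuse PluginsPage.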

Bottom line, I would use Geb:

For testing
For scraping really complicated pages in an elegant way thanks to its DSL

Resources:

Scraping with Groovy (I): XmlSlurper

There are times when a man has to scrape web pages, and when those times come... you'd better have Groovy.

Here I'm gonna write about some things I've learned about scraping with Groovy. I will use core functionalities like XmlSlurper and GPath, and some more advanced tools like Geb.

Right now I'm following the UEFA EURO 2012 (Go Spain!!) and I want to get the final list of the teams playing the quarter finals. I already know the URL, so I just have to parse the page and take the list from the first table. In this post I'm gonna explain how to use XmlSlurper to do so.

Time to analyze:

The first thing I have to do is locate the HTML where the results are. Activate Firebug (Firefox) or "Inspect Element" in Safari and inspect the HTML. In this case we have the following structure:

(1) All phases are grouped inside div elements with class "box boxSimple", so I need the first one
(2) Inside that div we have to look for a list (ul element)
(3) Every element in the list corresponds to a team. I need the name of each team, which is stored in the @title attribute of the first anchor inside each element of the list.

Before coding:

We are gonna use XmlSlurper to parse an HTML page, but there's one problem: XmlSlurper only parses XML. Don't worry, to fix that you need the NekoHTML parser. It converts an HTML page into an XHTML page. Then you'll be able to parse the page and retrieve the information we are looking for.

Let's code:

I've used groovyConsole to do it. Pay attention to the dependency.

@Grapes([
    @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
])
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper

/* Getting the xhtml page thanks to Neko SAX parser */
def imdbPageXml = new XmlSlurper(new SAXParser()).
    parse("http://www.uefa.com/uefaeuro/season=2012/teams/index.html")    
    
def teams = 
 /* (1) Go to first <div class='box boxSimple'> */
    imdbPageXml.'**'.find{it.name() == 'DIV' && it.@class=='box boxSimple'}.
    /* (2) Follow the path */
        DIV[1].UL.LI.collect{li-> 
         /* (3) For each element in list get the title of the first anchor */
            li.'**'.find{it.name() == 'A'}*.@title
        }.flatten()

println teams
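
The same kind of traversal can be tried offline, without NekoHTML or a network connection. The sketch below uses a made-up, well-formed page with hypothetical team names, so plain XmlSlurper is enough. Note the assumptions: element names stay lower case here because Neko is not involved, and the path is shorter than in the script above because this sample markup is simpler than the real page.

```groovy
// Hypothetical, well-formed stand-in for the real page
def page = '''
<html><body>
  <div class="box boxSimple">
    <ul>
      <li><a title="Team A" href="#">A</a></li>
      <li><a title="Team B" href="#">B</a></li>
    </ul>
  </div>
  <div class="box boxSimple">
    <ul><li><a title="Team C" href="#">C</a></li></ul>
  </div>
</body></html>
'''

def html = new XmlSlurper().parseText(page)

// (1) First div with class 'box boxSimple', wherever it is in the document
def firstBox = html.'**'.find { it.name() == 'div' && it.@class == 'box boxSimple' }

// (2) Follow the path to the list items and
// (3) take the title of the first anchor inside each one
def teams = firstBox.ul.li.collect { li ->
    li.'**'.find { it.name() == 'a' }.@title.text()
}

assert teams == ['Team A', 'Team B']
```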

Once XmlSlurper returns a navigable instance of GPathResult the object can be traversed using GPath:

Go to first div with class 'box boxSimple'

imdbPageXml.'**'.find{it.name() == 'DIV' && it.@class=='box boxSimple'}.

To avoid spelling out the whole path to the element I'm looking for, I started with '**' and then a query. That line means "Look everywhere in the document and take the first element with name div and having the attribute class with value 'box boxSimple'". Remember, '**' can be really useful.

I use find instead of findAll to get only the first element. Also notice all HTML elements are referred to in upper case.
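
A quick way to see the difference between find and findAll is to run both with the same condition. This sketch uses a made-up, well-formed document parsed with plain XmlSlurper, so the element names are lower case here (the upper case names above come from the Neko parser).

```groovy
// Hypothetical sample document, only for illustration
def doc = new XmlSlurper().parseText('''
<page>
    <div class="box boxSimple"><p>first</p></div>
    <div class="box boxSimple"><p>second</p></div>
    <div class="other"><p>third</p></div>
</page>
''')

def isBox = { it.name() == 'div' && it.@class == 'box boxSimple' }

// find stops at the first match and returns a single element
def first = doc.'**'.find(isBox)
assert first.p.text() == 'first'

// findAll returns every match as a list
def all = doc.'**'.findAll(isBox)
assert all.size() == 2
```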

Follow the path to the elements of the list

DIV[1].UL.LI.collect{li->

This path is very short and I didn't know how to make it shorter, so I wrote the path explicitly.

For each element get the title of the first anchor

li.'**'.find{it.name() == 'A'}*.@title

"Everywhere inside this 'li' element look for an element with name 'a'. Then get its @title attribute value."

Bottom line, use XmlSlurper for scraping when:

You're scraping web services returning well-formed XML
You're scraping simple pages and you're not going to interact with the page. In that case use XmlSlurper with the Neko parser to ensure the page is converted to XML.

But:

If you're going to interact with the page and still want to do it with Groovy, then I'd recommend using Geb

Resources: