Monday, 1 April 2013

Groovy Xml Series: Querying Xml with GPath

The most common way of querying XML in Groovy is using GPath. The official page describes it as follows:

"GPath is a path expression language integrated into Groovy which allows parts of nested structured data to be identified. In this sense, it has similar aims and scope as XPath does for XML. The two main places where you use GPath expressions is when dealing with nested POJOs or when dealing with XML"

So it's similar to XPath expressions, and you can use it not only with XML but also with POJO classes. OK, so let's begin.
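To make the POJO side concrete before moving to XML, here is a minimal sketch of GPath over plain objects. The Author and Book classes are hypothetical, invented only for this illustration.

```groovy
// Hypothetical classes, invented only for this illustration
class Author { String name }
class Book { String title; Author author }

def books = [
    new Book(title: 'Don Xijote', author: new Author(name: 'Manuel De Cervantes')),
    new Book(title: 'Catcher in the Rye', author: new Author(name: 'JD Salinger'))
]

// A property read on a list is applied to every element of the list
assert books.title == ['Don Xijote', 'Catcher in the Rye']
assert books.author.name == ['Manuel De Cervantes', 'JD Salinger']

// Indexing and closures combine with the a.b.c notation
assert books[0].author.name == 'Manuel De Cervantes'
assert books.find { it.author.name == 'JD Salinger' }.title == 'Catcher in the Rye'
```

The same a.b.c style of navigation is what the XML examples rely on.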

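Given the following XML:

      <response>
          <value>
              <books>
                  <book id="1">
                      <title>Don Xijote</title>
                      <author id="1">Manuel De Cervantes</author>
                  </book>
                  <book id="2">
                      <title>Catcher in the Rye</title>
                      <author id="2">JD Salinger</author>
                  </book>
                  <book id="3">
                      <title>Alice in Wonderland</title>
                      <author id="3">Lewis Carroll</author>
                  </book>
                  <book id="4">
                      <title>Don Xijote</title>
                      <author id="4">Manuel De Cervantes</author>
                  </book>
              </books>
          </value>
      </response>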

Node's text content


The first thing we are going to do is get a value using POJO notation. Let's get the name of the first book's author (code is available on GitHub).

      def "Using POJO notation: Getting a node using POJOs notation a.b.c"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parse(xmlFile) 
          when: "Trying to get a given node using the a.b.c notation"
              def authorNode = response.value.books.book[0].author
          then: "We can check the author's value"
              authorNode.text() == 'Manuel De Cervantes'
      }

So first we parse the document with XmlSlurper (xmlFile is a variable of type java.io.File). The value returned by parse() represents the root element of the XML document, which in this case is "response".

That's why we start traversing the document from response and continue with value.books.book[0].author. Note that in XPath node positions start at [1] rather than [0], but because GPath follows Groovy's (and Java's) list indexing, it starts at index [0].
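The difference in index base is easy to check side by side. The following is a minimal sketch, assuming Groovy 3 or later (on Groovy 2.x drop the groovy.xml import, since XmlSlurper is available by default there) and an inline, trimmed copy of the sample document instead of xmlFile. The XPath part uses the JDK's standard javax.xml.xpath API.

```groovy
import groovy.xml.XmlSlurper
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

// Trimmed, inline stand-in for the sample document (an assumption of this sketch)
def xml = '''<response><value><books>
  <book id="1"><title>Don Xijote</title><author id="1">Manuel De Cervantes</author></book>
  <book id="2"><title>Catcher in the Rye</title><author id="2">JD Salinger</author></book>
</books></value></response>'''

// GPath: the first book is book[0]
def response = new XmlSlurper().parseText(xml)
def gpathAuthor = response.value.books.book[0].author.text()

// XPath: the first book is book[1]
def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))
def xpathAuthor = XPathFactory.newInstance().newXPath()
        .evaluate('/response/value/books/book[1]/author', doc)

assert gpathAuthor == 'Manuel De Cervantes'
assert xpathAuthor == 'Manuel De Cervantes'
```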

GPathResult (XmlSlurper) and Node (XmlParser)


In the end we have the "author" node, and because we want the text inside that node we call the text() method. The "author" node is an instance of GPathResult, and text() is a method that gives us the content of that node as a String.

When using GPath on XML parsed with XmlSlurper, the result is a GPathResult object. GPathResult has other convenient methods to convert the text inside a node to another type, such as:

  • toInteger()
  • toFloat()
  • toBigInteger()
  • ...
All these methods try to convert a String to a certain type.
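As a quick illustration of those conversions, here is a minimal sketch. The item document and its element names are hypothetical, made up only to have one value per type, and the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper

// Hypothetical document, invented only for this illustration
def item = new XmlSlurper().parseText(
    '<item><qty>3</qty><price>9.5</price><serial>12345678901234567890</serial></item>')

// text() always gives the raw String content
assert item.qty.text() == '3'

// The conversion helpers parse that text into the requested type
assert item.qty.toInteger() == 3
assert item.price.toFloat() == 9.5f
assert item.serial.toBigInteger() == new BigInteger('12345678901234567890')
```

Since these helpers parse the node's text, content that is not a valid number makes the conversion fail with a NumberFormatException.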

If we had parsed the XML with XmlParser instead, we would be dealing with instances of type Node. Still, most of the actions applied to GPathResult in these examples could be applied to a Node as well. The creators of both parsers took GPath compatibility into account.
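Here is a minimal sketch of the same path evaluated with both parsers. It assumes Groovy 3 or later for the imports and uses a trimmed, hypothetical document whose root is "books", so it is not the exact sample file.

```groovy
import groovy.xml.XmlParser
import groovy.xml.XmlSlurper

// Trimmed stand-in document (an assumption of this sketch)
def xml = '''<books>
  <book id="1"><title>Don Xijote</title><author id="1">Manuel De Cervantes</author></book>
  <book id="2"><title>Catcher in the Rye</title><author id="2">JD Salinger</author></book>
</books>'''

def slurped = new XmlSlurper().parseText(xml)   // GPathResult
def parsed  = new XmlParser().parseText(xml)    // groovy.util.Node

// The same a.b.c path and text() call work on both results
assert slurped.book[0].author.text() == 'Manuel De Cervantes'
assert parsed.book[0].author.text()  == 'Manuel De Cervantes'

// Attribute access: GPathResult wraps the value, Node returns a plain String
assert slurped.book[0].@id.text() == '1'
assert parsed.book[0].@id == '1'
```

One visible difference is the attribute value: with XmlSlurper it is still a GPathResult, while with XmlParser it is already a String.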

Attribute's content


The next step is to get a value from a given node's attribute. In the following sample we want to get the id of the first book's author. We'll be using two different approaches. Let's see the code first:


      def "Using POJO notation: Getting an attribute's value using POJOs notation a.b.c"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parse(xmlFile) 
          when: "Trying to get a given node using the a.b.c notation"
              def firstBook = response.value.books.book[0]
              def firstAuthorIdNode1 = firstBook.author.@id
              def firstAuthorIdNode2 = firstBook.author['@id']
          then: "Getting the id's value"
              firstAuthorIdNode1.toInteger() == 1
              firstAuthorIdNode2.toInteger() == 1
      }

Again we first parse the document and then, using POJO notation, we get the first book node. Now take a look at the two expressions:
  • firstBook.author.@id
  • firstBook.author['@id']

I especially like the former notation because it is more straightforward and meaningful. The latter is more like reading from a map (which I guess is what it boils down to in the end).
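Both notations can be tried in a standalone script. This is a minimal sketch, assuming Groovy 3 or later for the import and an inline, trimmed copy of the sample document. It also collects the id attribute of every book, which is not part of the original examples.

```groovy
import groovy.xml.XmlSlurper

// Trimmed, inline stand-in for the sample document (an assumption of this sketch)
def xml = '''<response><value><books>
  <book id="1"><title>Don Xijote</title><author id="1">Manuel De Cervantes</author></book>
  <book id="2"><title>Catcher in the Rye</title><author id="2">JD Salinger</author></book>
</books></value></response>'''
def books = new XmlSlurper().parseText(xml).value.books

def author = books.book[0].author

// Both notations reach the same attribute
assert author.@id.toInteger() == 1
assert author['@id'].toInteger() == 1

// The attribute notation also works inside a closure, once per book
def bookIds = books.book.collect { it.@id.toInteger() }
assert bookIds == [1, 2]
```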

Speeding things up: "breadthFirst()" and "depthFirst()"


If you have ever used XPath, you have probably used expressions like
  • "//" : Look everywhere
  • "/following-sibling::othernode" : Look for a later sibling node named "othernode" at the same level

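More or less we have their counterparts in GPath with the methods breadthFirst() and depthFirst(), plus the shorthand properties '*' and '**'. Note that '**' is shorthand for depthFirst(), while '*' is shorthand for children(), so it only reaches the direct children of a node rather than performing a full breadth-first traversal. The first example shows a simple use of '*'.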

      def "Using '*': Getting a node among the direct children with '*'"(){
          setup: "Parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          when: "Looking for the node having the name 'book'"
          and: "with attribute id equals to 2"
           /* '*' is shorthand for children(): it lets you look among the
              nodes directly below 'books' */
              def catcherInTheRye = response.value.books.'*'.find{node->
               /* node.@id == '2' could be expressed as node['@id'] == '2' */
                  node.name() == 'book' && node.@id == '2'
              }
          then: "Getting the title's value"
              catcherInTheRye.title.text() == 'Catcher in the Rye'
      }

This Spock specification examines only the direct children of the "books" node, because '*' returns a node's children and goes no deeper. (breadthFirst(), by contrast, visits the tree level by level, starting with the node it is called on.) Among those children, find returns the first node for which the expression inside the closure holds.

That expression says "Look for any node with a tag name equal to 'book' and having an id with a value of '2'".
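The visiting order of each traversal is easier to see on a tiny tree. The following is a minimal sketch with a hypothetical document (the element names a to e are made up), assuming Groovy 3 or later for the import.

```groovy
import groovy.xml.XmlSlurper

// Hypothetical tree: a has children b and c; b contains d, c contains e
def root = new XmlSlurper().parseText('<a><b><d/></b><c><e/></c></a>')

// '*' and children(): direct children only
assert root.'*'.collect { it.name() } == ['b', 'c']
assert root.children().collect { it.name() } == ['b', 'c']

// breadthFirst(): level by level, starting with the node itself
assert root.breadthFirst().collect { it.name() } == ['a', 'b', 'c', 'd', 'e']

// depthFirst() and '**': each branch is followed to the bottom first
assert root.depthFirst().collect { it.name() } == ['a', 'b', 'd', 'c', 'e']
assert root.'**'.collect { it.name() } == ['a', 'b', 'd', 'c', 'e']
```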

Today I woke up feeling very lazy and I'd like to look for a given value without caring where it might be. The only thing I know is that I need the id of the book written by "Lewis Carroll". How do I do that? Using depthFirst().

      def "Using '**': Getting a node using depthFirst operator '**'"(){
          setup: "parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          when: "Using the depthFirst operator we can look for something"
          and: "it doesn't matter how deep the node is"
          and: "Let's say we want to look for the book's id of the book written by Lewis Carroll"
           /* Beware of the name I used for the closure's parameter. It may look like
              the ** is too smart, but it isn't: it visits every node, and I'm just sure
              that only books will match the query. To avoid any confusion I'd rather use 'node' */
              def bookId = response.'**'.find{book->
                  book.author.text() == 'Lewis Carroll'
              }.@id
          then: "The bookId should be 3"
              bookId == "3"
      }

It's definitely shorter than using the POJO notation, isn't it? depthFirst() is the same as looking for something "everywhere in the tree from this point down". In this case we've used the method find(Closure cl) to find just the first occurrence.
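Two details are worth trying out: '**' and depthFirst() walk the same nodes, and find returns null when nothing matches, so chaining .@id onto a failed search is not safe. A minimal sketch, assuming Groovy 3 or later for the import and an inline, trimmed copy of the sample document:

```groovy
import groovy.xml.XmlSlurper

// Trimmed, inline stand-in for the sample document (an assumption of this sketch)
def xml = '''<response><value><books>
  <book id="2"><title>Catcher in the Rye</title><author>JD Salinger</author></book>
  <book id="3"><title>Alice in Wonderland</title><author>Lewis Carroll</author></book>
</books></value></response>'''
def response = new XmlSlurper().parseText(xml)

// The operator and the method give the same match
def viaOperator = response.'**'.find { it.author.text() == 'Lewis Carroll' }
def viaMethod = response.depthFirst().find { it.author.text() == 'Lewis Carroll' }
assert viaOperator.@id.toInteger() == 3
assert viaMethod.@id.toInteger() == 3

// "From this point down": the search can start at any node, not only the root
def fromBooks = response.value.books.depthFirst().find { it.author.text() == 'JD Salinger' }
assert fromBooks.@id.toInteger() == 2

// No match: find returns null, so check before reading attributes
def nobody = response.'**'.find { it.author.text() == 'Nobody' }
assert nobody == null
```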

What if we want to collect all the books' titles?


      
      def "Using '**': Collecting all titles"(){
          setup: "parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          when: "Looking for all titles within the document"
              def titles = response.'**'.findAll{node-> node.name() == 'title'}*.text()
          then: "There should be only four"
              titles.size() == 4
      }

I've mentioned there are some useful methods that convert a node's value to an integer, a float, etc. Those methods are convenient when doing comparisons like this:

      def "Using findAll: Collecting titles of books with an id greater than 2"(){
          setup: "parsing the document"
              def response = new XmlSlurper().parse(xmlFile)
          when: "Looking for all titles with an id greater than 2"
              def titles = response.value.books.book.findAll{book->
               /* You can use toInteger() over the GPathResult object */
                  book.@id.toInteger() > 2
              }*.title
          then: "There should be only two"
              titles.size() == 2
      }

In this case the number 2 has been hardcoded, but imagine that value could have come from any other source (GORM ids, etc.).
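Here is a minimal sketch of the same query with the threshold held in a variable. The name minId is hypothetical and simply stands in for a value obtained elsewhere. The book ids in the inline document are assumed to run from 1 to 4, and the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper

// Inline stand-in for the sample document; ids 1..4 are an assumption of this sketch
def xml = '''<response><value><books>
  <book id="1"><title>Don Xijote</title><author>Manuel De Cervantes</author></book>
  <book id="2"><title>Catcher in the Rye</title><author>JD Salinger</author></book>
  <book id="3"><title>Alice in Wonderland</title><author>Lewis Carroll</author></book>
  <book id="4"><title>Don Xijote</title><author>Manuel De Cervantes</author></book>
</books></value></response>'''
def response = new XmlSlurper().parseText(xml)

// Hypothetical threshold; in a real application it could come from a database id
def minId = 2

// toInteger() makes the comparison numeric rather than textual
def titles = response.value.books.book
        .findAll { it.@id.toInteger() > minId }
        .collect { it.title.text() }

assert titles == ['Alice in Wonderland', 'Don Xijote']
```

Comparing the raw text instead would order "10" before "2", which is why the numeric conversion matters once ids go past a single digit.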

Comments:

  1. Hi, what if I wanted to put a part of the GPath as a variable?

    i.e.:
    ==============================================================================
    def gpathPiece = '**'.find{book-> book.author.text() == 'Lewis Carroll'}.@id

    def bookId = response.gpathPiece //This part is not working for me
    System.out.println(bookId)
    ==============================================================================
    I am trying to use a variable to represent that piece of the GPath, but it doesn't work. Confirmed that when I put that variable's value directly back into the GPath, it works.

    Any suggestions? Thanks
