Monday, 22 April 2013

Some ideas about processing a text file with Groovy

I really hated reading files in Java. At least BufferedReader lets you read a file line by line with readLine(), and since JDK7 try-with-resources even takes care of closing the stream for you (related article at DZone). Great! But can Groovy do better? I hope so :P

A couple of days ago I was looking through Groovy's GDK documentation for java.io.File, trying to find the easiest way to iterate through a text file while collecting information along the way.

I started doing my research with the following data:

JANUARY     100.23
FEBRUARY    23.34
MARCH       45.56
APRIL       67
MAY         78.2
JUNE        23.3
JULY        92.2
AUGUST      802.2
SEPTEMBER   87.3
OCTOBER     2.2
NOVEMBER    3.2
DECEMBER    150.4

The research


First, I looped through every line in the file using the Java approach:

def file2Process = new File("/pathtofile/file.txt")
def acc = 0
def javaReader = file2Process.newReader()
while((next = javaReader.readLine()) != null){     
    acc += next.split(/\s{1,}/)?.getAt(1)?.toDouble()?:0
}

javaReader.close()
println "Java (1)->$acc"

Not bad. BufferedReader is a nice abstraction, but the while statement does too many things at once: a declaration, an assignment, and a condition all together.

Afterwards, and I don't remember what prompted it, I tried iterating over the reader returned by the newReader() method (added by Groovy to the File class). As you may guess, it looked nicer:

def reader = file2Process.newReader()
def perLine = {line-> line.split(/\s{1,}/)?.getAt(1)?.toDouble()?:0 }

def totalAmount = reader.collect(perLine).sum()

reader.close()

println "Groovy (1)->$totalAmount"

So I guess all readers are iterable, aren't they? I don't remember them being iterable in Java, or maybe I just forgot. But I still had to close the reader explicitly. Come on! JDK7 already does that for you. Groovy, you can do better, can't you?
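To check that hunch, here is a minimal sketch. As far as I can tell, a plain java.io.Reader does not implement Iterable in Java; it is Groovy that adds an iterator() method walking the reader one line at a time, which is what makes collect(...) work on it. The in-memory sample and the variable names are assumptions made only so the snippet runs without a file on disk.

```groovy
// Hypothetical in-memory sample, standing in for the real file
def sample = new StringReader('JANUARY     100.23\nFEBRUARY    23.34\nMARCH       45.56')

// A java.io.Reader is not a java.lang.Iterable in plain Java
def isIterable = sample instanceof Iterable

// Groovy adds iterator() to Reader; each element is one line of text
def lineIterator = sample.iterator()
def firstLine = lineIterator.next()
def remaining = lineIterator.toList()

sample.close()
println "iterable=$isIterable first=$firstLine remaining=${remaining.size()}"
```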

Yes, and it was my fault: I hadn't noticed the withXXX methods that Groovy adds to the java.io.File class. These methods take a closure as a parameter. Inside that closure you can use the underlying reader or stream, and when the closure finishes the method takes care of closing the reader/stream.

def perLine = {line-> line.split(/\s{1,}/)?.getAt(1)?.toDouble()?:0 }
def result = file2Process.withReader{r->
    r.collect(perLine).sum()
}

println "Groovy (2)->$result"

The good thing about this is that you can still use it with JDK6 (though I encourage you to move to JDK7).
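To see the closing behavior for yourself, here is a small sketch. It assumes an in-memory BufferedReader instead of a real file (Groovy also offers withReader on readers themselves), and the variable names are made up for the example.

```groovy
// Hypothetical in-memory reader, standing in for file2Process.newReader()
def reader = new BufferedReader(new StringReader('JANUARY     100.23\nFEBRUARY    23.34'))

// withReader hands the reader to the closure, returns the closure's result
// and closes the reader once the closure has finished
def firstLine = reader.withReader { r -> r.readLine() }

// A closed BufferedReader refuses further reads with an IOException
def closedAfterwards = false
try {
    reader.readLine()
} catch (IOException expected) {
    closedAfterwards = true
}

println "first line: $firstLine, closed afterwards: $closedAfterwards"
```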

The "real" mission


The task that I needed to do was to process all lines of a text file gathering different chunks of information.

Also motivated by the rant around "functional vs OOP" I wanted to process the file in a "functional" way, which means not to use temporary variables outside the scope of the closure (aka no side effects).

That's why I found the idea that readers could be iterable really impressive. If readers were iterable, I would be able to use collection methods like "inject" to populate a map with the different pieces of information collected throughout the file.

Of course the file I'm showing here has little to do with the real one, which has thousands of lines (and dozens of fields) coming from a legacy COBOL system, but it works for the sake of the explanation.

So let's say I wanted to return a map with different values:
  • Total number of lines
  • Total amount
  • Total amount by "quarter" (in the code below, q1, q2 and q3 are actually three four-month periods)

I came up with this solution:

// Months are matched on their first three letters (see take(3) below),
// so the lists must contain 'APR' rather than 'APRIL'
def firstQ = ['JAN','FEB','MAR','APR']
def secondQ = ['MAY','JUN','JUL','AUG']
def thirdQ = ['SEP','OCT','NOV','DEC']

def data = file2Process.withReader{r->
    r.inject([q1:0,q2:0,q3:0,lines:0,total:0]){map,val->
        def lineInfo = val.split(/\s{1,}/)
        def month = lineInfo?.getAt(0)?.take(3)
        def amount = lineInfo?.getAt(1)?.toBigDecimal()
        // A list used as a case matches when it contains the switch value
        switch(month){
            case firstQ:
                map.q1 += amount
                break
            case secondQ:
                map.q2 += amount
                break
            case thirdQ:
                map.q3 += amount
                break
        }
        map.total += amount
        map.lines++
        /* Don't forget to return the map */
        map
    }
}

Which returns the map:

[q1:236.13, q2:995.9, q3:243.1, lines:12, total:1475.13]
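Strictly speaking, the closure above still mutates the accumulator map on every step. If you want to get closer to the "no side effects" idea, one option is to build a new map on each iteration. The following is only a sketch under that assumption: periodOf, seed and the three-line in-memory sample are hypothetical names and data, not part of the real job.

```groovy
// Hypothetical three-line sample, one line per period
def sample = '''JANUARY     100.23
MAY         78.2
SEPTEMBER   87.3'''

// Maps a month name to the key of the period it belongs to
def periodOf = { String month ->
    switch (month.take(3)) {
        case ['JAN', 'FEB', 'MAR', 'APR']: return 'q1'
        case ['MAY', 'JUN', 'JUL', 'AUG']: return 'q2'
        default: return 'q3'
    }
}

def seed = [q1: 0, q2: 0, q3: 0, lines: 0, total: 0]

def data = new StringReader(sample).withReader { r ->
    r.inject(seed) { acc, line ->
        def (month, amountText) = line.split(/\s+/)
        def amount = amountText.toBigDecimal()
        def key = periodOf(month)
        // Map + Map returns a new map, so neither acc nor seed is modified
        acc + [(key): acc[key] + amount, lines: acc.lines + 1, total: acc.total + amount]
    }
}

println data
```

The price is one extra map per line, which is probably fine for a few thousand lines but worth measuring on bigger files.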

One complaint

  • In the real file I had to do some validations involving financial algorithms, and I would have liked to use GPars for them. However, I wasn't able to make GPars' foldParallel(..) method work the same way as the inject(...) method does. I've sent the question to the mailing list and I'll update this entry as soon as I get an answer.
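While waiting for that answer, here is my guess, and it is only an assumption: a parallel fold generally wants an associative function that combines two values of the same type, whereas the inject(...) closure combines a map with a line of text. The sketch below uses plain Groovy (no GPars at all) to reshape the problem that way: first turn every line into a small partial result, then merge partial results pairwise. toPartial and merge are hypothetical names.

```groovy
// Hypothetical sample lines
def sample = ['JANUARY     100.23', 'MAY         78.2', 'DECEMBER    150.4']

// Step 1 (map): every line becomes a partial result with the same shape
def toPartial = { String line ->
    [lines: 1, total: line.split(/\s+/)[1].toBigDecimal()]
}

// Step 2 (reduce): merging two partial results is associative,
// so the order in which pairs are combined does not change the outcome
def merge = { Map a, Map b ->
    [lines: a.lines + b.lines, total: a.total + b.total]
}

def partials = sample.collect(toPartial)
def leftToRight = partials.inject(merge)
def rightToLeft = partials.reverse().inject(merge)

println "$leftToRight / $rightToLeft"
```

Whether this shape is what foldParallel(..) actually needs is something I still have to confirm.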

Resources

Take a look at the transformLine(...) method, which transforms the lines of a given file. It sprang to mind that if I ever have to pre-process a file, I may need that kind of behavior.
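As a quick taste, here is a minimal sketch using the Reader variant I know of, transformLine(Writer, Closure), on in-memory data. The sample content and the ';' separator are assumptions chosen only for illustration.

```groovy
// Hypothetical in-memory input and output, instead of real files
def input = new StringReader('JANUARY     100.23\nFEBRUARY    23.34')
def output = new StringWriter()

// Every line goes through the closure; the result is written to the writer
input.transformLine(output) { line -> line.replaceAll(/\s+/, ';') }

def transformed = output.toString().readLines()
println transformed
```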

