Posted to general@hadoop.apache.org by Paul Tarjan <pa...@paulisageek.com> on 2009/11/04 21:07:53 UTC

Python CSV record reader

I wrote a parser for CSV-serialized Jute records in Python, and thought
some people on the list might also be interested.

http://github.com/ptarjan/hadoop_record

You can use it in your streaming jobs with:

-inputformat SequenceFileAsTextInputFormat -file hadoop_record.mod

Here are a few of its features:

>>> from hadoop_record import csv
>>> csv("T")
True
>>> csv(";-1234")
-1234
>>> csv("1.0E-10")
1e-10
>>> csv("s{T,F}")
[True, False]
>>> csv("v{T,F}")
[True, False]
>>> csv("v{s{T,F}}")
[[True, False]]
>>> csv("m{'don't,#73746f70}")
{LazyString("don't"): LazyString('stop')}
>>> csv("'\xe2\x98\x83")
LazyString('\xe2\x98\x83')
>>> str(csv("'\xe2\x98\x83"))
'\xe2\x98\x83'
>>> unicode(csv("'\xe2\x98\x83"))
u'\u2603'
>>> csv("'%00%0a%25%2c")
LazyString('\x00\n%,')
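
For anyone wondering how the %-escapes in the last example decode, here
is a minimal sketch. It is not the code from hadoop_record: the function
name unescape_csv_string is made up for illustration, it is written for
Python 3, and it assumes that every %XX pair is a hex-encoded character
and that string fields start with a single quote, as in the examples
above.

```python
import re

# A '%' followed by exactly two hex digits, e.g. %0a or %2C.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_csv_string(field):
    """Decode a CSV string field such as "'%00%0a%25%2c" (hypothetical helper)."""
    if not field.startswith("'"):
        raise ValueError("expected a string field starting with a single quote")
    body = field[1:]
    # re.sub scans left to right without rescanning its own output,
    # so a decoded '%' (from %25) is never treated as a new escape.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)


decoded = unescape_csv_string("'%00%0a%25%2c")
```

Decoding in a single pass matters here: replacing %25 first and the
other escapes afterwards would turn an input like %252c into a comma
instead of the literal text %2c.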

I added LazyString because most of my CPU time was going to decoding
fields of the Jute record that I didn't care about. It shouldn't get in
your way too much, as long as you convert it with str() before using
it.
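
The idea of deferring the decode can be sketched roughly as follows.
This is an illustration only, not the LazyString class from
hadoop_record: the names LazyText and strip_quote are hypothetical, the
decoder is a stand-in, and the code targets Python 3.

```python
class LazyText:
    """Hypothetical wrapper: keep the raw field, decode only when asked."""

    def __init__(self, raw, decoder):
        self._raw = raw          # undecoded field text
        self._decoder = decoder  # callable that does the expensive work
        self._value = None
        self._decoded = False

    def __str__(self):
        # Decode on first use and cache the result for later calls.
        if not self._decoded:
            self._value = self._decoder(self._raw)
            self._decoded = True
        return self._value

    def __repr__(self):
        return "LazyText(%r)" % str(self)


def strip_quote(raw):
    """Stand-in decoder: drop the leading quote of a CSV string field."""
    return raw[1:]


field = LazyText("'don't", strip_quote)  # nothing is decoded yet
text = str(field)                        # the decode happens here, once
```

Fields that are never converted never pay the decoding cost. In this
sketch the wrapper also does not compare equal to a plain str, which is
one reason to convert with str() before comparing values or using them
as lookup keys.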

Let me know what you think. If you find a bug, fork the repository, fix
it, and send me a pull request (or use the issue tracker).

Paul