Posted to common-user@hadoop.apache.org by Paul Ingles <pa...@oobaloo.co.uk> on 2010/01/05 14:27:55 UTC

StreamXmlRecordReader

Hi,

We're looking to convert some Ruby/C libxml XML-processing code over
to Hadoop. Currently the reports are transformed into CSV output that
is easier for the downstream systems to consume. We already use
Hadoop (streaming) quite extensively for the rest of our daily batches,
so we'd like to make more of our cluster resources.

The XML files are pretty straightforward. The elements we're  
interested in look similar to this:

<row attr1="val1" attr2="val2"></row>

There are perhaps 15-20 attributes per 'row' element.

I've tried running a streaming job using the StreamXmlRecordReader
(via the -inputreader parameter), but it seems to split the records at
odd boundaries: the input file is ~450MB, yet a simple job that maps
just the key (i.e. keeps only the row content) generates multiple
gigabytes of output.
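
For reference, the invocation looks roughly like this (the paths and
the exact begin/end strings here are placeholders, so treat it as a
sketch rather than a verbatim copy of our job):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input /reports/2010-01-04.xml \
    -output /reports/rows-out \
    -inputreader "StreamXmlRecord,begin=<row,end=</row>" \
    -mapper /bin/cat \
    -numReduceTasks 0

Since the mapper is just cat, I'd expect the output to be roughly the
size of the input, which is why the multi-gigabyte result looks wrong
to me.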

I think the problem is related to the fact that the files generally
contain only three lines: the XML declaration, an empty line, and then
all the rows on a single line. My understanding of RecordReaders is
that each one should know how to interpret its split and emit every
record that starts within it, and that when a record spans two splits
it can read the remaining bytes from the neighbouring split. Is that
right?
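
To make sure I'm describing that mental model precisely, here's a
tiny, Hadoop-free sketch of the behaviour I'd expect (the class name,
element names and split size are just illustrative, not anything from
the StreamXmlRecordReader source): a record is emitted by the split in
which its start tag begins, and that reader is allowed to read past
the split boundary to reach the closing tag.

import java.util.ArrayList;
import java.util.List;

public class SplitContractDemo {

  private static final String BEGIN = "<row ";  // trailing space so "<rows>" never matches
  private static final String END = "</row>";

  // Emit every record whose BEGIN tag starts inside [start, end); the END tag
  // may lie beyond `end`, in which case we simply keep reading past the split.
  static List<String> readSplit(String data, int start, int end) {
    List<String> records = new ArrayList<String>();
    int pos = data.indexOf(BEGIN, start);
    while (pos >= 0 && pos < end) {
      int close = data.indexOf(END, pos);
      if (close < 0) {
        break;                                  // unterminated record: ignore
      }
      records.add(data.substring(pos, close + END.length()));
      pos = data.indexOf(BEGIN, close);
    }
    return records;
  }

  public static void main(String[] args) {
    String data = "<rows>"
        + "<row attr1=\"a\"></row>"
        + "<row attr1=\"b\"></row>"
        + "<row attr1=\"c\"></row>"
        + "</rows>";
    int splitSize = 40;                         // deliberately cuts the second row in half
    for (int start = 0; start < data.length(); start += splitSize) {
      int end = Math.min(start + splitSize, data.length());
      System.out.println("split [" + start + "," + end + ") -> "
          + readSplit(data, start, end));
    }
    // Each row is emitted exactly once, even though row "b" straddles the
    // boundary at byte 40, because only the split containing its start emits it.
  }
}

If StreamXmlRecordReader follows that contract, each row should appear
exactly once in the output regardless of where the splits fall.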

Do I need to write a custom InputFormat to perform splits that honour  
the record boundaries?

Thanks,
Paul