Posted to common-user@hadoop.apache.org by Bradford Stephens <br...@gmail.com> on 2008/05/22 03:35:51 UTC

Avoiding Newline Problems in Hadoop Streaming + StreamXMLRecordReader

Greetings,

I have an interesting problem I'm trying to solve. I currently store a large
number of web pages in one big XML file in Hadoop, and I'm trying to parse
information out of those pages with a complex C# program running on Mono
(I'm in a Linux environment). I'm therefore using Hadoop Streaming with the
StreamXmlRecordReader to feed the records to my C# parser. The problem is
that, even though each page is wrapped in XML, Hadoop Streaming still ends
records at newlines, which makes the map input pretty much useless. Does
anyone have any hints on how to get around this?

Here's the XML structure I'm trying to use:

<ContentRecord><RecordURL>http://www.blah</RecordURL><PageContent><![CDATA[page
text would be here including newlines ]]></PageContent></ContentRecord>
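
A possible workaround, sketched here in Python for brevity rather than C#, is
to let the mapper glue the incoming lines back together until it sees the
closing tag. This is only a sketch under assumptions: the names iter_records
and main are hypothetical, the closing tag is taken from the structure above,
and it assumes the literal text </ContentRecord> never appears inside the
CDATA section. It does not account for any extra framing Hadoop Streaming may
add to the lines.

```python
import sys

# Closing tag of one record, taken from the XML structure above.
END_TAG = "</ContentRecord>"


def iter_records(lines, end_tag=END_TAG):
    """Yield whole records from an iterable of lines.

    Lines are accumulated until end_tag appears, so a record whose
    CDATA section contains newlines is returned in one piece.
    """
    pending = ""
    for line in lines:
        pending += line
        # A single line may close one record (or several).
        while end_tag in pending:
            cut = pending.index(end_tag) + len(end_tag)
            # Drop leading whitespace left over from the previous record.
            yield pending[:cut].lstrip()
            pending = pending[cut:]


def main():
    # Hypothetical mapper body: emit the length of each rebuilt record.
    for record in iter_records(sys.stdin):
        sys.stdout.write("%d\n" % len(record))
```

A streaming mapper script would call main() at the end; the real parsing
would replace the length calculation.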

Any ideas?

Cheers,
Bradford