Posted to common-dev@hadoop.apache.org by "David Campbell (JIRA)" <ji...@apache.org> on 2008/05/29 20:47:45 UTC
[jira] Created: (HADOOP-3465) org.apache.hadoop.streaming.StreamXmlRecordReader
org.apache.hadoop.streaming.StreamXmlRecordReader
-------------------------------------------------
Key: HADOOP-3465
URL: https://issues.apache.org/jira/browse/HADOOP-3465
Project: Hadoop Core
Issue Type: Bug
Components: contrib/streaming
Affects Versions: 0.17.0
Environment: java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Client VM (build 10.0-b22, mixed mode, sharing)
Linux hadoop-master 2.6.24.7-92.fc8 #1 SMP Wed May 7 16:50:09 EDT 2008 i686 i686 i386 GNU/Linux
hadoop-0.17.0
Reporter: David Campbell
Fix For: 0.17.0
I downloaded and installed the 0.17.0 version this morning.
I'm trying to use the StreamXmlRecordReader to parse a file that is formatted like this:
<results>
<row>
<FIELD1>value</FIELD1>
..... many fields.
</row>
</results>
Each logical row has about 1,371 characters in it.
I have the following settings in my job.
conf.set("stream.recordreader.begin", "<row>");
conf.set("stream.recordreader.end", "</row>");
conf.set("stream.recordreader.maxrec", "500000");
When I run my tests, the TaskTracker shows me a severely truncated row like this:
Processing record=<row>
<FIELD1><![CDATA[]]></FIELD2>
<FIELD2><![CDATA[TL]]></FIELD2>
<FIELD3><![CDATA[0003779]]></FIELD3>
<FIELD4><![CDATA[ABCD]]></FIELD4>
I've tried setting the maxrec limit, but even the default should be (as I read the code) more than big enough to handle the ~1,371 characters from <row> to </row>.
As you might expect, the XML parser in my Mapper task blows up because most of the content between <row> and </row> is missing.
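For reference, here is a minimal standalone sketch of the result expected from begin/end record splitting on input like the sample above. It is not the StreamXmlRecordReader source and does not use any Hadoop classes. The class name DelimitedRecordSketch, the method extractRecords, and the way maxRec is applied (dropping oversized records) are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop.
public class DelimitedRecordSketch {

    // Returns every substring that starts with `begin` and ends with the next `end`.
    // Records longer than maxRec characters are skipped (an assumed policy for this sketch).
    static List<String> extractRecords(String input, String begin, String end, int maxRec) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = input.indexOf(begin, pos);
            if (start < 0) {
                break; // no further begin marker
            }
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) {
                break; // unterminated record, nothing complete to emit
            }
            int recordEnd = stop + end.length();
            if (recordEnd - start <= maxRec) {
                records.add(input.substring(start, recordEnd));
            }
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<results>\n"
                + "<row>\n<FIELD1>value</FIELD1>\n</row>\n"
                + "<row>\n<FIELD1>other</FIELD1>\n</row>\n"
                + "</results>";
        // Same marker values as the settings quoted above.
        for (String record : extractRecords(xml, "<row>", "</row>", 500000)) {
            System.out.println("Processing record=" + record);
        }
    }
}
```

With the sample input, each emitted record should run from <row> through the matching </row>, which is what the Mapper's XML parser needs in order to parse it.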
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HADOOP-3465) org.apache.hadoop.streaming.StreamXmlRecordReader
Posted by "David Campbell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Campbell resolved HADOOP-3465.
------------------------------------
Resolution: Cannot Reproduce
I was able to get the full record. I spent some more time reading the code and figured out what changes I needed to make to my code.