Posted to mapreduce-user@hadoop.apache.org by wd <wd...@wdicc.com> on 2009/11/10 04:54:54 UTC

Hadoop streaming job issue

hi,

I'm trying to write a Hadoop streaming job in Perl, but I'm completely
confused by the key/value separators.

I found lots of separators I can set ...

# -jobconf stream.map.output.field.separator=A \
# -jobconf stream.reducer.output.field.separator=B \
# -jobconf mapred.textoutputformat.separator=C \
# -jobconf key.value.separator.in.input.line=D \
# -jobconf stream.map.output.field.separator=A \
# -jobconf stream.reduce.input.field.separator=AA \
# -jobconf stream.reduce.output.field.separator=B \
# -jobconf map.output.key.field.separator=C \

But what do these separators mean?

I tried to use ^A in my job and found this
bug <http://issues.apache.org/jira/browse/HADOOP-3341>. It seems Hadoop
fixed it in 0.19.0, but I still get the following error when I set a
separator to ^A.

[Fatal Error] :49:68: Character reference "&#1" is an invalid XML character.
09/11/10 11:10:16 FATAL conf.Configuration: error parsing conf file: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1167)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1039)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:381)
    at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1630)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:214)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:93)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:372)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:873)
    at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:118)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:239)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1091)
    ... 19 more

So, does this mean I can't use ^A as the separator?
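
For what it's worth, the parse failure in the trace can be reproduced without Hadoop. XML 1.0 does not allow U+0001 as a character, even when it is written as a character reference, so a conforming parser rejects "&#1;". The sketch below is in Python rather than Java and uses a made-up config fragment: the property name comes from the list above, while the fragment itself and the helper name parses_ok are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def parses_ok(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical config fragment whose value is a tab, which XML 1.0 allows.
tab_conf = (
    "<property>"
    "<name>stream.map.output.field.separator</name>"
    "<value>&#9;</value>"
    "</property>"
)

# The same fragment with ^A (U+0001), which XML 1.0 forbids.
ctrl_a_conf = tab_conf.replace("&#9;", "&#1;")

print("tab separator parses:", parses_ok(tab_conf))
print("^A separator parses:", parses_ok(ctrl_a_conf))
```

The second fragment fails for the same reason as the conf file in the trace: the character reference itself is invalid, regardless of which property carries it.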

Re: Hadoop streaming job issue

Posted by wd <wd...@wdicc.com>.

Oh, thanks very much, I found the picture.
Thanks

2009/11/17 Jason Venner <ja...@gmail.com>

> There is a very clear picture in chapter 8 of pro hadoop, on all of the
> separators for streaming jobs.

Re: Hadoop streaming job issue

Posted by Jason Venner <ja...@gmail.com>.

There is a very clear picture in chapter 8 of Pro Hadoop that covers all of
the separators for streaming jobs.
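
As a rough illustration of what a setting like stream.map.output.field.separator controls: per the Hadoop streaming documentation, each line a script writes is split into a key and a value at a separator (tab by default), and a line without enough separators becomes a key with an empty value. The sketch below mimics that rule in Python. split_key_value and its parameters are hypothetical names, not Hadoop APIs, and num_key_fields stands in for the "number of key fields" style of setting, which is not mentioned in this thread.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split one output line into (key, value) at the num_key_fields-th separator."""
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        # Not enough separators: the whole line is the key, the value is empty.
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


ctrl_a = "\x01"  # the ^A character discussed in this thread

# Default behaviour: split at the first tab.
print(split_key_value("user42\tclicked\thome"))

# ^A as the separator instead of tab.
print(split_key_value("user42" + ctrl_a + "clicked", separator=ctrl_a))

# Two fields form the key when num_key_fields is 2.
print(split_key_value("a.b.c.d", separator=".", num_key_fields=2))

# No separator at all: the whole line is the key.
print(split_key_value("no separator here"))
```

The other properties in the list apply the same idea at different stages of the job, which is why the diagram in the book is worth looking at.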


On Tue, Nov 10, 2009 at 6:53 AM, wd <wd...@wdicc.com> wrote:

> You mean the ^A ?
> I tried \u0001 and \x01, the streaming job recognise it as a string, not
> ^A..
>
> :(


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals

Re: Hadoop streaming job issue

Posted by wd <wd...@wdicc.com>.

You mean for the ^A?
I tried \u0001 and \x01, but the streaming job treats each of them as a
literal string, not as ^A.

:(
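
The symptom described here, an escape such as \u0001 arriving as literal text, comes down to the difference between a six-character string and a single control character. The sketch below only illustrates that difference in Python, along with one way a script could decode an escaped argument itself. It makes no claim about how a given Hadoop version treats the option value, and decode_separator is a hypothetical helper.

```python
def decode_separator(text):
    """Turn escapes such as \\u0001 or \\x01 into the characters they denote.

    Assumes the escaped form is plain ASCII, which holds for these examples.
    """
    return text.encode("ascii").decode("unicode_escape")


literal = r"\u0001"  # what arrives when nothing interprets the escape
decoded = decode_separator(literal)

print("literal length:", len(literal))
print("decoded length:", len(decoded))
print("decoded is ^A:", decoded == "\x01")
print("\\x01 decodes the same way:", decode_separator(r"\x01") == decoded)
```

A script that takes its separator as a command-line argument could apply a step like this before splitting its input, so the job configuration never has to carry the raw control character.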

2009/11/10 Amogh Vasekar <am...@yahoo-inc.com>

>  Hi,
> I’m pretty sure you need to specify unicode equivalent, or atleast that is
> what I used in my java map-red program.
>
> Amogh

Re: Hadoop streaming job issue

Posted by Amogh Vasekar <am...@yahoo-inc.com>.

Hi,
I'm pretty sure you need to specify the Unicode equivalent, or at least that is what I used in my Java MapReduce program.

Amogh

