Posted to common-user@hadoop.apache.org by Andreas Kostyrka <an...@kostyrka.org> on 2008/03/18 22:17:56 UTC

streaming problem

Hi!

I'm trying to run a streaming job on Hadoop 0.16.0. I've distributed the
scripts that the job uses to all nodes:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper ~/dist/workloadmf -reducer NONE -input testlogs/* -output testlogs-output

Now, this gives me:

java.io.IOException: log:null
R/W/S=1/0/0 in:0=1/2 [rec/s] out:0=0/2 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Tue Mar 18 21:06:13 GMT 2008
java.io.IOException: Broken pipe
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:260)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
	at java.io.DataOutputStream.flush(DataOutputStream.java:106)
	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)


	at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

Any ideas what my problem could be?

TIA,

Andreas
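
The "java.io.IOException: Broken pipe" from PipeMapper above usually just means the mapper child process exited before it had consumed its input, so the interesting error is whatever killed the script on the task node. A cheap first check is to run the script outside Hadoop on a small local sample and inspect its exit status; sample.log.gz here is a hypothetical local copy of one of the input files:

zcat sample.log.gz | ~/dist/workloadmf > /dev/null; echo $?

A non-zero status (or an immediate interpreter traceback) points at the script itself, e.g. a missing module or interpreter on the node, rather than at the job setup.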

Re: streaming problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Ok, tracked it down. It seems Hadoop Streaming "corrupts" the input
files. Is there any way to force it to pass whole files, one file per mapper?

TIA,

Andreas
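
One workaround for the whole-file question above: make the job input a file that lists the real input paths, one per line, so the framework only ever splits the (tiny) list, and have the mapper fetch and process each named file itself. A minimal sketch of such a mapper, assuming hadoop is on the PATH of the task nodes and with process_one_file standing in for the real per-file logic:

#!/bin/sh
# Each input record is an HDFS path; fetch the whole file,
# decompress it (the inputs here are gzipped), and process it.
while read fname; do
    hadoop dfs -cat "$fname" | gunzip | ./process_one_file
done

With -reducer NONE this keeps each file's output intact in a single map task, at the cost of data locality, since any map task may end up fetching any file.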

On Wednesday, 19.03.2008, 09:18 +0100, Andreas Kostyrka wrote:
> The /home/hadoop/dist/workloadmf script is available on all nodes.
> [...]

Re: streaming problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
The /home/hadoop/dist/workloadmf script is available on all nodes.

But it was missing one package it needed to run correctly ;(

Anyway, I still have the problem that, running with
-reducer NONE, my output seems to get lost. Some of the
output files contain a small number of output lines, but not many :(
(And the expected size of each output file was around 25 MB or so :( )

Ah the joys,

Andreas
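
With -reducer NONE the map output should be written straight to the output directory, one part file per map task, so listing and sampling those files shows where the data is (or isn't) going; part-00000 is just the first map's output:

bin/hadoop dfs -ls testlogs-output
bin/hadoop dfs -cat testlogs-output/part-00000 | head

If the part files exist but are nearly empty, the maps ran and produced little, which again points at the mapper script (or at it seeing different input than expected) rather than at the output side.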

On Wednesday, 19.03.2008, 10:13 +0530, Amareshwari Sriramadasu wrote:
> Hi Andreas,
> Looks like your mapper is not available to the streaming jar.
> [...]

Re: streaming problem

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Hi Andreas,
Looks like your mapper is not available to the streaming jar. Where is
your mapper script? Did you use the distributed cache to distribute the mapper?
You can use -file <mapper-script-path on local fs> to make it part of the
jar, or use -cacheFile /dist/workloadmf#workloadmf to distribute the
script. Distributing this way will add your script to the PATH.

So now your command will be:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper workloadmf -reducer NONE -input testlogs/* -output testlogs-output -cacheFile /dist/workloadmf#workloadmf

or

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper workloadmf -reducer NONE -input testlogs/* -output testlogs-output -file <path-on-local-fs>

Thanks,
Amareshwari
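
Whichever way the script is shipped, streaming just executes it: each input record arrives as one line on the script's stdin, and every line it writes to stdout becomes an output record, so the shipped file also needs a shebang line so the task can execute it directly. A minimal identity mapper, only to illustrate the contract (a stand-in, not the real workloadmf):

#!/bin/sh
# Identity mapper: emit every input record unchanged.
while IFS= read -r line; do
    printf '%s\n' "$line"
done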

Andreas Kostyrka wrote:
> Some additional details, if it helps: the HDFS is hosted on AWS S3,
> and the input file set consists of 152 gzipped Apache log files.
> [...]


Re: streaming problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Some additional details, if it helps: the HDFS is hosted on AWS S3,
and the input file set consists of 152 gzipped Apache log files.

Thanks,

Andreas
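
Gzip files are not splittable, so each of the 152 .gz files should become exactly one map input, and the record reader should hand the mapper decompressed lines rather than raw gzip bytes. One way to preview what a single map task will be fed (access.log.gz being a hypothetical member of testlogs):

bin/hadoop dfs -cat testlogs/access.log.gz | gunzip | head

If those lines look sane but the job output doesn't, the mismatch is somewhere between the record reader and the script.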

On Tuesday, 18.03.2008, 22:17 +0100, Andreas Kostyrka wrote:
> Hi!
> 
> I'm trying to run a streaming job on Hadoop 0.16.0. I've distributed the
> scripts that the job uses to all nodes:
> 
> time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper ~/dist/workloadmf -reducer NONE -input testlogs/* -output testlogs-output
> [...]