Posted to mapreduce-user@hadoop.apache.org by Dan Young <da...@gmail.com> on 2011/11/03 05:52:13 UTC

Streaming question.

I'm a total newbie at Hadoop and am trying to follow an example (A Useful
Partitioner Class) on the Hadoop Streaming Wiki, but with my own data. I
have data like this:

520460379 1 14067 759015 1142 3 1 8.8
520460380 1 120543 2759354 1142 0 0 0
520460381 3 120543 2759352 1142 0 0 0
520460382 3 12660 679569 1142 0 0 0
520460383 1 120543 2759355 1142 0 0 0
520460384 3 120543 2759353 1142 0 0 0
520460385 1 120575 2759568 1142 0 0 0
520460386 3 120575 2759570 1142 0 0 0
520460387 1 120575 2759569 1142 0 0 0

and I'm trying to run a streaming job that partitions all the keys together
based on field 2 and field 3. So, for example, 1 120543 2759354 and
1 120543 2759355 would go to the same partition, and the output key(s)
would be something like 1.120543. I'm trying the following command but get
an error:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=2 \
-D mapreduce.map.output.key.field.separator=. \
-D mapreduce.partition.keypartitioner.options=-k1,2 \
-D mapreduce.job.reduces=1 \
-input $HOME/temp/foo \
-output dank_phase0 \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner


11/11/02 22:45:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/11/02 22:45:05 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/11/02 22:45:05 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/02 22:45:06 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-dyoung/mapred/local]
11/11/02 22:45:06 INFO streaming.StreamJob: Running job: job_local_0001
11/11/02 22:45:06 INFO streaming.StreamJob: Job running in-process (local Hadoop)
11/11/02 22:45:06 INFO mapred.FileInputFormat: Total input paths to process : 1
11/11/02 22:45:07 INFO mapred.MapTask: numReduceTasks: 1
11/11/02 22:45:07 INFO mapred.MapTask: io.sort.mb = 200
11/11/02 22:45:07 INFO mapred.MapTask: data buffer = 159383552/199229440
11/11/02 22:45:07 INFO mapred.MapTask: record buffer = 524288/655360
11/11/02 22:45:07 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:40)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
11/11/02 22:45:07 INFO streaming.StreamJob:  map 0%  reduce 0%
11/11/02 22:45:07 INFO streaming.StreamJob: Job running in-process (local Hadoop)
11/11/02 22:45:07 ERROR streaming.StreamJob: Job not Successful!
11/11/02 22:45:07 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

I've tried a number of permutations of what's on the Hadoop Wiki, but I'm
still getting the error. Does anyone have any insight into what I'm doing
wrong?
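
A note on the trace above: the exception is raised from IdentityMapper.map,
which passes the input key through unchanged. With the default text input
format that key is the line's byte offset (a LongWritable), while the job
here expects Text map output keys. One possible workaround, sketched below
under assumptions, is to replace the identity mapper with a small script
that builds a text key from fields 2 and 3. The function names
(to_key_value, map_records) are hypothetical, and the sketch assumes the
streaming default of a tab between key and value rather than the "."
separator set in the command.

```python
"""Hypothetical streaming-style mapper logic: emit '<field2>.<field3>' as a text key."""


def to_key_value(line):
    """Return 'key<TAB>value' for one whitespace-separated record.

    The key is field 2 and field 3 joined by '.', the value is the whole
    record. Returns None for lines with fewer than three fields.
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    key = fields[1] + "." + fields[2]
    return key + "\t" + " ".join(fields)


def map_records(lines):
    """Yield one output line per usable input line, skipping short ones."""
    for line in lines:
        out = to_key_value(line)
        if out is not None:
            yield out


# Demo on two records from the sample data in the post.
sample = [
    "520460380 1 120543 2759354 1142 0 0 0\n",
    "520460383 1 120543 2759355 1142 0 0 0\n",
]
for out in map_records(sample):
    print(out)  # both lines start with the key 1.120543
```

In a real job the script would loop over sys.stdin instead of the sample
list and would be passed with -mapper (plus -file to ship it). Whether the
partitioner options in the original command then behave as intended on
0.20.2 is not something this sketch verifies.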

Regards,

Dan

Re: Streaming question.

Posted by Dan Young <da...@gmail.com>.
Praveen,

So is the KeyFieldBasedPartitioner broken in the current releases (0.21 or
0.20.x)? The bug you linked refers to a fix in 0.22. Is there
anywhere I could download 0.22 to try this out?

What I really need to do is have all the keys for a given group
written out to separate part-* files. Is this the correct use of the
KeyFieldBasedPartitioner, and can this be done via Streaming only, or do I
need to write it in Java?
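
On the part-* question: each reduce task writes one part-* file, so the
number of output files follows the number of reducers, and the original
command asks for a single reducer. The sketch below is a toy model, not
Hadoop's code. It picks a partition from the first two "."-separated key
fields using zlib.crc32 as a stand-in hash (the real partitioner uses its
own hash function), to show that records sharing those fields always land
together while different groups can still share a partition. The function
names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers, separator=".", num_key_fields=2):
    """Pick a partition from the first num_key_fields fields of key.

    zlib.crc32 is a deterministic stand-in; it is NOT the hash that
    KeyFieldBasedPartitioner uses.
    """
    prefix = separator.join(key.split(separator)[:num_key_fields])
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


def group_by_partition(keys, num_reducers):
    """Map each partition number to the list of keys assigned to it."""
    parts = defaultdict(list)
    for key in keys:
        parts[partition_for(key, num_reducers)].append(key)
    return dict(parts)


# Keys built from fields 2 and 3 of the sample records in the original post.
keys = ["1.14067", "1.120543", "3.120543", "3.12660", "1.120543",
        "3.120543", "1.120575", "3.120575", "1.120575"]

for n in (1, 4):
    for part, members in sorted(group_by_partition(keys, n).items()):
        print("reducers=%d part-%05d %s" % (n, part, members))
```

With one reducer every key falls into partition 0, which matches a single
output file. With more reducers, equal key prefixes stay together, but
getting exactly one file per group would need at least as many reducers as
groups and a partitioning scheme that never maps two groups to the same
reducer.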

Regards,

Dan

On Wed, Nov 2, 2011 at 11:46 PM, Praveen Sripati
<pr...@gmail.com> wrote:

> Dan,
>
> It is a known bug (https://issues.apache.org/jira/browse/MAPREDUCE-1888)
> that was identified in the 0.21.0 release. Which Hadoop release are you
> using?
>
> Thanks,
> Praveen

Re: Streaming question.

Posted by Dan Young <da...@gmail.com>.
Hello Praveen,

I'm using 0.20.2. I can try it with 0.21 this morning when I get into the
office.

Regards,

Dan
On Nov 2, 2011 11:47 PM, "Praveen Sripati" <pr...@gmail.com> wrote:

> Dan,
>
> It is a known bug (https://issues.apache.org/jira/browse/MAPREDUCE-1888)
> that was identified in the 0.21.0 release. Which Hadoop release are you
> using?
>
> Thanks,
> Praveen

Re: Streaming question.

Posted by Praveen Sripati <pr...@gmail.com>.
Dan,

It is a known bug (https://issues.apache.org/jira/browse/MAPREDUCE-1888)
that was identified in the 0.21.0 release. Which Hadoop release are you
using?

Thanks,
Praveen
