Posted to common-user@hadoop.apache.org by Premal <pr...@gmail.com> on 2011/08/06 00:34:35 UTC
Hadoop order of operations
According to the attached image from Yahoo's Hadoop tutorial, the order
of operations is map > combine > partition, followed by reduce.
Here is an example key emitted by the map operation:
LongValueSum:geo_US|1311722400|E 1
Assuming there are 100 keys of the same type, this should get combined as
geo_US|1311722400|E 100
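For reference, a minimal mapper.py along these lines would produce such output. This is only a sketch: the actual mapper.py is not shown in the thread, and the input column layout here is assumed.

```python
#!/usr/bin/env python
# Hypothetical mapper.py (the real script is not shown in the thread).
# The "LongValueSum:" prefix tells the aggregate combiner/reducer to
# treat the tab-separated value as a long and sum it per key.
import sys

def to_aggregate_line(geo, hour, flag):
    # Build one map-output line: key, then a tab, then the count 1.
    return "LongValueSum:%s|%s|%s\t1" % (geo, hour, flag)

if __name__ == "__main__":
    for line in sys.stdin:
        # Assumed input layout: geo <TAB> hour <TAB> flag
        geo, hour, flag = line.rstrip("\n").split("\t")[:3]
        print(to_aggregate_line(geo, hour, flag))
```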
Then I'd like to partition the keys by the value before the first pipe (|):
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29
geo_US
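As an illustration of that intent (not Hadoop's exact implementation), a partitioner keyed on the field before the first pipe would behave like this:

```python
# Sketch of partitioning on the first pipe-separated field, as
# KeyFieldBasedPartitioner with -k1,1 is meant to do. Hadoop hashes the
# selected key fields; the hash function here is only illustrative.
def partition(key, num_reducers):
    first_field = key.split("|", 1)[0]            # e.g. "geo_US"
    return (hash(first_field) & 0x7FFFFFFF) % num_reducers

keys = ["geo_US|1311722400|E", "geo_US|1311726000|W", "geo_UK|1311722400|E"]
parts = [partition(k, 8) for k in keys]
# Both geo_US keys land in the same partition, so one reducer sees them all.
assert parts[0] == parts[1]
```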
Here's the streaming command
hadoop jar
/usr/local/hadoop/contrib/streaming/hadoop-streaming-0.20.203.0.jar \
-D mapred.reduce.tasks=8 \
-D stream.num.map.output.key.fields=1 \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.map.output.field.separator=\| \
-file mapper.py \
-mapper mapper.py \
-file reducer.py \
-reducer reducer.py \
-combiner org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-input input_file \
-output output_path
This is the error I get:
java.lang.NumberFormatException: For input string: "1311722400|E 1"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:419)
    at java.lang.Long.parseLong(Long.java:468)
    at org.apache.hadoop.mapred.lib.aggregate.LongValueSum.addNextValue(LongValueSum.java:48)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:59)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:35)
    at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1349)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1435)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1297)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
It looks like the partitioner is running before the combiner. Any thoughts?
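One way to see where that string might come from (an assumption worth checking): with stream.map.output.field.separator=| and stream.num.map.output.key.fields=1, the framework would split each map-output line at the first pipe rather than at the tab, handing the combiner a value it cannot parse as a long:

```python
# Sketch: splitting the sample map-output line at the first pipe, per the
# separator settings in the command above. The resulting value is exactly
# the string reported in the NumberFormatException.
line = "LongValueSum:geo_US|1311722400|E 1"
key, value = line.split("|", 1)
assert key == "LongValueSum:geo_US"
assert value == "1311722400|E 1"   # matches the stack trace
try:
    int(value)                      # the Python analogue of Long.parseLong
    parsed = True
except ValueError:
    parsed = False
assert not parsed                   # the same failure the combiner hits
```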
--
View this message in context: http://old.nabble.com/Hadoop-order-of-operations-tp32205781p32205781.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Hadoop order of operations
Posted by Harsh J <ha...@cloudera.com>.
Premal,
Didn't go through your entire thread, but the right order is: "map"
(N) -> "partition" (N) -> "combine" (0…N).
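That order can be sketched with a toy model of the map side (a simplification; real Hadoop partitions, sorts, and combines per spill buffer, and the combiner may run zero or more times):

```python
# Toy model of the map-side pipeline: map output -> partition -> sort ->
# combine, in the order described above. Not Hadoop code.
from collections import defaultdict

def map_side(records, num_reducers):
    buckets = defaultdict(list)
    for key, value in records:                          # map output
        part = (hash(key) & 0x7FFFFFFF) % num_reducers  # partition first
        buckets[part].append((key, value))
    combined = {}
    for part, kvs in buckets.items():                   # combine per partition
        sums = defaultdict(int)
        for k, v in sorted(kvs):                        # after the sort
            sums[k] += v
        combined[part] = dict(sums)
    return combined

records = [("geo_US|1311722400|E", 1)] * 100
out = map_side(records, 8)
# All 100 identical keys collapse to a single combined count of 100.
assert sum(d.get("geo_US|1311722400|E", 0) for d in out.values()) == 100
```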
--
Harsh J