Posted to hdfs-user@hadoop.apache.org by Aleksandr Elbakyan <ra...@yahoo.com> on 2013/01/15 01:23:57 UTC

Issue with partitioning using streaming

Hello All,

I am trying to partition and sort data with Hadoop streaming.


Most of the time the data is partitioned and sorted correctly, but when I run the job multiple times, some records occasionally end up in a different partition.

The data looks like this:

asdas 0 ada
asdas 1 asd
12123 1 ccc
12123 0 xxx
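
To make the expected result concrete, here is a minimal local sketch in plain Python. It assumes the intent is to partition on the first field only and to order each partition by the first field and then by the second field as a number. The function name `expected_layout` and the byte-sum partition function are purely illustrative; Hadoop's KeyFieldBasedPartitioner uses its own hash, so the partition numbers here will not match a real job.

```python
from collections import defaultdict

def expected_layout(lines, num_reducers):
    """Group records by field 1, then sort each group by (field 1, numeric field 2)."""
    partitions = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Illustrative stand-in: any deterministic function of field 1 alone.
        # This is NOT the hash that Hadoop actually uses.
        part = sum(fields[0].encode()) % num_reducers
        partitions[part].append(fields)
    return {
        part: [" ".join(f) for f in sorted(recs, key=lambda f: (f[0], int(f[1])))]
        for part, recs in partitions.items()
    }

sample = ["asdas 0 ada", "asdas 1 asd", "12123 1 ccc", "12123 0 xxx"]
layout = expected_layout(sample, num_reducers=2)
for part in sorted(layout):
    print(part, layout[part])
```

The point is only that every record sharing a first field lands in the same partition on every run, and that records within a partition arrive ordered by the second field.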



  hadoop  jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming.jar \
        -D mapred.task.timeout=3600000 \
        -D mapred.map.tasks=${GD_NUM_MAP_TASKS}  \
        -D mapred.reduce.tasks=${GD_NUM_REDUCE_TASKS} \
        -D stream.non.zero.exit.is.failure=true \
        -D stream.num.map.output.key.fields=2 \
        -D mapred.text.key.partitioner.options="-k1,1" \
        -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
        -D mapred.text.key.comparator.options=-k1,2n \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -input input \
        -output output \
        -mapper  "  cat" \
        -reducer " cat" \
        -verbose


In the reducer code I have some logic that depends on correct partitioning and sorting.
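
One way to confirm the symptom after a run is to compare the output files. This is a rough sketch, assuming each output line keeps the same whitespace-separated layout as the input; the helper names `keys_in_multiple_partitions` and `is_sorted_within` are hypothetical, and the partition names in the example are only labels.

```python
from collections import defaultdict

def keys_in_multiple_partitions(parts):
    """parts maps a partition name to its lines; return first fields seen in more than one."""
    seen = defaultdict(set)
    for name, lines in parts.items():
        for line in lines:
            if line.strip():
                seen[line.split()[0]].add(name)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}

def is_sorted_within(lines):
    """True if the lines are ordered by field 1, then by field 2 as a number."""
    keys = [(f[0], int(f[1])) for f in (l.split() for l in lines if l.strip())]
    return keys == sorted(keys)

# Example with made-up contents: the key "asdas" is split across two partitions.
bad_run = {
    "part-00000": ["12123 0 xxx", "12123 1 ccc", "asdas 1 asd"],
    "part-00001": ["asdas 0 ada"],
}
print(keys_in_multiple_partitions(bad_run))
print(is_sorted_within(bad_run["part-00000"]))
```

An empty result from the first helper on every run would mean the partitioning is stable, and the problem lies elsewhere.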


Regards.