Posted to hdfs-user@hadoop.apache.org by Paul Hubenig <pa...@gmail.com> on 2013/01/08 03:38:27 UTC
streaming secondary sort not working?
hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.*.jar \
-input /export/home/phubenig/fileDataInput \
-output /export/home/phubenig/fileDataOutput \
-mapper /export/home/phubenig/fileDataJob/non_map.py \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
num.key.fields.for.partition=1 \
stream.num.map.output.key.fields=7 \
mapred.text.key.comparator.options="-k1,1 -k2,7" \
mapred.text.key.partitioner.options="-k1,1" \
-file /export/home/phubenig/fileDataInput/fileData.txt
~~~~~~~~~~~~
Input file (tab separated):
C k d m n h b
A w g i w t l
A w f y m y h
C u r d h c b
A y q w m g k
B w b s d q g
C q j j d f b
C l n x a g f
C o r m a g p
C v w l a t f
B c l f n t u
B x t o e x p
A q m r d q i
C e i o u g l
A x m w u o i
A j p m d k r
C s t m r m t
B s w l f k y
B a f r v f x
A s z d v s h
C o x j c w r
The output is sorted on the first field (the capital letters), but the
secondary sort on the remaining fields is not applied. Does anyone see the
problem? What am I missing? It seems like this should work.
Thanks for your time.
Paul
non_map.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    stripped = line.rstrip()
    print(stripped)
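
For comparison, the ordering the secondary sort is expected to produce can be reproduced locally. This is only a minimal sketch, not Hadoop's comparator. It assumes plain text comparison of the tab-separated fields, with field 1 as the primary key and fields 2 through 7 as the secondary key. The name expected_order and the five sample rows (taken from the input file above) are for illustration only.

```python
def expected_order(lines):
    """Sort tab-separated rows by field 1, then by fields 2-7 (text comparison)."""
    # Split each non-empty line into its tab-separated fields.
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # Primary key: field 1. Secondary key: fields 2 through 7, compared in order.
    rows.sort(key=lambda fields: (fields[0], fields[1:7]))
    return ["\t".join(fields) for fields in rows]


# A few rows from the input file, used here as an illustrative sample.
sample = [
    "C\tk\td\tm\tn\th\tb",
    "A\tw\tg\ti\tw\tt\tl",
    "A\tw\tf\ty\tm\ty\th",
    "C\tu\tr\td\th\tc\tb",
    "B\tw\tb\ts\td\tq\tg",
]

for row in expected_order(sample):
    print(row)
```

Comparing this ordering against the job's output files shows which rows within each capital-letter group are out of place.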