Posted to hdfs-user@hadoop.apache.org by Paul Hubenig <pa...@gmail.com> on 2013/01/08 03:38:27 UTC

streaming secondary sort not working?

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.*.jar \
-input /export/home/phubenig/fileDataInput \
-output /export/home/phubenig/fileDataOutput \
-mapper /export/home/phubenig/fileDataJob/non_map.py \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
num.key.fields.for.partition=1 \
stream.num.map.output.key.fields=7 \
mapred.text.key.comparator.options="-k1,1 -k2,7" \
mapred.text.key.partitioner.options="-k1,1" \
-file /export/home/phubenig/fileDataInput/fileData.txt
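For context on stream.num.map.output.key.fields=7: as I read the streaming documentation, the first N tab-separated fields of each mapper output line become the key and the rest becomes the value, so with 7 fields per line the whole line should end up as the key and the value should be empty. The sketch below only illustrates that reading. split_key_value is a hypothetical helper written for this post, not Hadoop's code, and the fallback for short lines is my assumption.

```python
def split_key_value(line, num_key_fields, sep="\t"):
    """Hypothetical helper: split one mapper output line into (key, value).

    Assumption: the first `num_key_fields` fields form the key, and a line
    with no more fields than that is treated entirely as the key.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) <= num_key_fields:
        # Not enough fields to leave anything over: whole line is the key.
        return sep.join(parts), ""
    key = sep.join(parts[:num_key_fields])
    value = sep.join(parts[num_key_fields:])
    return key, value


# One row from the input file, with 7 key fields requested.
key, value = split_key_value("C\tk\td\tm\tn\th\tb\n", 7)
print(repr(key), repr(value))
```

If that reading is right, all seven fields are available to the comparator as part of the key.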

~~~~~~~~~~~~

Input file (tab separated):

C k d m n h b
A w g i w t l
A w f y m y h
C u r d h c b
A y q w m g k
B w b s d q g
C q j j d f b
C l n x a g f
C o r m a g p
C v w l a t f
B c l f n t u
B x t o e x p
A q m r d q i
C e i o u g l
A x m w u o i
A j p m d k r
C s t m r m t
B s w l f k y
B a f r v f x
A s z d v s h
C o x j c w r



The job sorts on the first key field (the capital letters) but does not
perform the secondary sort on the remaining fields (2 through 7).  Does
anyone see the problem?  What am I missing?  It seems like it should work.
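To make the expected result concrete, here is a small local sketch of the ordering I am after, using a few rows from the input above. It is plain Python, not Hadoop. sort_key is a hypothetical helper, and comparing the fields as a tuple is only an approximation of what "-k1,1 -k2,7" should do for these single-character fields.

```python
# A handful of rows from the input file (tab separated).
rows = [
    "C\tk\td\tm\tn\th\tb",
    "A\tw\tg\ti\tw\tt\tl",
    "A\tw\tf\ty\tm\ty\th",
    "B\tw\tb\ts\td\tq\tg",
    "A\tq\tm\tr\td\tq\ti",
    "B\tc\tl\tf\tn\tt\tu",
]


def sort_key(line):
    """Hypothetical helper: primary key is field 1, secondary is fields 2-7."""
    fields = line.split("\t")
    return (fields[0], tuple(fields[1:7]))


expected = sorted(rows, key=sort_key)
for line in expected:
    print(line)
```

Within each capital letter the rows should come out ordered by the second field, then the third, and so on. That is the part the job output is missing.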



Thanks for your time.



Paul



non_map.py:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    stripped = line.rstrip()
    print(stripped)