Posted to common-user@hadoop.apache.org by Kelly Burkhart <ke...@gmail.com> on 2011/02/10 18:45:37 UTC

Map reduce streaming unable to partition

Hi,

I'm trying to get partitioning working from a streaming map/reduce
job.  I'm using hadoop r0.20.2.

Consider the following files, both in the same hdfs directory:

f1:
01:01:01<TAB>a,a,a,a,a,1
01:01:02<TAB>a,a,a,a,a,2
01:02:01<TAB>a,a,a,a,a,3
01:02:02<TAB>a,a,a,a,a,4
02:01:01<TAB>a,a,a,a,a,5
02:01:02<TAB>a,a,a,a,a,6
02:02:01<TAB>a,a,a,a,a,7
02:02:02<TAB>a,a,a,a,a,8

f2:
01:01:01<TAB>b,b,b,b,b,1
01:01:02<TAB>b,b,b,b,b,2
01:02:01<TAB>b,b,b,b,b,3
01:02:02<TAB>b,b,b,b,b,4
02:01:01<TAB>b,b,b,b,b,5
02:01:02<TAB>b,b,b,b,b,6
02:02:01<TAB>b,b,b,b,b,7
02:02:02<TAB>b,b,b,b,b,8

I execute the following command:

hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D stream.map.output.field.separator=: \
  -D stream.num.map.output.key.fields=3 \
  -D map.output.key.field.separator=: \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -input /tmp/krb/part \
  -output /tmp/krb/mp \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

(actually I've executed about a zillion permutations of various -D arguments...)

I end up with a single file sorted by the entire key, exactly what I'd
expect if no partitioning at all were going on.  What I'm hoping to end
up with is two output files, each holding the records that share the
first component of the key.  For example, the file for keys starting
with 01 would contain:

01:01:01<TAB>a,a,a,a,a,1
01:01:01<TAB>b,b,b,b,b,1
01:01:02<TAB>a,a,a,a,a,2
01:01:02<TAB>b,b,b,b,b,2
01:02:01<TAB>a,a,a,a,a,3
01:02:01<TAB>b,b,b,b,b,3
01:02:02<TAB>a,a,a,a,a,4
01:02:02<TAB>b,b,b,b,b,4

Can anyone suggest a command that may partition files as I describe?

Also, it seems that the API has changed considerably between my version,
0.20.x, and the latest version, r0.21.  Is 0.20 expected to work?  Or
were there some fatal issues that forced the major work resulting in
release 0.21?

Thanks,

-Kelly

Re: Map reduce streaming unable to partition

Posted by Kelly Burkhart <ke...@gmail.com>.
OK, I think I stumbled upon the correct incantation:

time hadoop jar /opt/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D map.output.key.field.separator=: \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D mapred.reduce.tasks=16 \
  -input /tmp/krb/part \
  -output /tmp/krb/mp \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

This partitions and sorts the files as I expect, leaving me with 16
output files, 14 of which are empty and 2 non-empty.  If I increase
the number of distinct partition keys in the data so that they exceed
the number of reduce tasks, multiple partitions will be written to some
or all of the output files.  I believe I can deal with that now that I
understand it, but it would be nice if the number of output files were
equal to the number of partitions in the data.

-K
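
The behaviour described above (one output file with the first command, two
non-empty files out of 16 with the second, and shared files once the keys
outnumber the reduce tasks) follows from the partitioner choosing a reducer
as "hash of the selected key fields, modulo the number of reduce tasks".
Below is a minimal Python sketch of that idea. It assumes the partitioner
uses a Java-style 31-multiplier hash over the bytes of the selected field;
the function names (key_field_hash, partition, used_partitions) are
hypothetical and exist only for this illustration, and none of this is the
actual Hadoop source.

```python
# Sketch only: models "-k1,1" with ":" as the key field separator.
# Assumption: a Java-style 31-multiplier hash over the field's bytes.

def key_field_hash(field: str) -> int:
    """Hash the field's bytes, wrapping to 32 bits as Java int arithmetic would."""
    h = 0
    for b in field.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF
    return h


def partition(key: str, num_reduce_tasks: int, sep: str = ":") -> int:
    """Pick a reducer index from the first key field only (like -k1,1)."""
    first_field = key.split(sep)[0]
    # Clear the sign bit, then take the remainder, so the index is never negative.
    return (key_field_hash(first_field) & 0x7FFFFFFF) % num_reduce_tasks


def used_partitions(keys, num_reduce_tasks):
    """Return the sorted list of reducer indexes that receive at least one key."""
    return sorted({partition(k, num_reduce_tasks) for k in keys})


if __name__ == "__main__":
    sample_keys = ["01:01:01", "01:02:02", "02:01:01", "02:02:02"]
    # One reduce task: every key lands in the same (only) partition.
    print(used_partitions(sample_keys, 1))
    # Sixteen reduce tasks: two distinct first fields use two partitions.
    print(used_partitions(sample_keys, 16))
    # Three distinct first fields but two reduce tasks: some must share.
    print(used_partitions(["01:01:01", "02:01:01", "03:01:01"], 2))
```

Under these assumptions, a job with a single reduce task can only ever
produce one output file regardless of the partitioner options, which is
consistent with the first command not setting mapred.reduce.tasks. The
number of output files is fixed by the number of reduce tasks, not by the
number of distinct keys, so empty files and shared files are both expected.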
