Posted to common-user@hadoop.apache.org by Alan Drew <dr...@yahoo.com> on 2009/05/13 04:55:13 UTC

hadoop streaming reducer values

Hi,

I have a question about the <key, values> that the reducer gets in Hadoop
Streaming.

I wrote simple mapper.sh and reducer.sh scripts:

mapper.sh : 

#!/bin/bash

while read -r data
do
  # tokenize the line and output <word, 1> pairs, one per line, tab-separated
  echo "$data" | awk '{token=0; while(++token<=NF) print $token"\t1"}'
done

reducer.sh :

#!/bin/bash

while read -r data
do
  # echo the line unchanged; quoting preserves the tab between key and value
  echo "$data"
done

The mapper tokenizes a line of input and outputs <word, 1> pairs to standard
output.  The reducer just outputs what it gets from standard input.
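
For reference, I launch the job with something like this (the streaming jar
path and the input/output names here are illustrative; adjust for your
install and Hadoop version):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input input.txt \
  -output wordcount-out \
  -mapper mapper.sh \
  -reducer reducer.sh \
  -file mapper.sh \
  -file reducer.sh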

I have a simple input file:

cat in the hat
ate my mat the

I was expecting the final output to be something like:

the 1 1 1 
cat 1

etc.

but instead each word has its own line, which makes me think that the
reducer is being given individual <key, value> pairs rather than
<key, values>, which is the default in normal (Java) Hadoop, right?

the 1
the 1
the 1
cat 1

Is there any way to get <key, values> for the reducer instead of a bunch of
<key, value> pairs?  I looked into the -reducer aggregate option, but there
doesn't seem to be a way to customize what the reducer does with the <key,
values> beyond built-in functions such as max and min.

Thanks.
-- 
View this message in context: http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: hadoop streaming reducer values

Posted by jason hadoop <ja...@gmail.com>.
You may wish to set the separator to the string comma-space ', ' for your
example. Chapter 7 of my book goes into this in some detail, and about a
month ago I posted a graphic that visually depicts the process and the
values, in a thread titled 'Changing key/value separator in hadoop
streaming'. I have attached the graphic.
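
As a rough sketch (the property names below are from the streaming
documentation of this era and are worth verifying against your release),
the separators can be set with -D options when submitting the job:

# hypothetical invocation; jar path and property names may vary by version
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D stream.map.output.field.separator=', ' \
  -D stream.reduce.output.field.separator=', ' \
  -input input.txt \
  -output wordcount-out \
  -mapper mapper.sh \
  -reducer reducer.sh \
  -file mapper.sh \
  -file reducer.sh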


On Tue, May 12, 2009 at 7:55 PM, Alan Drew <dr...@yahoo.com> wrote:

> [snip: original message quoted in full above]


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Re: hadoop streaming reducer values

Posted by jason hadoop <ja...@gmail.com>.
Thanks, Chuck. I didn't read the post carefully and focused on the commas.


On Wed, May 13, 2009 at 2:38 PM, Chuck Lam <ch...@gmail.com> wrote:

> [snip: Chuck's reply, quoted in full; see his message below]



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals

Re: hadoop streaming reducer values

Posted by Chuck Lam <ch...@gmail.com>.
The behavior you saw in Streaming (list of <key,value> instead of <key, list
of values>) is indeed intentional, and it's part of the design differences
between Streaming and Hadoop Java. That is, in Streaming your reducer is
responsible for "grouping" values of the same key, whereas in Java the
grouping is done for you.

However, the input to your reducer is still sorted (and partitioned) on the
key, so all key/value pairs of the same key will arrive at your reducer in
one contiguous chunk. Your reducer can keep a last_key variable to track
whether all records of the same key have been read in. In Python a reducer
that sums up all values of a key is like this:

#!/usr/bin/env python

import sys

(last_key, total) = (None, 0.0)

for line in sys.stdin:
    # each input line is "key<tab>value"
    (key, val) = line.rstrip("\n").split("\t", 1)

    # the input is sorted by key, so a key change means the previous
    # key's group is complete and its sum can be emitted
    if last_key and last_key != key:
        print last_key + "\t" + str(total)
        total = 0.0

    last_key = key
    total += float(val)

# emit the last group (skipped if the input was empty)
if last_key:
    print last_key + "\t" + str(total)
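
Since your original scripts are in bash, the same last_key pattern can be
sketched in bash as well (a rough sketch assuming tab-separated <word, 1>
input with integer counts; adapt as needed):

#!/bin/bash

last_key=""
total=0
while IFS=$'\t' read -r key val
do
  # a key change means the previous key's group is complete
  if [ -n "$last_key" ] && [ "$key" != "$last_key" ]; then
    printf '%s\t%s\n' "$last_key" "$total"
    total=0
  fi
  last_key=$key
  total=$((total + val))
done
# emit the last group, if any input was read
if [ -n "$last_key" ]; then
  printf '%s\t%s\n' "$last_key" "$total"
fi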


Streaming is covered in all 3 upcoming Hadoop books. The Python example
above is from mine ;)  http://www.manning.com/lam/ . Tom White has the
Definitive Guide from O'Reilly - http://www.hadoopbook.com/ . Jason has
http://www.apress.com/book/view/9781430219422





On Tue, May 12, 2009 at 7:55 PM, Alan Drew <dr...@yahoo.com> wrote:

> [snip: original message quoted in full above]