Posted to common-user@hadoop.apache.org by Jack Stahl <ja...@yelp.com> on 2009/02/04 04:49:38 UTC

Value-Only Reduce Output

Hello,

I'm interested in a map-reduce flow where I output only values (no keys) in
my reduce step.  For example, imagine the canonical word-counting program
where I'd like my output to be an unlabeled histogram of counts instead of
(word, count) pairs.

I'm using HadoopStreaming (specifically, I'm using the dumbo module to run
my python scripts).  When I simulate the map reduce using pipes and sort in
bash, it works fine.  However, in Hadoop, if I output a value with no tabs,
Hadoop appends a trailing "\t", apparently interpreting my output as a
(value, "") KV pair.  I'd like to avoid outputting this trailing tab if
possible.

Is there a command line option that could be used to achieve this?  More
generally, is there something wrong with outputting arbitrary strings,
instead of key-value pairs, in your reduce step?
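
For concreteness, a reducer of the kind described above might look like the following minimal Python sketch. It assumes the mapper emits "word<TAB>1" lines and that the input arrives sorted by word; the name value_only_reduce is hypothetical and is not part of dumbo or Hadoop.

```python
from itertools import groupby


def value_only_reduce(lines):
    """Yield one count per distinct word, dropping the word itself.

    Assumes each input line is "word<TAB>count" and that lines are
    already sorted by word, as the shuffle (or `sort` in bash) provides.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for _word, group in groupby(pairs, key=lambda kv: kv[0]):
        # Emit only the total; the output line contains no tab at all.
        yield str(sum(int(count) for _, count in group))


# Small in-memory demonstration; in a streaming job the lines would
# come from sys.stdin and each total would be printed to stdout.
sample = ["apple\t1\n", "apple\t1\n", "pear\t1\n"]
for total in value_only_reduce(sample):
    print(total)
```

Each printed line is a bare count with no tab, which is exactly the case that picks up the trailing "\t" under Hadoop.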

Re: Value-Only Reduce Output

Posted by Jack Stahl <ja...@yelp.com>.
My (0.18.2) reduce src looks like this:

          write(key);
          clientOut_.write('\t');
          write(val);
          clientOut_.write('\n');

which explains why the trailing tab is unavoidable in 0.18.2: the tab
separator is hard-coded.

Thanks for your help, though, Jason!
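
If the tab cannot be suppressed inside the job, one possible workaround is to strip it afterwards, for example while reading the output files. This is only a sketch of that idea; strip_trailing_tab is a hypothetical helper, not something provided by Hadoop or dumbo, and it assumes at most one trailing tab per line.

```python
def strip_trailing_tab(line):
    """Return the line without its newline and without one trailing tab.

    Lines that hold a real key/value pair (a tab in the middle) are
    left unchanged apart from the newline.
    """
    body = line.rstrip("\n")
    if body.endswith("\t"):
        body = body[:-1]
    return body


# Example: clean a few lines as they might appear in a part file.
for raw in ["17\t\n", "4\t\n", "word\t3\n"]:
    print(strip_trailing_tab(raw))
```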

Re: Value-Only Reduce Output

Posted by jason hadoop <ja...@gmail.com>.
For your reduce, the parameter is stream.reduce.input.field.separator, if
you are supplying a reduce class and I believe the output format is
TextOutputFormat...

It looks like you have tried the map parameter for the separator, not the
reduce parameter.

From 0.19.0 PipeReducer:
configure:
      reduceOutFieldSeparator =
          job_.get("stream.reduce.output.field.separator", "\t").getBytes("UTF-8");
      reduceInputFieldSeparator =
          job_.get("stream.reduce.input.field.separator", "\t").getBytes("UTF-8");
      this.numOfReduceOutputKeyFields =
          job_.getInt("stream.num.reduce.output.key.fields", 1);

getInputSeparator:
  byte[] getInputSeparator() {
    return reduceInputFieldSeparator;
  }

reduce:
          write(key);
          clientOut_.write(getInputSeparator());  // (emphasis in original)
          write(val);
          clientOut_.write('\n');
        } else {
          // "identity reduce"
          output.collect(key, val);  // (emphasis in original)
        }
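
To make the excerpt easier to follow, here is a rough Python model of the framing it performs: key, separator, value, newline, with the separator read from the job configuration and defaulting to a tab. The function frame_record and the plain dict standing in for the job configuration are illustrative assumptions; only the property name and the default come from the excerpt above.

```python
def frame_record(key, value, conf):
    """Model of the excerpt: write key, separator, value, newline.

    `conf` is a plain dict standing in for the job configuration.
    The separator defaults to a tab when the property is absent.
    """
    separator = conf.get("stream.reduce.input.field.separator", "\t")
    return key + separator.encode("utf-8") + value + b"\n"


# Default configuration: tab-separated record.
print(frame_record(b"word", b"3", {}))
# Property set explicitly: the configured separator is used instead.
print(frame_record(b"word", b"3", {"stream.reduce.input.field.separator": ","}))
```

As the property name suggests, this separator appears to govern the records handed to the reduce step, which may be why changing it did not affect the tab in the final output.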


Re: Value-Only Reduce Output

Posted by Rasit OZDAS <ra...@gmail.com>.
I tried it myself; it doesn't work.
I've also tried the stream.map.output.field.separator and
map.output.key.field.separator parameters for this purpose; they
don't work either. When Hadoop sees an empty string, it uses the default tab
character instead.

Rasit

--
M. Raşit ÖZDAŞ

Re: Value-Only Reduce Output

Posted by jason hadoop <ja...@gmail.com>.
Oops, you are using streaming, and I am not familiar with it.
As a terrible hack, you could set mapred.textoutputformat.separator to the
empty string in your configuration.

Re: Value-Only Reduce Output

Posted by jason hadoop <ja...@gmail.com>.
If you are using the standard TextOutputFormat, and the output collector is
passed a null for the value, there will not be a trailing tab character
added to the output line.

output.collect( key, null );
will give you the behavior you are looking for if your configuration is as I
expect.
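
The difference between a null value and an empty value can be illustrated with a small Python model of the behavior described above. This is a sketch of the stated behavior, not Hadoop code: format_record is a hypothetical function, and None stands in for a Java null.

```python
def format_record(key, value, separator="\t"):
    """Model a text output line for one (key, value) record.

    A value of None (standing in for null) suppresses the separator;
    an empty-string value does not, so the separator still appears.
    """
    if value is None:
        return key + "\n"
    return key + separator + value + "\n"


print(repr(format_record("42", None)))   # null value: no trailing tab
print(repr(format_record("42", "")))     # empty value: trailing tab remains
print(repr(format_record("word", "3")))  # ordinary key/value pair
```

The second case corresponds to the (value, "") pair from the original question: the value is empty rather than null, so the separator is still written.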
