Posted to common-user@hadoop.apache.org by Yuri Pradkin <yu...@isi.edu> on 2008/02/12 21:21:26 UTC
key/value after reduce
Hi,
I'm relatively new to Hadoop and I have what I hope is a simple
question:
I don't understand why the key/value assumption is preserved AFTER the
reduce operation, in other words why the output of a reducer is
expected as <key,value> instead of arbitrary, possibly binary bytes?
Why can't OutputCollector just give those raw bytes to the
RecordWriter and have it make sense of them as it pleases, or just
dump them to a file?
This seems like an unnecessary restriction to me, at least at first
glance.
Thanks,
-Yuri
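(For context: with Hadoop's default TextOutputFormat, the RecordWriter does receive the pair and simply emits one key<TAB>value line per record. A standalone sketch of that behaviour, with no Hadoop dependency; the class names are mine:)

```java
import java.io.StringWriter;

// Standalone mimic of what Hadoop's default TextOutputFormat does with
// each reduce output pair: one "key<TAB>value" line per record.
class LineWriterSketch {
    private final StringWriter out = new StringWriter();

    void write(String key, String value) {
        out.write(key);
        out.write('\t');
        out.write(value);
        out.write('\n');
    }

    String contents() {
        return out.toString();
    }
}

class LineWriterSketchDemo {
    public static void main(String[] args) {
        LineWriterSketch w = new LineWriterSketch();
        w.write("hadoop", "3");
        System.out.print(w.contents());
    }
}
```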
Re: key/value after reduce
Posted by Fernando Padilla <fe...@alum.mit.edu>.
Well.. I'm no Hadoop expert, but let me brainstorm for a little bit:
Aren't there Output classes that take a key/value pair as input and
then decide how and what to actually output? That's how you can direct
the output straight to HBase, etc.
You could create (and Hadoop should probably include by default) a
ValueOutputEncoder: all it does is output the values, ignoring the key
part. Thus you get what you want, without requiring a key/value pair
output.
You could even have an outputter that takes an InputStream as the value
part, so that it could stream the output.. possibly?
How far off is this idea?
There is also nothing holding you back from having your Reducer write
directly to another data store. Then the "output" of the reduce job
would be empty, or, for debugging, maybe the content-length of what it
put in a different file.. :)
But keep in mind, I think the BIG idea behind Hadoop is divide and
conquer. That means arbitrarily cut up input, transform it once, sort,
transform it once more, output. But the idea is that this should
hopefully support N different output files. I am guessing the key/value
pair arrangement gives those output files context and meaning, or you
wouldn't be able to conceptually put them back together into a coherent
collection of data.
I just remembered: you can force the job to use only one reduce task,
and thus get only one output file, but that won't scale perfectly.. :)
For your purposes, you could have M map tasks, one reduce task, and a
ValueOutputEncoder that ignores the key part and only spits out a binary
file.. :)
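A minimal sketch of that ValueOutputEncoder idea: a writer that discards the key and appends the raw value bytes as-is. (The class is hypothetical and has no Hadoop dependency; a real one would implement Hadoop's RecordWriter interface inside a custom OutputFormat.)

```java
import java.io.ByteArrayOutputStream;

// Hypothetical "value-only" writer: the key is ignored and the value
// bytes are appended verbatim, with no separators or framing added.
class ValueOnlyWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    void write(Object ignoredKey, byte[] value) {
        out.write(value, 0, value.length);
    }

    byte[] contents() {
        return out.toByteArray();
    }
}
```

The output is then exactly the concatenation of the values, i.e. the arbitrary binary bytes the original question asked for.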
Yuri Pradkin wrote:
> But OTOH, if I wanted my reducer to write binary output, I'd be
> screwed, especially so in the streaming world (where I'd like to stay
> for the moment).
>
> Actually, I don't think I understand your point: if the reducer's
> output is in a key/value format, you still can run another map over it
> or another reduce, can't you? If the output isn't, you can't; it's up
> to the user who coded up the Reducer. What am I missing?
>
> Thanks,
>
> -Yuri
>
> On Tue 12 2008, Miles Osborne wrote:
>> You may well have another Map operation operate over the Reducer
>> output, in which case you'd want key-value pairs.
>>
>> Miles
>>
>> On 12/02/2008, Yuri Pradkin <yu...@isi.edu> wrote:
>>> Hi,
>>>
>>> I'm relatively new to Hadoop and I have what I hope is a simple
>>> question:
>>>
>>> I don't understand why the key/value assumption is preserved AFTER
>>> the reduce operation, in other words why the output of a reducer
>>> is expected as <key,value> instead of arbitrary, possibly binary
>>> bytes? Why can't OutputCollector just give those raw bytes to the
>>> RecordWriter and have it make sense of them as it pleases, or just
>>> dump them to a file?
>>>
>>> This seems like an unnecessary restriction to me, at least at the
>>> first glance.
>>>
>>> Thanks,
>>>
>>> -Yuri
>
>
Re: key/value after reduce
Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
Actually, you could easily serialise your binary data and still continue
to use key-value pairs in the reducer output.
But returning to the question of whether the reducer's output can depart
from key-value pairs: right now, I'm not sure what the answer is. My
guess would be "yes" (it is open source, after all, so you could change
the reducer class if necessary!). Whether you'd want to is a different
matter. For example, if the key is redundant, you'd just emit a constant
key per item and ignore it when you came to deal with the output later.
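One way to "serialise your binary" and keep key-value pairs is to base64-encode the bytes so they survive a text-oriented format. (Sketch only; java.util.Base64 is a modern-Java convenience used here for illustration — in 2008 you would reach for a codec library instead.)

```java
import java.util.Base64;

// Arbitrary bytes (which may contain the tabs and newlines that would
// break a text key/value format) are base64-encoded into a safe string
// value, then decoded back when the output is consumed downstream.
class BinaryAsTextValue {
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    static byte[] decode(String value) {
        return Base64.getDecoder().decode(value);
    }
}
```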
Miles
On 12/02/2008, Yuri Pradkin <yu...@isi.edu> wrote:
>
> But OTOH, if I wanted my reducer to write binary output, I'd be
> screwed, especially so in the streaming world (where I'd like to stay
> for the moment).
>
> Actually, I don't think I understand your point: if the reducer's
> output is in a key/value format, you still can run another map over it
> or another reduce, can't you? If the output isn't, you can't; it's up
> to the user who coded up the Reducer. What am I missing?
>
> Thanks,
>
> -Yuri
>
> On Tue 12 2008, Miles Osborne wrote:
> > You may well have another Map operation operate over the Reducer
> > output, in which case you'd want key-value pairs.
> >
> > Miles
> >
> > On 12/02/2008, Yuri Pradkin <yu...@isi.edu> wrote:
> > > Hi,
> > >
> > > I'm relatively new to Hadoop and I have what I hope is a simple
> > > question:
> > >
> > > I don't understand why the key/value assumption is preserved AFTER
> > > the reduce operation, in other words why the output of a reducer
> > > is expected as <key,value> instead of arbitrary, possibly binary
> > > bytes? Why can't OutputCollector just give those raw bytes to the
> > > RecordWriter and have it make sense of them as it pleases, or just
> > > dump them to a file?
> > >
> > > This seems like an unnecessary restriction to me, at least at the
> > > first glance.
> > >
> > > Thanks,
> > >
> > > -Yuri
>
>
>
--
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
Re: key/value after reduce
Posted by Yuri Pradkin <yu...@isi.edu>.
But OTOH, if I wanted my reducer to write binary output, I'd be
screwed, especially so in the streaming world (where I'd like to stay
for the moment).
Actually, I don't think I understand your point: if the reducer's
output is in a key/value format, you still can run another map over it
or another reduce, can't you? If the output isn't, you can't; it's up
to the user who coded up the Reducer. What am I missing?
Thanks,
-Yuri
On Tue 12 2008, Miles Osborne wrote:
> You may well have another Map operation operate over the Reducer
> output, in which case you'd want key-value pairs.
>
> Miles
>
> On 12/02/2008, Yuri Pradkin <yu...@isi.edu> wrote:
> > Hi,
> >
> > I'm relatively new to Hadoop and I have what I hope is a simple
> > question:
> >
> > I don't understand why the key/value assumption is preserved AFTER
> > the reduce operation, in other words why the output of a reducer
> > is expected as <key,value> instead of arbitrary, possibly binary
> > bytes? Why can't OutputCollector just give those raw bytes to the
> > RecordWriter and have it make sense of them as it pleases, or just
> > dump them to a file?
> >
> > This seems like an unnecessary restriction to me, at least at the
> > first glance.
> >
> > Thanks,
> >
> > -Yuri
Re: key/value after reduce
Posted by Ted Dunning <td...@veoh.com>.
But that map will have to read the file again (and is likely to want a
different key than the reduce produces).
On 2/12/08 12:33 PM, "Miles Osborne" <mi...@inf.ed.ac.uk> wrote:
> You may well have another Map operation operate over the Reducer output, in
> which case you'd want key-value pairs.
Re: key/value after reduce
Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
You may well have another Map operation operate over the Reducer output, in
which case you'd want key-value pairs.
Miles
On 12/02/2008, Yuri Pradkin <yu...@isi.edu> wrote:
>
> Hi,
>
> I'm relatively new to Hadoop and I have what I hope is a simple
> question:
>
> I don't understand why the key/value assumption is preserved AFTER the
> reduce operation, in other words why the output of a reducer is
> expected as <key,value> instead of arbitrary, possibly binary bytes?
> Why can't OutputCollector just give those raw bytes to the
> RecordWriter and have it make sense of them as it pleases, or just
> dump them to a file?
>
> This seems like an unnecessary restriction to me, at least at the first
> glance.
>
> Thanks,
>
> -Yuri
>
--
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
Re: key/value after reduce
Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 12, 2008, at 12:21 PM, Yuri Pradkin wrote:
> I don't understand why the key/value assumption is preserved AFTER the
> reduce operation, in other words why the output of a reducer is
> expected as <key,value> instead of arbitrary, possibly binary bytes?
Most users don't want to worry about the serialization of the output
inside the reduce. That is better left to the output format, which
already does the record layout.
That said, you could still do it quite easily. Just have the reduce
output BytesWritable keys and values and have the OutputFormat write
them instead.
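A standalone sketch of that suggestion: the reduce hands over raw bytes (a plain byte[] stands in for BytesWritable here), and a custom writer dumps the key bytes then the value bytes with no framing at all. (Class names are mine; no Hadoop dependency.)

```java
import java.io.ByteArrayOutputStream;

// Stand-in for an OutputFormat's record writer that treats both halves
// of the pair as opaque bytes and writes them verbatim, back to back.
class RawBytesWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    void write(byte[] keyBytes, byte[] valueBytes) {
        out.write(keyBytes, 0, keyBytes.length);
        out.write(valueBytes, 0, valueBytes.length);
    }

    byte[] contents() {
        return out.toByteArray();
    }
}
```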
-- Owen
Re: key/value after reduce
Posted by Ted Dunning <td...@veoh.com>.
Welcome to the club.
The good news is that if you give the output collector a null key, it will
just output the data in the value argument and ignore the key entirely.
Occasionally, the distinction is useful to avoid constructing yet another
temporary data structure to hold a tuple. Word counting is the canonical
example for this where outputting the word and the count naturally fits the
API of the collector.
You are right, however, that the key doesn't serve any key-like function
when it comes out of the reduce.
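That null-key behaviour, mimicked standalone (sketch only, no Hadoop dependency): when the key is null, the line writer emits the value alone, with no key and no tab separator.

```java
import java.io.StringWriter;

// Mimics the text output format's null-key handling: a null key means
// "just write the value", so no key and no tab separator are emitted.
class NullKeyAwareWriter {
    private final StringWriter out = new StringWriter();

    void write(String key, String value) {
        if (key != null) {
            out.write(key);
            out.write('\t');
        }
        out.write(value);
        out.write('\n');
    }

    String contents() {
        return out.toString();
    }
}
```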
On 2/12/08 12:21 PM, "Yuri Pradkin" <yu...@isi.edu> wrote:
> I don't understand why the key/value assumption is preserved AFTER the
> reduce operation, in other words why the output of a reducer is
> expected as <key,value> instead of arbitrary, possibly binary bytes?