Posted to common-user@hadoop.apache.org by Leo Alekseyev <dn...@gmail.com> on 2010/07/21 10:59:27 UTC

Is it possible to use NullWritable in combiner? + general question about combining output from many small maps

Hi All,
I have a job where all processing is done by the mappers, but each
mapper produces a small file, which I want to combine into 3-4 large
ones.  In addition, I only care about the values, not the keys, so
NullWritable key is in order.  I tried using the default reducer
(which according to the docs is identity) by setting
job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
using a NullWritable key on the mapper output.  However, this seems to
concentrate the work on one reducer only.  I then tried to output
LongWritable as the mapper key, and write a combiner to output
NullWritable (i.e. class GenerateLogLineProtoCombiner extends
Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
ProtobufLineMsgWritable>); still using the default reducer.  This gave
me the following error thrown by the combiner:

10/07/21 01:21:38 INFO mapred.JobClient: Task Id :
attempt_201007122205_1058_m_000104_2, Status : FAILED
java.io.IOException: wrong key class: class
org.apache.hadoop.io.NullWritable is not class
org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
          .........

I was able to get things working by explicitly putting in an identity
reducer that takes (LongWritable key, value) and outputs
(NullWritable, value).  However, now most of my processing is in the
reduce phase, which seems like a waste -- it's copying and sorting
data, but all I really need is to "glue" together the small map
outputs.
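
For concreteness, the workaround reducer is roughly the following (a
minimal sketch; the class name is just illustrative, and Text stands in
here for my actual value writable, ProtobufLineMsgWritable):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Identity-style reducer: drops the LongWritable key and re-emits every
// value under a NullWritable key.
public class DropKeyReducer
    extends Reducer<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values,
                        Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(NullWritable.get(), value);
    }
  }
}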

Thus, my questions are: I don't really understand why the combiner is
throwing an error here.  Does it simply not allow NullWritables on the
output?...
The second question is -- is there a standard strategy for quickly
combining the many small map outputs?  Is it perhaps worth looking
into adjusting the min split size for the mappers?.. (can this value
be adjusted dynamically based on the input file size?..)

Thanks to anyone who can give me some pointers :)
--Leo

Re: Is it possible to use NullWritable in combiner? + general question about combining output from many small maps

Posted by Soumya Banerjee <so...@gmail.com>.
Hi,

I think the key/value types have to be the same for the combiner and the
reducer, since both have to match the map output types.
Thus you get the error when you try to use different key types for the
combiner and the reducer.

To have the output of multiple maps processed by one reducer, you need to
emit a key from the mapper and then write a custom partitioner that
partitions on that key and sends all the map outputs with a given key to
the same reducer.

Say you want three map outputs to be processed by one reducer: have all of
them emit the same key (it can be anything).
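
A rough sketch of what I mean (the key/value types below are just
placeholders for your actual map output types, and the class name is
made up):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each record to one of a small, fixed number of reducers based on
// the key emitted by the mapper.
public class FixedBucketPartitioner extends Partitioner<LongWritable, Text> {
  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    // Mask out the sign bit so the result is never negative, then bucket
    // by key: records with the same key always land on the same reducer.
    return (int) ((key.get() & Long.MAX_VALUE) % numPartitions);
  }
}

and then something like job.setPartitionerClass(FixedBucketPartitioner.class)
together with job.setNumReduceTasks(3) on the job.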

Regards,
Soumya.

On Wed, Jul 21, 2010 at 11:50 PM, Alex Kozlov <al...@cloudera.com> wrote:

> Hi Leo,
>
> I am confused: how do you want to partition the work between multiple
> reducers if the map emitted key is NULL?  If you don't, say you want to
> reduce everything in one reducer, then the key type/value should not
> matter:
> just emit a constant of any type and discard it later on.
>
> Alex K
>
> On Wed, Jul 21, 2010 at 1:59 AM, Leo Alekseyev <dn...@gmail.com> wrote:
>
> > Hi All,
> > I have a job where all processing is done by the mappers, but each
> > mapper produces a small file, which I want to combine into 3-4 large
> > ones.  In addition, I only care about the values, not the keys, so
> > NullWritable key is in order.  I tried using the default reducer
> > (which according to the docs is identity) by setting
> > job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
> > using a NullWritable key on the mapper output.  However, this seems to
> > concentrate the work on one reducer only.  I then tried to output
> > LongWritable as the mapper key, and write a combiner to output
> > NullWritable (i.e. class GenerateLogLineProtoCombiner extends
> > Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
> > ProtobufLineMsgWritable>); still using the default reducer.  This gave
> > me the following error thrown by the combiner:
> >
> > 10/07/21 01:21:38 INFO mapred.JobClient: Task Id :
> > attempt_201007122205_1058_m_000104_2, Status : FAILED
> > java.io.IOException: wrong key class: class
> > org.apache.hadoop.io.NullWritable is not class
> > org.apache.hadoop.io.LongWritable
> >        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
> >          .........
> >
> > I was able to get things working by explicitly putting in an identity
> > reducer that takes (LongWritable key, value) and outputs
> > (NullWritable, value).  However, now most of my processing is in the
> > reduce phase, which seems like a waste -- it's copying and sorting
> > data, but all I really need is to "glue" together the small map
> > outputs.
> >
> > Thus, my questions are: I don't really understand why the combiner is
> > throwing an error here.  Does it simply not allow NullWritables on the
> > output?...
> > The second question is -- is there a standard strategy for quickly
> > combining the many small map outputs?  Is it worth, perhaps, to look
> > into adjusting the min split size for the mappers?.. (can this value
> > be adjusted dynamically based on the input file size?..)
> >
> > Thanks to anyone who can give me some pointers :)
> > --Leo
> >
>

Re: Is it possible to use NullWritable in combiner? + general question about combining output from many small maps

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Leo,

I am confused: how do you want to partition the work between multiple
reducers if the map-emitted key is null?  If you don't (say you want to
reduce everything in one reducer), then the key type/value should not
matter: just emit a constant of any type and discard it later on.
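
For example, something along these lines (just a sketch; Text stands in
for your actual value type and the class name is made up):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit a constant dummy key so the shuffle has something to partition on;
// the key is simply dropped again on the reduce side.
public class ConstantKeyMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  private static final IntWritable DUMMY = new IntWritable(0);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // ... do the real per-record work here ...
    context.write(DUMMY, line);  // constant key: everything goes to one reducer
  }
}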

Alex K

On Wed, Jul 21, 2010 at 1:59 AM, Leo Alekseyev <dn...@gmail.com> wrote:

> Hi All,
> I have a job where all processing is done by the mappers, but each
> mapper produces a small file, which I want to combine into 3-4 large
> ones.  In addition, I only care about the values, not the keys, so
> NullWritable key is in order.  I tried using the default reducer
> (which according to the docs is identity) by setting
> job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
> using a NullWritable key on the mapper output.  However, this seems to
> concentrate the work on one reducer only.  I then tried to output
> LongWritable as the mapper key, and write a combiner to output
> NullWritable (i.e. class GenerateLogLineProtoCombiner extends
> Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
> ProtobufLineMsgWritable>); still using the default reducer.  This gave
> me the following error thrown by the combiner:
>
> 10/07/21 01:21:38 INFO mapred.JobClient: Task Id :
> attempt_201007122205_1058_m_000104_2, Status : FAILED
> java.io.IOException: wrong key class: class
> org.apache.hadoop.io.NullWritable is not class
> org.apache.hadoop.io.LongWritable
>        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
>          .........
>
> I was able to get things working by explicitly putting in an identity
> reducer that takes (LongWritable key, value) and outputs
> (NullWritable, value).  However, now most of my processing is in the
> reduce phase, which seems like a waste -- it's copying and sorting
> data, but all I really need is to "glue" together the small map
> outputs.
>
> Thus, my questions are: I don't really understand why the combiner is
> throwing an error here.  Does it simply not allow NullWritables on the
> output?...
> The second question is -- is there a standard strategy for quickly
> combining the many small map outputs?  Is it worth, perhaps, to look
> into adjusting the min split size for the mappers?.. (can this value
> be adjusted dynamically based on the input file size?..)
>
> Thanks to anyone who can give me some pointers :)
> --Leo
>

Re: Is it possible to use NullWritable in combiner? + general question about combining output from many small maps

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Leo,

The splits are determined by the *InputFormat.getSplits()* method.  There is
a good discussion of input formats and splits in Chapter 7 of the THDG book
by Tom White, which I highly recommend (specifically the section 'Input
Formats').  There are a number of parameters you can tweak, like
*mapred.{min,max}.split.size*.  I probably should not go further here, since
the interface changes from time to time, depends on the version you are
using, and I haven't tested all the options there.  It also depends on your
specific needs, and you might want to write your own InputFormat.  However,
I will say that the most effective way to make the mappers work with larger
chunks is to just increase the input file block size (you can do this on a
per-file basis).
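
For example, roughly (treat this as a sketch: the property name and the
exact APIs vary with the version, and the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitAndBlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ask FileInputFormat for splits of at least ~512 MB.
    conf.setLong("mapred.min.split.size", 512L * 1024 * 1024);

    // Alternatively, pick the block size per file when the input is first
    // written to HDFS:
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(
        new Path("/data/big-input"),
        true,                                      // overwrite
        conf.getInt("io.file.buffer.size", 4096),  // io buffer size
        (short) 3,                                 // replication
        512L * 1024 * 1024);                       // block size: 512 MB
    // ... write the data ...
    out.close();
  }
}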

Looking at the discussion here, it also seems you might want to look at
map-side aggregation in the mapper.
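
By map-side aggregation I mean the usual in-mapper pattern: buffer partial
results in memory and emit them once in cleanup().  A generic word-count
style sketch, not your actual types:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Aggregate inside the mapper instead of relying on a combiner.
public class InMapperAggregationMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    String key = line.toString();          // whatever the real key is
    Long soFar = counts.get(key);
    counts.put(key, soFar == null ? 1L : soFar + 1);
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit the aggregated counts once, at the end of the map task.
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}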

You are right: the shuffle can be a lot of overhead, since it also does
sorting.

Alex K

On Thu, Jul 22, 2010 at 1:39 PM, Leo Alekseyev <dn...@gmail.com> wrote:

> Thanks for everybody's responses.  I think I've got things sorted out
> for the time being; some folks were asking me for clarification of my
> problem, so let me elaborate, for the list archives if nothing else.
> In brief, say I have 1000
> mappers each outputting 20 MB chunks; my problem doesn't require a
> reduce step, but I'm not happy with the file being partitioned in many
> small chunks smaller than DFS block size.  But when I tried running
> e.g. 4 reducers so that I get 4 5 GB files at the end, the reduce step
> took quite a bit longer than the map step, most of it being network
> traffic during the shuffle step.  It seemed wasteful to me, in light
> of the fact that the reducers' only purpose was to "glue together" the
> small files.
>
> I am guessing that there's no way to
> get around this -- when using reducers you'll have to be sending
> chunks to machines that combine them.  It is possible, however, to
> tweak the size of each map's output (and thus the number of mappers)
> by adjusting min split input size; for some of my jobs it's proving to
> be a good solution
>
> --Leo
>
> On Wed, Jul 21, 2010 at 2:57 AM, Himanshu Vashishtha
> <va...@gmail.com> wrote:
> > Please see my comments in-line, as per my understanding of Hadoop & your
> > problems. See if they are helpful.
> >
> > Cheers,
> > Himanshu
> >
> > On Wed, Jul 21, 2010 at 2:59 AM, Leo Alekseyev <dn...@gmail.com>
> wrote:
> >
> >> Hi All,
> >> I have a job where all processing is done by the mappers, but each
> >> mapper produces a small file, which I want to combine into 3-4 large
> >> ones.  In addition, I only care about the values, not the keys, so
> >> NullWritable key is in order.  I tried using the default reducer
> >> (which according to the docs is identity) by setting
> >> job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
> >> using a NullWritable key on the mapper output.  However, this seems to
> >> concentrate the work on one reducer only.
> >
> >
> > NullWritable is a singleton class. So, the entire map output related to it
> > will go to a single reduce node.
> >
> >
> >> I then tried to output
> >> LongWritable as the mapper key, and write a combiner to output
> >> NullWritable (i.e. class GenerateLogLineProtoCombiner extends
> >> Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
> >> ProtobufLineMsgWritable>); still using the default reducer.  This gave
> >> me the following error thrown by the combiner:
> >>
> >> 10/07/21 01:21:38 INFO mapred.JobClient: Task Id :
> >> attempt_201007122205_1058_m_000104_2, Status : FAILED
> >> java.io.IOException: wrong key class: class
> >> org.apache.hadoop.io.NullWritable is not class
> >> org.apache.hadoop.io.LongWritable
> >>        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
> >>          .........
> >>
> > A combiner's goal is to lessen the reducer's workload. Ideally, its output
> > key/value types should be the same as the mapper's output key/value types.
> > Hence the error.
> >
> >> I was able to get things working by explicitly putting in an identity
> >> reducer that takes (LongWritable key, value) and outputs
> >> (NullWritable, value).  However, now most of my processing is in the
> >> reduce phase, which seems like a waste -- it's copying and sorting
> >> data, but all I really need is to "glue" together the small map
> >> outputs.
> >>
> >> Thus, my questions are: I don't really understand why the combiner is
> >> throwing an error here.  Does it simply not allow NullWritables on the
> >> output?...
> >> The second question is -- is there a standard strategy for quickly
> >> combining the many small map outputs?  Is it worth, perhaps, to look
> >> into adjusting the min split size for the mappers?.. (can this value
> >> be adjusted dynamically based on the input file size?..)
> >>
> > I don't know of any such strategy. How about defining a smaller number of
> > reducers? I am also not able to fully understand the problem. It would be
> > great if you could be a bit more specific (in terms of map input and output
> > sizes, and reduce output size).
> >
> >
> >> Thanks to anyone who can give me some pointers :)
> >> --Leo
> >>
> >
>

Re: Is it possible to use NullWritable in combiner? + general question about combining output from many small maps

Posted by Leo Alekseyev <dn...@gmail.com>.
Thanks for everybody's responses.  I think I've got things sorted out
for the time being; some folks were asking me for clarification of my
problem, so let me elaborate, for the list archives if nothing else.
In brief, say I have 1000
mappers each outputting 20 MB chunks; my problem doesn't require a
reduce step, but I'm not happy with the file being partitioned in many
small chunks smaller than DFS block size.  But when I tried running
e.g. 4 reducers so that I get four 5 GB files at the end, the reduce step
took quite a bit longer than the map step, most of it being network
traffic during the shuffle step.  It seemed wasteful to me, in light
of the fact that the reducers' only purpose was to "glue together" the
small files.

I am guessing that there's no way to get around this -- when using reducers
you'll have to send chunks over the network to the machines that combine
them.  It is possible, however, to tweak the size of each map's output (and
thus the number of mappers) by adjusting the minimum input split size; for
some of my jobs this is proving to be a good solution.
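
For the archives, the tweak boils down to something like this (a sketch;
the exact property name may differ on other versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyGlueJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask for ~2 GB splits so each map task reads (and writes) a bigger chunk.
    conf.setLong("mapred.min.split.size", 2L * 1024 * 1024 * 1024);

    Job job = new Job(conf, "map-only glue job");
    job.setNumReduceTasks(0);  // map-only: output goes straight to HDFS, no shuffle
    // ... set mapper class, input/output formats and paths as usual ...
  }
}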

--Leo

On Wed, Jul 21, 2010 at 2:57 AM, Himanshu Vashishtha
<va...@gmail.com> wrote:
> Please see my comments in-line, as per my understanding of Hadoop & your
> problems. See if they are helpful.
>
> Cheers,
> Himanshu
>
> On Wed, Jul 21, 2010 at 2:59 AM, Leo Alekseyev <dn...@gmail.com> wrote:
>
>> Hi All,
>> I have a job where all processing is done by the mappers, but each
>> mapper produces a small file, which I want to combine into 3-4 large
>> ones.  In addition, I only care about the values, not the keys, so
>> NullWritable key is in order.  I tried using the default reducer
>> (which according to the docs is identity) by setting
>> job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
>> using a NullWritable key on the mapper output.  However, this seems to
>> concentrate the work on one reducer only.
>
>
> NullWritable is a singleton class. So, the entire map output related to it
> will go to a single reduce node.
>
>
>> I then tried to output
>> LongWritable as the mapper key, and write a combiner to output
>> NullWritable (i.e. class GenerateLogLineProtoCombiner extends
>> Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
>> ProtobufLineMsgWritable>); still using the default reducer.  This gave
>> me the following error thrown by the combiner:
>>
>> 10/07/21 01:21:38 INFO mapred.JobClient: Task Id :
>> attempt_201007122205_1058_m_000104_2, Status : FAILED
>> java.io.IOException: wrong key class: class
>> org.apache.hadoop.io.NullWritable is not class
>> org.apache.hadoop.io.LongWritable
>>        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
>>          .........
>>
> A combiner's goal is to lessen the reducer's workload. Ideally, its output
> key/value types should be the same as the mapper's output key/value types.
> Hence the error.
>
>> I was able to get things working by explicitly putting in an identity
>> reducer that takes (LongWritable key, value) and outputs
>> (NullWritable, value).  However, now most of my processing is in the
>> reduce phase, which seems like a waste -- it's copying and sorting
>> data, but all I really need is to "glue" together the small map
>> outputs.
>>
>> Thus, my questions are: I don't really understand why the combiner is
>> throwing an error here.  Does it simply not allow NullWritables on the
>> output?...
>> The second question is -- is there a standard strategy for quickly
>> combining the many small map outputs?  Is it worth, perhaps, to look
>> into adjusting the min split size for the mappers?.. (can this value
>> be adjusted dynamically based on the input file size?..)
>>
> I don't know of any such strategy. How about defining a smaller number of
> reducers? I am also not able to fully understand the problem. It would be
> great if you could be a bit more specific (in terms of map input and output
> sizes, and reduce output size).
>
>
>> Thanks to anyone who can give me some pointers :)
>> --Leo
>>
>

Re: Is it possible to use NullWritable in combiner? + general question about combining output from many small maps

Posted by Himanshu Vashishtha <va...@gmail.com>.
Please see my comments in-line, as per my understanding of Hadoop & your
problems. See if they are helpful.

Cheers,
Himanshu

On Wed, Jul 21, 2010 at 2:59 AM, Leo Alekseyev <dn...@gmail.com> wrote:

> Hi All,
> I have a job where all processing is done by the mappers, but each
> mapper produces a small file, which I want to combine into 3-4 large
> ones.  In addition, I only care about the values, not the keys, so
> NullWritable key is in order.  I tried using the default reducer
> (which according to the docs is identity) by setting
> job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
> using a NullWritable key on the mapper output.  However, this seems to
> concentrate the work on one reducer only.


NullWritable is a singleton class. So, the entire map output related to it
will go to a single reduce node.
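
The default HashPartitioner computes roughly the following for each record,
so with a singleton key every record gets the same partition number (just a
sketch):

import org.apache.hadoop.io.NullWritable;

public class WhyOneReducer {
  // Roughly what the default HashPartitioner does per record:
  static int partitionFor(Object key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    // NullWritable is a singleton, so its hash code is identical for every
    // record and everything lands in the same partition (reducer).
    System.out.println(partitionFor(NullWritable.get(), 4));
  }
}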


> I then tried to output
> LongWritable as the mapper key, and write a combiner to output
> NullWritable (i.e. class GenerateLogLineProtoCombiner extends
> Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
> ProtobufLineMsgWritable>); still using the default reducer.  This gave
> me the following error thrown by the combiner:
>
> 10/07/21 01:21:38 INFO mapred.JobClient: Task Id :
> attempt_201007122205_1058_m_000104_2, Status : FAILED
> java.io.IOException: wrong key class: class
> org.apache.hadoop.io.NullWritable is not class
> org.apache.hadoop.io.LongWritable
>        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
>          .........
>
A combiner's goal is to lessen the reducer's workload. Ideally, its output
key/value types should be the same as the mapper's output key/value types.
Hence the error.

> I was able to get things working by explicitly putting in an identity
> reducer that takes (LongWritable key, value) and outputs
> (NullWritable, value).  However, now most of my processing is in the
> reduce phase, which seems like a waste -- it's copying and sorting
> data, but all I really need is to "glue" together the small map
> outputs.
>
> Thus, my questions are: I don't really understand why the combiner is
> throwing an error here.  Does it simply not allow NullWritables on the
> output?...
> The second question is -- is there a standard strategy for quickly
> combining the many small map outputs?  Is it worth, perhaps, to look
> into adjusting the min split size for the mappers?.. (can this value
> be adjusted dynamically based on the input file size?..)
>
I don't know of any such strategy. How about defining a smaller number of
reducers? I am also not able to fully understand the problem. It would be
great if you could be a bit more specific (in terms of map input and output
sizes, and reduce output size).


> Thanks to anyone who can give me some pointers :)
> --Leo
>