Posted to common-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2012/01/24 18:33:55 UTC

When to use a combiner?

While working through a sample problem I used a combiner, and I noticed that the
combiner output records were 90% of the combiner input records; looking at the
data, I found relatively few duplicated keys. This raises the question of what
fraction of duplicate keys makes it reasonable to use a combiner. If every key
is unique, I presume that using a combiner will waste time and resources,
especially if the data is large. So what fraction of duplicated keys is needed
to justify a combiner?
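A rough back-of-the-envelope check (with hypothetical counter values chosen to match the 90% figure above): the shuffle saving is just one minus the combiner's output/input ratio.

```java
public class CombinerRatio {
    // Percent reduction in shuffled records for given combine counters.
    static long reductionPercent(long combineInputRecords, long combineOutputRecords) {
        return Math.round(100.0 * (1.0 - (double) combineOutputRecords / combineInputRecords));
    }

    public static void main(String[] args) {
        // Hypothetical counter values matching the 90% figure in the question
        long in = 1_000_000;
        long out = 900_000;
        System.out.println(reductionPercent(in, out) + "% fewer records shuffled"); // 10% fewer records shuffled
    }
}
```

So a 90% output/input ratio means the combiner only trims about a tenth of the shuffle traffic, which may not pay for the extra sort-and-combine work.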

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: When to use a combiner?

Posted by Raj V <ra...@yahoo.com>.
Touché!

Raj




Re: When to use a combiner?

Posted by Robert Evans <ev...@yahoo-inc.com>.
You can use a combiner for average. You just have to write a combiner that is separate from your reducer.

class MyCombiner {
    // Each value is a (sum, count) pair
    void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context) {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> value : values) {
            sum += value.first;
            count += value.second;
        }
        // Emit the partial sum and count, not the average, so further
        // combining or reducing stays correct.
        context.write(key, new Pair<Long, Long>(sum, count));
    }
}

class MyReducer {
    // Each value is a (sum, count) pair
    void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context) {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> value : values) {
            sum += value.first;
            count += value.second;
        }
        // Only the final reduce divides to produce the average.
        context.write(key, ((double) sum) / count);
    }
}
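To see why shipping (sum, count) pairs works, here is a plain-Java sketch (outside Hadoop, with hypothetical helper names) that merges per-mapper partials and divides only once at the end:

```java
public class AverageCombine {
    // Combiner side: collapse one mapper's values into a (sum, count) pair.
    static long[] partial(long... values) {
        long sum = 0;
        for (long v : values) sum += v;
        return new long[] { sum, values.length };
    }

    // Reducer side: merge the pairs, then divide exactly once.
    static double average(long[]... partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return ((double) sum) / count;
    }

    public static void main(String[] args) {
        double viaPartials = average(partial(0, 10, 15), partial(20, 40));
        double direct = average(partial(0, 10, 15, 20, 40));
        System.out.println(viaPartials + " == " + direct); // 17.0 == 17.0
    }
}
```

Because addition is associative, the grouping into partials cannot change the final sum or count, so the average is exact no matter how many times the combiner runs.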

--Bobby Evans



Re: When to use a combiner?

Posted by Raj V <ra...@yahoo.com>.
Just to add to Sameer's response: you cannot use a combiner when you are finding the average temperature. The combiner running on each mapper will produce the average of that mapper's output, and the reducer will then compute the average of the combiner outputs, which is the average of the averages, not the true average.

You can use a combiner if your reducer function R satisfies

R(S) = R(R(s1), R(s2), ..., R(sn)), where S is the whole set and s1, s2, ..., sn are an arbitrary partition of S.
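A quick numeric sketch of why mean fails that property (illustrative temperatures, not data from this thread):

```java
import java.util.Arrays;
import java.util.List;

public class MeanNotCombinable {
    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().getAsDouble();
    }

    public static void main(String[] args) {
        List<Double> s1 = Arrays.asList(0.0, 10.0, 15.0); // one mapper's values
        List<Double> s2 = Arrays.asList(20.0, 40.0);      // another mapper's values

        double wholeSet = mean(Arrays.asList(0.0, 10.0, 15.0, 20.0, 40.0));
        double meanOfMeans = mean(Arrays.asList(mean(s1), mean(s2)));

        System.out.println(wholeSet);    // 17.0
        System.out.println(meanOfMeans); // ~19.17: R(R(s1), R(s2)) != R(S)
    }
}
```

The partitions have different sizes, so the mean of means weights each partition equally instead of each record, which is exactly why the (sum, count) trick is needed.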

Raj 




Re: When to use a combiner?

Posted by Sameer Farooqui <sa...@hortonworks.com>.
Hi Steve,

Yeah, you're right in your suspicions that a combiner may not be useful 
in your use case. It's mainly used to reduce network traffic between the 
mappers and the reducers. Hadoop may apply the combiner zero, one or 
multiple times to the intermediate output from the mapper, so it's hard 
to accurately predict the CPU impact a combiner will have. The reduction 
in network packets is a lot easier to predict and actually see.

From Chuck Lam's 'Hadoop in Action': "A combiner doesn't necessarily 
improve performance. You should monitor the job's behavior to see if the 
number of records outputted by the combiner is meaningfully less than 
the number of records going in. The reduction must justify the extra 
execution time of running a combiner. You can easily check this through 
the JobTracker's Web UI."

One thing to point out: don't just assume the combiner is ineffective because 
it's not reducing the number of unique keys emitted from the map side. It 
really depends on your specific use case for the combiner and the nature of 
the MapReduce job. For example, imagine your map tasks find the maximum 
temperature for a given year (an example from 'Hadoop: The Definitive 
Guide'), like so:

Node 1's Map output:
(1950, 20)
(1950, 10)
(1950, 40)

Node 2's Map output:
(1950, 0)
(1950, 15)

The reduce function would get this input after the shuffle phase:
(1950, [0, 10, 15, 20, 40])
and the reduce function would output:
(1950, 40)

But if you used a combiner, the reduce function would have gotten 
smaller input to work with after the shuffle phase:
(1950, [40, 15])
and the output from Reduce would be the same.
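The numbers above can be replayed in a tiny standalone sketch (plain Java, no Hadoop, hypothetical class name) to confirm that the combiner shrinks the shuffle without changing the answer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MaxTempCombiner {
    static long max(List<Long> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        List<Long> node1 = Arrays.asList(20L, 10L, 40L); // node 1's map output for 1950
        List<Long> node2 = Arrays.asList(0L, 15L);       // node 2's map output for 1950

        // Without a combiner, all five values cross the network to the reducer.
        List<Long> shuffled = new ArrayList<>(node1);
        shuffled.addAll(node2);
        System.out.println(max(shuffled)); // 40

        // With a combiner, each node ships only its local max: [40, 15].
        List<Long> combined = Arrays.asList(max(node1), max(node2));
        System.out.println(max(combined)); // 40: same answer, less data shuffled
    }
}
```

Max works as its own combiner because taking a max of maxes gives the same result as taking the max of everything at once.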

There are specific use cases like the one above where a combiner yields real 
performance gains, but it shouldn't be used by default 100% of the time.

Both of the books I mentioned are excellent with tons of real-world 
tips, so I highly recommend them.

-- 
Sameer Farooqui
Systems Architect / HortonWorks

