Posted to common-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2012/01/24 18:33:55 UTC
When to use a combiner?
While working a sample issue I used a combiner, and I noticed that the combiner
output records were 90% of the combiner input records; when looking at the data
I found relatively few duplicated keys. This raises the question of what
fraction of duplicate keys makes it reasonable to use a combiner. If every key
is unique, I presume that using a combiner will waste time and resources,
especially if the data is large. So what fraction of duplicated keys is needed
to justify a combiner?
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
Re: When to use a combiner?
Posted by Raj V <ra...@yahoo.com>.
Touché!
Raj
Re: When to use a combiner?
Posted by Robert Evans <ev...@yahoo-inc.com>.
You can use a combiner for average. You just have to write a separate combiner from your reducer.
class MyCombiner extends Reducer<Key, Pair<Long, Long>, Key, Pair<Long, Long>> {
    // Each value is a (sum, count) pair
    @Override
    protected void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> value : values) {
            sum += value.first;
            count += value.second;
        }
        context.write(key, new Pair<Long, Long>(sum, count));
    }
}

class MyReducer extends Reducer<Key, Pair<Long, Long>, Key, Double> {
    // Each value is a (sum, count) pair
    @Override
    protected void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> value : values) {
            sum += value.first;
            count += value.second;
        }
        context.write(key, ((double) sum) / count);
    }
}
--Bobby Evans
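The (sum, count) trick above can be checked outside Hadoop with a small standalone simulation. This is my own illustration, not Hadoop API: `combine` and `reduce` here are plain static methods standing in for the combiner and reducer steps, and the sample values are made up.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Standalone sketch of why the (sum, count) trick works: the combiner emits
// partial (sum, count) pairs per partition, and the reducer folds them into
// one global average. Names are illustrative, not Hadoop classes.
public class AverageCombinerSketch {
    // Combiner step: fold one partition's values into a (sum, count) pair.
    static Map.Entry<Long, Long> combine(List<Long> values) {
        long sum = 0, count = 0;
        for (long v : values) { sum += v; count++; }
        return new SimpleEntry<>(sum, count);
    }

    // Reducer step: fold the partial pairs into the final average.
    static double reduce(List<Map.Entry<Long, Long>> partials) {
        long sum = 0, count = 0;
        for (Map.Entry<Long, Long> p : partials) {
            sum += p.getKey();
            count += p.getValue();
        }
        return ((double) sum) / count;
    }

    public static void main(String[] args) {
        List<Long> node1 = Arrays.asList(20L, 10L, 40L);
        List<Long> node2 = Arrays.asList(0L, 15L);
        // Combining each node's output first still yields the global average.
        double avg = reduce(Arrays.asList(combine(node1), combine(node2)));
        System.out.println(avg);  // 17.0, same as averaging all five values directly
    }
}
```

Because only sums and counts cross the network, each node ships one pair per key instead of one record per value.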
Re: When to use a combiner?
Posted by Raj V <ra...@yahoo.com>.
Just to add to Sameer's response: you cannot use a combiner when you are finding the average temperature. The combiner running on each mapper will produce the average for that mapper's output, and the reducer will then average the combiner outputs, which in this case will be the average of the averages.
You can use a combiner if your reducer function R satisfies
R(S) = R(R(s1), R(s2), ..., R(sn)), where S is the whole set and s1, s2, ..., sn are an arbitrary partition of S.
Raj
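To make that condition concrete, here is a small standalone sketch (my own illustration, not from this thread; the class, methods, and numbers are made up) showing that sum satisfies R(S) = R(R(s1), ..., R(sn)) while a naive average does not:

```java
import java.util.Arrays;
import java.util.List;

// Sum is combiner-safe: summing per-partition sums equals summing the whole
// set. A naive average is not: averaging per-partition averages weights each
// partition equally, regardless of how many values it holds.
public class CombinerCondition {
    static long sum(List<Long> s) {
        long total = 0;
        for (long v : s) total += v;
        return total;
    }

    static double avg(List<Long> s) {
        return ((double) sum(s)) / s.size();
    }

    public static void main(String[] args) {
        List<Long> s1 = Arrays.asList(20L, 10L, 40L);            // one mapper's values
        List<Long> s2 = Arrays.asList(0L, 15L);                  // another mapper's values
        List<Long> whole = Arrays.asList(20L, 10L, 40L, 0L, 15L);

        // R = sum: R(S) == R(R(s1), R(s2)), so a combiner is safe.
        System.out.println(sum(whole));                           // 85
        System.out.println(sum(Arrays.asList(sum(s1), sum(s2)))); // 85

        // R = average: the average of the averages is a different answer.
        System.out.println(avg(whole));                           // 17.0
        System.out.println((avg(s1) + avg(s2)) / 2);              // ~15.42, not 17.0
    }
}
```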
Re: When to use a combiner?
Posted by Sameer Farooqui <sa...@hortonworks.com>.
Hi Steve,
Yeah, you're right in your suspicions that a combiner may not be useful
in your use case. It's mainly used to reduce network traffic between the
mappers and the reducers. Hadoop may apply the combiner zero, one or
multiple times to the intermediate output from the mapper, so it's hard
to accurately predict the CPU impact a combiner will have. The reduction
in network packets is a lot easier to predict and actually see.
From Chuck Lam's 'Hadoop in Action': "A combiner doesn't necessarily
improve performance. You should monitor the job's behavior to see if the
number of records outputted by the combiner is meaningfully less than
the number of records going in. The reduction must justify the extra
execution time of running a combiner. You can easily check this through
the JobTracker's Web UI."
One thing to point out: don't just assume the combiner is ineffective because
it's not reducing the number of unique keys emitted from the map side. It
really depends on your specific use case for the combiner and the nature of
the MapReduce job. For example, imagine your map tasks find the maximum
temperature for a given year (example from 'Hadoop: The Definitive Guide'),
like so:
Node 1's Map output:
(1950, 20)
(1950, 10)
(1950, 40)
Node 2's Map output:
(1950, 0)
(1950, 15)
The reduce function would get this input after the shuffle phase:
(1950, [0, 10, 15, 20, 40])
and the reduce function would output:
(1950, 40)
But if you used a combiner, the reduce function would have gotten
smaller input to work with after the shuffle phase:
(1950, [40, 15])
and the output from Reduce would be the same.
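The shuffle-traffic saving in the example above can be simulated with a short standalone sketch (my own illustration, not Hadoop API; `max` here stands in for both the combiner and the reduce function):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simulation of the 1950 max-temperature example: a max combiner shrinks each
// node's map output to one record per key, cutting shuffle traffic from five
// records to two while leaving the final answer unchanged.
public class MaxCombinerDemo {
    static long max(List<Long> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        List<Long> node1 = Arrays.asList(20L, 10L, 40L);  // node 1's 1950 values
        List<Long> node2 = Arrays.asList(0L, 15L);        // node 2's 1950 values

        // Without a combiner, all five records cross the network.
        List<Long> shuffledPlain = new ArrayList<>(node1);
        shuffledPlain.addAll(node2);
        System.out.println(shuffledPlain.size() + " records shuffled, max = "
                + max(shuffledPlain));   // 5 records shuffled, max = 40

        // With a max combiner, each node sends only its local maximum.
        List<Long> shuffledCombined = Arrays.asList(max(node1), max(node2));
        System.out.println(shuffledCombined.size() + " records shuffled, max = "
                + max(shuffledCombined)); // 2 records shuffled, max = 40
    }
}
```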
There are specific use cases, like the one above, for which a combiner
delivers significant performance gains, but it shouldn't be used by default
100% of the time.
Both of the books I mentioned are excellent with tons of real-world
tips, so I highly recommend them.
--
Sameer Farooqui
Systems Architect / HortonWorks