Posted to hdfs-user@hadoop.apache.org by jeremy p <at...@gmail.com> on 2013/11/29 23:37:19 UTC

How to do aggregate operations with multiple reducers

Hey all,

So, I'm writing a module where I need to do aggregate operations over the
entire set of data.  However, I also want to use multiple reducers.

For example, let's say each row of input data looks like this:

DATE_TIME_TEMPERATURE

Let's say my mapper outputs DATE_TIME as a key, and the temperature as a
value.  In this example, I'm using 3 reducers, which should create three
output files. I want to find the day with the highest temperature in the
entire data set.
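
For concreteness, the mapper described here might look roughly like this in the Java MapReduce API (an untested sketch; the class name and the underscore-separated row layout are assumptions, not something the thread specifies):

```java
// Untested sketch: emits DATE_TIME as the key and the temperature as the
// value. Assumes rows like "2013-11-29_23:00_30.5", with the temperature
// as the last underscore-separated field.
public class TempMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text row, Context ctx)
            throws IOException, InterruptedException {
        String line = row.toString();
        int cut = line.lastIndexOf('_');   // temperature is the last field
        ctx.write(new Text(line.substring(0, cut)),
                  new DoubleWritable(Double.parseDouble(line.substring(cut + 1))));
    }
}
```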

I know I could just write a script that examines the output from the
reducers and picks out the value with the highest temperature.  I could
also write a mapreduce job that does the same thing, and chain the two jobs
together.  However, these solutions seem kinda wrong to me.
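
The "script that examines the output" option can be a few lines of plain Java, assuming the reducers write the default TextOutputFormat layout (key, tab, value). This is only a sketch of the post-processing step, not anything Hadoop provides:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class MaxTempScan {

    /** Scans every part-* file in a job output directory and returns the
     *  line with the highest temperature. Assumes each line is
     *  "DATE_TIME<TAB>TEMPERATURE", the default TextOutputFormat layout. */
    public static String findMax(Path outputDir) {
        String bestLine = null;
        double bestTemp = Double.NEGATIVE_INFINITY;
        try (DirectoryStream<Path> parts =
                 Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part)) {
                    double temp =
                        Double.parseDouble(line.substring(line.indexOf('\t') + 1));
                    if (temp > bestTemp) {
                        bestTemp = temp;
                        bestLine = line;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bestLine;
    }
}
```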

What's the commonly-accepted best way to do this?

--Jeremy

Re: How to do aggregate operations with multiple reducers

Posted by Adam Kawa <ka...@gmail.com>.
> I know I could just write a script that examines the output from the
> reducers and picks out the value with the highest temperature.
>

In general, I think that this is an acceptable solution.

However, if you have a very large number of reducers (say, many hundreds or
thousands), then reading the content of thousands of files from HDFS can be
slow. I would slightly optimize it by using the MultipleOutputs class (
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html),
configured so that each reducer writes an empty file named
max-temp_reducer-id.part. Then the last file in your output directory is the
one whose filename starts with the maximal temperature (alternatively, you
can use the naming convention <MAX-LONG-max-temp>_reducer-id.part, so that
the file with the maximal temperature sorts first in your output directory;
that might be faster).
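
The <MAX-LONG-max-temp> trick is just zero-padded subtraction, so that a plain lexicographic directory listing puts the hottest reducer first. A minimal sketch of the naming helper (the method and format are illustrative, not part of any Hadoop API):

```java
public class MaxTempFileName {

    /** Builds a filename whose prefix is (Long.MAX_VALUE - maxTemp),
     *  zero-padded to 19 digits, so the file written by the reducer that
     *  saw the highest temperature sorts first lexicographically. */
    public static String fileName(long maxTemp, int reducerId) {
        return String.format("%019d_r-%05d.part",
                             Long.MAX_VALUE - maxTemp, reducerId);
    }
}
```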

Obviously, to minimize the amount of data shuffled to the reduce tasks, you
can use a Combiner class in this case, so that reducers have less data to
process.
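
Wiring that up is one line in the driver (sketch; MaxTempReducer is a hypothetical reducer that emits only the maximum it has seen, which makes it safe to reuse as a combiner because max is associative and commutative):

```java
// Sketch: reuse the max-temperature reducer as a combiner.
job.setCombinerClass(MaxTempReducer.class);
```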

An alternative way that I can imagine: if you have only a couple of
reducers, each of them can publish its maximal temperature as a counter, and
when the job finishes you can get the counters from the Job object and find
the one with the highest temperature. Please note that counters are
"expensive" (at least in MRv1), so you should not have many of them (up to
tens, but rather not hundreds:
http://developer.yahoo.com/blogs/hadoop/apache-hadoop-best-practices-anti-patterns-465.html
)
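
Reading such counters back in the driver might look roughly like this (untested sketch; the "MaxTemp" group name is hypothetical, and since counter values are longs, the temperature would have to be encoded as a long):

```java
// Sketch: each reducer has published its max as a counter in group "MaxTemp".
job.waitForCompletion(true);
long best = Long.MIN_VALUE;
for (Counter c : job.getCounters().getGroup("MaxTemp")) {
    best = Math.max(best, c.getValue());   // one counter per reducer
}
```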
