Posted to common-user@hadoop.apache.org by Tarandeep Singh <ta...@gmail.com> on 2008/05/27 23:53:53 UTC

Question - Sum of a col in log file - memory requirement in reducer

Hi,

Is it correct that an intermediate key from a mapper goes to only one reducer?
If yes, then if I have to sum up the values of some column in a log file, a
single reducer will consume a lot of memory -

I have a simple requirement - to sum up the values of one of the
columns in the log files.
Suppose the log file has this structure -

C1 C2 C3

and I want the sum of all the C2 values.

I thought of this MapReduce solution -

Map -  Key = 1
       Value = C2

At the reducer I will get - ( 1, [all the values of the C2 column] )
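
In code, the map I have in mind would look roughly like this (just an
untested sketch against the org.apache.hadoop.mapred API; the class name is
mine, and it assumes whitespace-separated lines with an integer C2 column):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: emits every C2 value under the single constant key 1.
public class ColumnSumMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, LongWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, LongWritable> output,
                  Reporter reporter) throws IOException {
    String[] cols = line.toString().split("\\s+");   // C1 C2 C3
    output.collect(ONE, new LongWritable(Long.parseLong(cols[1])));
  }
}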

Now my question is: if the log file has a billion records, this would
effectively mean that one reducer receives a list of a billion values.
How is the reducer going to handle such a huge amount of data?
If my approach is wrong, please suggest an alternative.

Thanks,
Taran

Re: Question - Sum of a col in log file - memory requirement in reducer

Posted by Tarandeep Singh <ta...@gmail.com>.
On Tue, May 27, 2008 at 3:58 PM, Ted Dunning <td...@veoh.com> wrote:
>
> Your approach is close to correct.
>
> You can also define what is called a combiner which is very much like a
> reducer except that it is given only the partial set of results that are
> output from the mapper.
>

Oops, I totally forgot about the combiner - I remember reading about it, but
it slipped my mind.
Yes, now it makes sense; the combiners would reduce the load on the single
reducer a lot.
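
So in the driver I just need to register the same sum reducer as the
combiner, something like this (a rough, untested sketch with the JobConf API;
ColumnSumMapper and SumReducer are placeholder names for my own classes):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ColumnSumDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ColumnSumDriver.class);
    conf.setJobName("column-sum");

    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(ColumnSumMapper.class);
    // the same summing class runs as the combiner (per-map partial sums)
    // and as the final reducer
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}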

Thanks, Ted!

-Taran


Re: Question - Sum of a col in log file - memory requirement in reducer

Posted by Ted Dunning <td...@veoh.com>.
Your approach is close to correct.

You can also define what is called a combiner which is very much like a
reducer except that it is given only the partial set of results that are
output from the mapper.

For associative functions like addition, you can add up groups of numbers
and then add up those sums to get the same answer.  This means that you can
have the same function for reducer and combiner.  What will happen is that
the mapper will select the column as you describe, but the combiner will be
run on those values so that the reducer will see one value per mapper.  For
very large inputs, this will typically be only a few thousand or a few tens
of thousands of rows.  There would only be a single output row.  This means
that the overhead of the single reducer will be negligible.

This is analogous to the famous word counting example.
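
Concretely, the shared reduce/combine class is just a sum over the values for
a key.  A rough, untested sketch with the current mapred API (the class name
is only illustrative):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums all values seen for a key.  Registered as both the combiner and the
// reducer, so each map task ships only its partial sum to the reduce task.
public class SumReducer extends MapReduceBase
    implements Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {

  public void reduce(IntWritable key, Iterator<LongWritable> values,
                     OutputCollector<IntWritable, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new LongWritable(sum));
  }
}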

Another interesting thing you can do: if you have inputs like this,

   C1 C2 C3

the map can output multiple records, one per column:

    1  C1
    2  C2
    3  C3

The reducer and combiner would be as before, except that there would be
roughly three times as many intermediate rows and three output rows (one sum
per column).
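
A sketch of that map (again untested, the name is illustrative, and it
assumes every column is numeric):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits one record per column, keyed by the column index (1, 2, 3, ...),
// so the same sum combiner/reducer produces one output row per column.
public class MultiColumnSumMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, LongWritable> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, LongWritable> output,
                  Reporter reporter) throws IOException {
    String[] cols = line.toString().split("\\s+");
    for (int i = 0; i < cols.length; i++) {
      output.collect(new IntWritable(i + 1),
                     new LongWritable(Long.parseLong(cols[i])));
    }
  }
}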

Does this help?
