Posted to common-user@hadoop.apache.org by Paco NATHAN <ce...@gmail.com> on 2008/11/17 23:29:34 UTC

combiner stats

Could someone please help explain the job counters shown for Combine
records on the JobTracker JSP page?

Here's an example from one of our MR jobs.  There are Combine input
and output record counters shown for both Map phase and Reduce phase.
We're not quite sure how to interpret them -

Map Phase:
   Map input records   85,013,261,279
   Map output records   85,013,261,279
   Combine input records   114,936,724,505
   Combine output records   38,750,511,975

Reduce Phase:
   Combine input records   8,827,017,275
   Combine output records   17,986,654
   Reduce input groups   2,221,796
   Reduce input records   17,986,654
   Reduce output records   4,443,590


What makes sense:
   * Considering the MR job and its data, the 85.0b count for Map
output records is expected
   * I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
   * Reduce phase shows Combine output records at 18.0m = Reduce input
records at 18.0m
   * Reduce input groups at 2.2m is expected
   * Reduce output records at 4.4m is verified

What doesn't make sense:
   * The 115b count for Combine input records during Map phase
   * The 8.8b count for Combine input records during Reduce phase
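
As a quick sanity check, the ratios implied by the counters above can be worked out directly. This is plain arithmetic on the posted figures, not output from Hadoop; the function and key names are made up for illustration.

```python
# Counter values copied from the JobTracker figures quoted above.
MAP_OUTPUT = 85_013_261_279
MAP_COMBINE_IN = 114_936_724_505
MAP_COMBINE_OUT = 38_750_511_975
REDUCE_COMBINE_IN = 8_827_017_275
REDUCE_COMBINE_OUT = 17_986_654


def counter_ratios():
    """Return the ratios discussed in the post (hypothetical helper)."""
    return {
        # Map output records per map-side Combine output record.
        "map_output_per_combine_output": MAP_OUTPUT / MAP_COMBINE_OUT,
        # How far map-side Combine input exceeds Map output.
        "combine_input_per_map_output": MAP_COMBINE_IN / MAP_OUTPUT,
        # Reduce-side Combine input per Combine output record.
        "reduce_combine_in_per_out": REDUCE_COMBINE_IN / REDUCE_COMBINE_OUT,
    }


if __name__ == "__main__":
    for name, value in counter_ratios().items():
        print(f"{name}: {value:.2f}")
```

The second ratio is greater than 1, which is the puzzle: more records entered the combiner on the map side than the mappers emitted.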

What would be the actual count of records coming out of the Map phase?

Thanks,
Paco

Re: combiner stats

Posted by Devaraj Das <dd...@yahoo-inc.com>.


On 11/18/08 6:36 PM, "Paco NATHAN" <ce...@gmail.com> wrote:

> Thank you, Devaraj -
> That explanation helps a lot.
> 
> Is the following reasonable to say?
> 
>     Combine input records count shown in the Map phase column of the
> report is a measure of how many times records have passed through the
> Combiner during merges of intermediate spills. Therefore, it may be
> larger than the actual count of records which are being merged.
> 
> 

Yes, but to be precise you should say "sorts and merges" rather than just
"merges". As you may know, the map task sorts its output buffer whenever it
has collected enough data, and the records that get spilled to disk are the
records that the combiner outputs.

Re: combiner stats

Posted by Paco NATHAN <ce...@gmail.com>.
Thank you, Devaraj -
That explanation helps a lot.

Is the following reasonable to say?

    Combine input records count shown in the Map phase column of the
report is a measure of how many times records have passed through the
Combiner during merges of intermediate spills. Therefore, it may be
larger than the actual count of records which are being merged.


Paco



Re: combiner stats

Posted by Devaraj Das <dd...@yahoo-inc.com>.


On 11/18/08 3:59 AM, "Paco NATHAN" <ce...@gmail.com> wrote:

> Could someone please help explain the job counters shown for Combine
> records on the JobTracker JSP page?
> 
> Here's an example from one of our MR jobs.  There are Combine input
> and output record counters shown for both Map phase and Reduce phase.
> We're not quite sure how to interpret them -
> 
> Map Phase:
>    Map input records   85,013,261,279
>    Map output records   85,013,261,279
>    Combine input records   114,936,724,505
>    Combine output records   38,750,511,975
> 
> Reduce Phase:
>    Combine input records   8,827,017,275
>    Combine output records   17,986,654
>    Reduce input groups   2,221,796
>    Reduce input records   17,986,654
>    Reduce output records   4,443,590
> 
> 
> What makes sense:
>    * Considering the MR job and its data, the 85.0b count for Map
> output records is expected
>    * I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
>    * Reduce phase shows Combine output records at 18.0m = Reduce input
> records at 18.0m
>    * Reduce input groups at 2.2m is expected
>    * Reduce output records at 4.4m is verified
> 
> What doesn't make sense:
>    * The 115b count for Combine input records during Map phase
>    * The 8.8b count for Combine input records during Reduce phase
> 

On the map side, the combiner is called after sort and during the merges of
the intermediate spills. At the end a single spill file is generated. Note
that, during the merges, the same record may pass multiple times through the
combiner. 
On the reduce side, the combiner is called only during merges of
intermediate data, and the intermediate merges stop at a certain point (once
<= io.sort.factor files remain). Hence the combiner may be called
fewer times here...
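
The map-side behaviour described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Hadoop code: the function names, `buffer_size` and `merge_factor` are hypothetical (the last one only loosely plays the role of io.sort.factor), the combiner is a simple sum, and it is assumed to run on every spill and on every merge pass.

```python
from collections import Counter
from itertools import islice


def combine(records, counters):
    """Sum-combiner: collapse (key, count) records that share a key.

    Every record fed in and every record emitted is added to the counters,
    so a record that is re-combined during a merge is counted again.
    """
    totals = Counter()
    n_in = 0
    for key, value in records:
        totals[key] += value
        n_in += 1
    out = sorted(totals.items())
    counters["combine_input"] += n_in
    counters["combine_output"] += len(out)
    return out


def run_map_side(map_output, buffer_size, merge_factor):
    """Toy model of one map task: spill with combine, then merge with combine."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    counters = Counter()
    spills = []

    # Spill phase: each full buffer is sorted, combined and written out.
    records = iter(map_output)
    while True:
        chunk = list(islice(records, buffer_size))
        if not chunk:
            break
        counters["map_output"] += len(chunk)
        spills.append(combine(chunk, counters))

    # Merge phase: merge up to merge_factor spills at a time, combining
    # again on each pass, until a single spill remains.
    while len(spills) > 1:
        group, spills = spills[:merge_factor], spills[merge_factor:]
        merged = [rec for spill in group for rec in spill]
        spills.append(combine(merged, counters))

    final_spill = spills[0] if spills else []
    return final_spill, counters


if __name__ == "__main__":
    # 20 map output records spread over 4 keys.
    data = [(i % 4, 1) for i in range(20)]
    final_spill, counters = run_map_side(data, buffer_size=5, merge_factor=2)
    print(dict(counters), len(final_spill))
```

With these inputs the model reports 20 map output records, 44 combine input records and 28 combine output records, while the final spill holds only 4 records. In this simplified model both combine counters are cumulative across spills and merge passes, so neither one equals the number of records that actually leave the map task.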

> What would be the actual count of records coming out of the Map phase?
> 
> Thanks,
> Paco