Posted to common-user@hadoop.apache.org by Yonggang Qiao <yo...@gmail.com> on 2010/01/05 22:12:26 UTC
Reduce output records Counter not right?
trying a wider audience...
the number from the Reduce output records counter doesn't match the
actual number of records in the output files, although after rerunning the
job it did match. Any idea what could be wrong?
Thanks,
Yonggang
Re: combiner statistics
Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks. What I mean is, the combiner doesn't "intentionally" re-read spilled records back into memory just to combine them. But it does happen that some records are re-read for the merge sort, and I think the combiner should work on those records.
-Gang
----- Original Message ----
From: Ted Xu <te...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2010/1/5 (Tue) 8:43:53 PM
Subject: Re: combiner statistics
Hi Gang,
> My understanding of this is that the combiner has to re-read some records
> which have already been spilled to disk and combine them with those records
> which come later.
>
I believe the combine operation is done before the map-side spill and after
the reduce-side merge. Combining only happens in memory; records are not
re-read from disk just to be combined.
> Besides, I am not sure whether the combiner can guarantee there is only one
> record for each distinct key in each map task. Or does it just "try its
> best" to combine?
>
Yes, it can only "try its best".
Re: combiner statistics
Posted by Ted Xu <te...@gmail.com>.
Hi Gang,
> My understanding of this is that the combiner has to re-read some records
> which have already been spilled to disk and combine them with those records
> which come later.
>
I believe the combine operation is done before the map-side spill and after
the reduce-side merge. Combining only happens in memory; records are not
re-read from disk just to be combined.
> Besides, I am not sure whether the combiner can guarantee there is only one
> record for each distinct key in each map task. Or does it just "try its
> best" to combine?
>
Yes, it can only "try its best".
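[The "best effort" point above can be illustrated with a plain-Java sketch (not Hadoop API code; the class and method names are hypothetical). Because the framework may apply a combiner zero, one, or many times per key, a combine function must be associative and commutative, so that combining partial groups and then combining the results gives the same answer as one direct pass:]

```java
import java.util.Arrays;
import java.util.List;

// Sketch (assumed names, not the Hadoop API): a sum-style combiner must be
// safe to apply any number of times per key without changing the final
// reduce result.
public class CombinerSketch {
    // Combine a batch of partial counts for one key into a single count.
    static int combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> mapOutputs = Arrays.asList(1, 1, 1, 1, 1, 1);

        // Path A: no combiner; the reducer sees all six records directly.
        int reducedDirectly = combine(mapOutputs);

        // Path B: the framework happens to combine two spills separately,
        // then the reducer combines the two partial results.
        int spill1 = combine(mapOutputs.subList(0, 3));
        int spill2 = combine(mapOutputs.subList(3, 6));
        int reducedViaCombiner = combine(Arrays.asList(spill1, spill2));

        // Both paths agree because summation is associative and commutative.
        System.out.println(reducedDirectly + " " + reducedViaCombiner);
    }
}
```

[This is also why Hadoop's typical word-count examples reuse the same Reducer class as the combiner: the operation tolerates being applied repeatedly.]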
combiner statistics
Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
when I run a MapReduce job using a combiner, I find that combiner input # > map output #, and combiner output # > reduce input #. My understanding of this is that the combiner has to re-read some records which have already been spilled to disk and combine them with those records which come later. These re-read records are also counted as "input", which increases the input counter value. Similarly, since we may combine the same key multiple times, we have to write it to disk multiple times, which increases the combiner output counter.
Please correct me if there is some problem in my understanding.
Besides, I am not sure whether the combiner can guarantee there is only one record for each distinct key in each map task. Or does it just "try its best" to combine?
Thanks.
-Gang
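[The counter arithmetic described above can be reproduced with a toy simulation (not Hadoop internals; the spill size and method names are assumptions). If the combiner runs once per spill and once more when the spill files are merged, every map output record is seen once in phase one and each per-spill combined record is seen again in phase two, so COMBINE_INPUT_RECORDS exceeds MAP_OUTPUT_RECORDS, and COMBINE_OUTPUT_RECORDS exceeds the number of records the reducer finally receives:]

```java
// Toy simulation (assumed model, not Hadoop code): count records flowing
// into and out of the combiner when it runs once per spill and once more
// at merge time. The spill size of 2 is an arbitrary assumption.
public class CombineCounters {
    // Returns {combineInputRecords, combineOutputRecords} for one key.
    static int[] simulate(int mapOutputRecords, int spillSize) {
        int spills = mapOutputRecords / spillSize;
        int combineInput = mapOutputRecords;  // phase 1: each record combined once
        int combineOutput = spills;           // one combined record per spill file
        combineInput += spills;               // phase 2: merge re-combines spill outputs
        combineOutput += 1;                   // a single record reaches the reducer
        return new int[] { combineInput, combineOutput };
    }

    public static void main(String[] args) {
        int[] c = simulate(6, 2);  // six (word, 1) pairs for one key, 3 spills
        System.out.println("combine input = " + c[0] + ", map output = 6");
        System.out.println("combine output = " + c[1] + ", reduce input = 1");
    }
}
```

[Under these assumptions, 6 map output records produce 9 combine inputs and 4 combine outputs for 1 reduce input, matching the two inequalities observed in the question.]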