Posted to common-user@hadoop.apache.org by Yonggang Qiao <yo...@gmail.com> on 2010/01/05 22:12:26 UTC
Reduce output records Counter not right?
trying a wider audience...
the number from the Reduce output records counter doesn't match the
actual number of records in the output files, although after rerunning the
job it did match. Any idea what could be wrong?
Thanks,
Yonggang
Re: combiner statistics
Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks. What I mean is, the combiner doesn't "intentionally" re-read spilled records back into memory just to combine them. But it does happen that some records are re-read for the merge sort, and I think the combiner should work on those records.
-Gang
----- Original Message ----
From: Ted Xu <te...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2010/1/5 (Tue) 8:43:53 PM
Subject: Re: combiner statistics
Hi Gang,
> My understanding of this is that the combiner has to re-read some records
> which have already been spilled to disk and combine them with those records
> which come later.
>
I believe the combine operation is done before the map-side spill and after
the reduce-side merge. Combining only happens in memory; records are not
re-read from disk just to be combined.
> Besides, I am not sure whether the combiner can guarantee there is only one
> record for each distinct key in each map task. Or does it just "try its
> best" to combine?
>
Yes, it can only "try its best".
Re: combiner statistics
Posted by Ted Xu <te...@gmail.com>.
Hi Gang,
> My understanding of this is that the combiner has to re-read some records
> which have already been spilled to disk and combine them with those records
> which come later.
>
I believe the combine operation is done before the map-side spill and after
the reduce-side merge. Combining only happens in memory; records are not
re-read from disk just to be combined.
> Besides, I am not sure whether the combiner can guarantee there is only one
> record for each distinct key in each map task. Or does it just "try its
> best" to combine?
>
Yes, it can only "try its best".
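[The "best effort" point above can be illustrated with a plain-Java sketch (not Hadoop API code; the class and method names are hypothetical). Because the framework may apply a combiner zero, one, or many times per key, a combine function must be associative and commutative, so that combining partial groups and then combining the results gives the same answer as one direct pass:]

```java
import java.util.Arrays;
import java.util.List;

// Sketch (assumed names, not the Hadoop API): a sum-style combiner must be
// safe to apply any number of times per key without changing the final
// reduce result.
public class CombinerSketch {
    // Combine a batch of partial counts for one key into a single count.
    static int combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> mapOutputs = Arrays.asList(1, 1, 1, 1, 1, 1);

        // Path A: no combiner; the reducer sees all six records directly.
        int reducedDirectly = combine(mapOutputs);

        // Path B: the framework happens to combine two spills separately,
        // then the reducer combines the two partial results.
        int spill1 = combine(mapOutputs.subList(0, 3));
        int spill2 = combine(mapOutputs.subList(3, 6));
        int reducedViaCombiner = combine(Arrays.asList(spill1, spill2));

        // Both paths agree because summation is associative and commutative.
        System.out.println(reducedDirectly + " " + reducedViaCombiner);
    }
}
```

[This is also why Hadoop's typical word-count examples reuse the same Reducer class as the combiner: the operation tolerates being applied repeatedly.]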
combiner statistics
Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
when I run a MapReduce job using a combiner, I find that combiner input # > map output #, and combiner output # > reduce input #. My understanding of this is that the combiner has to re-read some records which have already been spilled to disk and combine them with those records which come later. These re-read records are also counted as "input", which increases the input counter value. Similarly, since we may combine the same key multiple times, we have to write it to disk multiple times, which increases the combiner output counter.
Please correct me if there is some problem in my understanding.
Besides, I am not sure whether the combiner can guarantee there is only one record for each distinct key in each map task. Or does it just "try its best" to combine?
Thanks.
-Gang
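[The counter arithmetic described above can be reproduced with a toy simulation (not Hadoop internals; the spill size and method names are assumptions). If the combiner runs once per spill and once more when the spill files are merged, every map output record is seen once in phase one and each per-spill combined record is seen again in phase two, so COMBINE_INPUT_RECORDS exceeds MAP_OUTPUT_RECORDS, and COMBINE_OUTPUT_RECORDS exceeds the number of records the reducer finally receives:]

```java
// Toy simulation (assumed model, not Hadoop code): count records flowing
// into and out of the combiner when it runs once per spill and once more
// at merge time. The spill size of 2 is an arbitrary assumption.
public class CombineCounters {
    // Returns {combineInputRecords, combineOutputRecords} for one key.
    static int[] simulate(int mapOutputRecords, int spillSize) {
        int spills = mapOutputRecords / spillSize;
        int combineInput = mapOutputRecords;  // phase 1: each record combined once
        int combineOutput = spills;           // one combined record per spill file
        combineInput += spills;               // phase 2: merge re-combines spill outputs
        combineOutput += 1;                   // a single record reaches the reducer
        return new int[] { combineInput, combineOutput };
    }

    public static void main(String[] args) {
        int[] c = simulate(6, 2);  // six (word, 1) pairs for one key, 3 spills
        System.out.println("combine input = " + c[0] + ", map output = 6");
        System.out.println("combine output = " + c[1] + ", reduce input = 1");
    }
}
```

[Under these assumptions, 6 map output records produce 9 combine inputs and 4 combine outputs for 1 reduce input, matching the two inequalities observed in the question.]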