Posted to common-user@hadoop.apache.org by Le Zhao <le...@cs.cmu.edu> on 2010/01/27 17:57:08 UTC
When exactly is combiner invoked?
Hi - the combiner operates on a chunk of mapper output data, but where
exactly is the chunk cut off, or when exactly is the chunk fed to
the combiner?
1. Will it be after the mapper finishes processing an input record?
2. Will it be after the mapper outputs a key-value pair that hits the
memory limit?
This is important to know, because strategy 1 gives a stronger guarantee
against duplicate output records than 2, for example when one input record
for the mapper can produce multiple output records with the same key.
Thanks,
Le
Re: When exactly is combiner invoked?
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
But be careful, since combiners may execute "zero or more times"
depending upon mysterious internal logic. Relying upon combiners to do
significant work, as some of the Mahout clustering algorithms used to
do, will bite you.
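A minimal sketch of what "zero or more times" means in practice, written as a plain Python simulation rather than Hadoop code. The helper names (`group`, `sum_combine`, `run_job`) are hypothetical. It assumes a sum-style job where the combiner and reducer share the same associative, commutative logic, so the final result should not depend on how often the combiner runs:

```python
from collections import defaultdict

def group(pairs):
    """Group (key, value) pairs by key, as the framework does before reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def sum_combine(pairs):
    """Shared combiner/reducer logic: sum the values for each key."""
    return sorted((key, sum(values)) for key, values in group(pairs).items())

def run_job(map_output, combiner_runs):
    """Hypothetical driver: apply the combiner `combiner_runs` times, then reduce."""
    data = list(map_output)
    for _ in range(combiner_runs):
        data = sum_combine(data)
    # The reduce step does the real work, so it is correct even with 0 combiner runs.
    return dict(sum_combine(data))

map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
results = [run_job(map_output, n) for n in (0, 1, 2)]
```

All three entries of `results` come out equal here. If the job only produced correct output when the combiner had run, that would be the kind of dependence warned about above.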
Jeff
Gang Luo wrote:
> When the map function generates intermediate results, they are first sent to a buffer, where partitioning and sorting take place; if you specify a combiner, it is invoked at this point. This process runs in parallel with the map function. When the map function finishes, all the spills on disk are merged, and the combiner is also invoked at that time.
>
> -Gang
Re: When exactly is combiner invoked?
Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi Le,
I don't think MapReduce can completely combine all the records with the same key into one record. One case is when "min.num.spills.for.combine" is set too high: if there are fewer spills than that value, the combiner will not be invoked when those spills are merged, so records that share a key across different spills stay uncombined.
Actually, I think MapReduce does a merge sort, and in the last round of merging it loads one bucket from each of the spill files into memory. The combiner can only see and combine the records that currently reside in memory. If a record arrives after the previous part has been written back to disk, there is no chance for it to be combined with that previous part.
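The effect can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not Hadoop's actual implementation: `map_side`, `buffer_size` and `min_spills_for_combine` are hypothetical names, the buffer is counted in records rather than bytes, and the merge-time combiner is assumed to run only when the number of spills reaches the threshold.

```python
import heapq

def combine(sorted_pairs):
    """Sum the counts of adjacent equal keys in an already-sorted run."""
    out = []
    for key, count in sorted_pairs:
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + count)
        else:
            out.append((key, count))
    return out

def map_side(records, buffer_size, min_spills_for_combine):
    """Spill every `buffer_size` records, combining each spill, then merge."""
    spills = []
    for start in range(0, len(records), buffer_size):
        chunk = sorted(records[start:start + buffer_size])
        spills.append(combine(chunk))        # combiner sees one spill at a time
    merged = list(heapq.merge(*spills))      # merge the sorted spill files
    if len(spills) >= min_spills_for_combine:
        merged = combine(merged)             # merge-time combine, if enabled
    return merged

records = [("x", 1), ("y", 1), ("x", 1), ("x", 1)]
partial = map_side(records, buffer_size=2, min_spills_for_combine=3)
full = map_side(records, buffer_size=2, min_spills_for_combine=2)
```

With two spills and a threshold of 3, `partial` still contains two separate records for key "x", one from each spill. With a threshold of 2, `full` has a single record per key. Either way the reducer has to be prepared to see the same key more than once from one map task.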
-Gang
Re: When exactly is combiner invoked?
Posted by Le Zhao <le...@cs.cmu.edu>.
Gang, Jeff and Amogh,
Thanks for all the replies.
It seems that no matter how many times combiners are invoked internally, the
output of one specific map task will be *totally* partitioned and
combined. Then the data is shuffled/sent to the reducers.
That's good to know, because if combining isn't fully done on one map's
output, there might be problems. (E.g. when indexing a document, <word,
docid> pairs are the mapper's output; if records for the same
document end up not fully combined, the inverted index may end up
having duplicate records for the same <word, docid> tuple, so the reducer
has to do extra work.)
Also, it's a good idea to keep the combiner lightweight.
Thanks,
Le
Amogh Vasekar wrote:
> Hi,
> To elaborate a little on Gang's point, the buffer fill threshold is set by io.sort.spill.percent; each time it is reached, a spill is created. If the number of spills is at least min.num.spills.for.combine, the combiner gets invoked on the merged spills before the final output is written to disk.
> I'm not sure what exactly you mean by "finishes processing an input record". Typically, the processing (map) ends with an output.collect call.
>
> Amogh
>
>
>
Re: When exactly is combiner invoked?
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
To elaborate a little on Gang's point, the buffer fill threshold is set by io.sort.spill.percent; each time it is reached, a spill is created. If the number of spills is at least min.num.spills.for.combine, the combiner gets invoked on the merged spills before the final output is written to disk.
I'm not sure what exactly you mean by "finishes processing an input record". Typically, the processing (map) ends with an output.collect call.
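As a rough back-of-the-envelope illustration of how these two settings interact, the sketch below estimates the number of spills for a given amount of map output. It is only an approximation: `estimate_spills` is a hypothetical helper, the buffer size (io.sort.mb, not mentioned above) and the values 100 MB, 0.80 and 3 are assumed defaults, and the per-record accounting overhead in the real buffer is ignored.

```python
import math

def estimate_spills(map_output_bytes, sort_buffer_mb=100,
                    spill_percent=0.80, min_spills_for_combine=3):
    """Rough estimate only: ignores per-record accounting overhead."""
    # Bytes the buffer may hold before a spill is triggered.
    threshold_bytes = int(sort_buffer_mb * 1024 * 1024 * spill_percent)
    # Every full threshold, plus any remainder flushed at the end, is one spill.
    spills = math.ceil(map_output_bytes / threshold_bytes)
    # Assumption: the merge-time combiner runs only with enough spills.
    combine_on_merge = spills >= min_spills_for_combine
    return spills, combine_on_merge

MB = 1024 * 1024
small = estimate_spills(100 * MB)
large = estimate_spills(200 * MB)
```

Under these assumptions, 100 MB of map output gives about two spills, which is too few for the merge-time combiner, while 200 MB gives about three, which is enough.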
Amogh
Re: When exactly is combiner invoked?
Posted by Gang Luo <lg...@yahoo.com.cn>.
When the map function generates intermediate results, they are first sent to a buffer, where partitioning and sorting take place; if you specify a combiner, it is invoked at this point. This process runs in parallel with the map function. When the map function finishes, all the spills on disk are merged, and the combiner is also invoked at that time.
-Gang