Posted to common-user@hadoop.apache.org by Le Zhao <le...@cs.cmu.edu> on 2010/01/27 17:57:08 UTC

When exactly is combiner invoked?

Hi - the combiner operates on a chunk of mapper output data, but where 
exactly is the chunk cut off, and when exactly is the chunk fed to 
the combiner?

1. Will it be after the mapper finishes processing an input record?
2. Will it be after the mapper outputs a key value pair that hits the 
memory limit?

This is important to know, because strategy 1 gives a stronger guarantee 
against duplicate output records than strategy 2, for example when one input 
record for the mapper can produce multiple output records with the same key.

Thanks,
Le

Re: When exactly is combiner invoked?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
But be careful, since combiners may execute "zero or more times"
depending upon mysterious internal logic. Relying upon combiners to do
significant work, as some of the Mahout clustering algorithms used to
do, will bite you.

Jeff
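
A minimal sketch of what "zero or more times" implies, written as plain Python rather than Hadoop code. The helper names (combine, run, mean), the chunk size and the sample data are hypothetical and only for illustration. The point is that a sum-style combiner gives the same final result however many times it runs, while a mean-style combiner does not:

```python
from collections import defaultdict

def combine(pairs, fn):
    """Group (key, value) pairs by key and apply fn to each group's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, fn(values)) for key, values in sorted(groups.items())]

def run(pairs, fn, combiner_passes, chunk_size=2):
    """Apply the combiner combiner_passes times on fixed-size chunks, then reduce."""
    for _ in range(combiner_passes):
        combined = []
        for i in range(0, len(pairs), chunk_size):
            combined.extend(combine(pairs[i:i + chunk_size], fn))
        pairs = combined
    # The reducer sees every remaining value for a key.
    return dict(combine(pairs, fn))

def mean(values):
    return sum(values) / len(values)

map_output = [("a", 1), ("a", 2), ("a", 6), ("b", 4)]

# Same job with the combiner applied 0, 1 and 2 times.
sum_results = [run(list(map_output), sum, n) for n in (0, 1, 2)]
mean_results = [run(list(map_output), mean, n) for n in (0, 1, 2)]
```

Here every entry of sum_results is identical, while mean_results changes as soon as the combiner runs once, which is the kind of dependence on combiner behavior the warning above is about.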


Gang Luo wrote:
> When the map function generates intermediate results and first sends them to the buffer, partitioning and sorting start and, if you specify a combiner, it is invoked at this time. This process runs in parallel with the map function. When the map function finishes, all the spills on disk are merged, and the combiner can be invoked again at this time.
>
> -Gang


Re: When exactly is combiner invoked?

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi Le,
I don't think MapReduce can guarantee that all the records with the same key are combined into one record. One such situation is when "min.num.spills.for.combine" is set too high: if the map task produces fewer spills than that, the combiner will not be invoked on these records during the merge.

Actually, I think MapReduce does a merge sort, and in the last round of merging it loads one bucket from each of the spill files into memory. The combiner can only see and combine the records that currently reside in memory. If a record arrives after the previous part has been written back to disk, there is no chance for it to be combined with that part.

-Gang




Re: When exactly is combiner invoked?

Posted by Le Zhao <le...@cs.cmu.edu>.
Gang, Jeff and Amogh,

Thanks for all the replies.

It seems that no matter how many times the combiners are invoked internally, 
the output of one specific map task will be *totally* partitioned and 
combined.  Then the data is shuffled/sent to the reducers.

That's good to know, because if combining isn't fully done on one map's 
output, there might be problems.  (E.g. when indexing documents, <word, 
docid> pairs are the mapper's output; if records for the same document 
are not fully combined, the inverted index may end up having duplicate 
records for the same <word, docid> tuple, so the reducer has to do extra 
work.)

Also, it is a good idea to keep the combiner lightweight.

Thanks,
Le
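
For the inverted-index example above, one way to avoid depending on full combining is to have the reducer repeat the deduplication. The following is only a sketch in plain Python, not Hadoop code. The function names and sample documents are hypothetical, and it assumes the mapper emits one-element docid lists so that the combiner's input and output have the same shape:

```python
def map_document(docid, text):
    """Mapper: emit (word, [docid]) once per word occurrence."""
    return [(word, [docid]) for word in text.split()]

def combine_postings(partial_lists):
    """Combiner: merge and deduplicate whatever partial lists it happens to see."""
    docids = set()
    for partial in partial_lists:
        docids.update(partial)
    return sorted(docids)

def reduce_postings(word, partial_lists):
    """Reducer: repeat the same merge, so uncombined duplicates are harmless."""
    return word, combine_postings(partial_lists)

pairs = map_document("d1", "hadoop combiner hadoop") + map_document("d2", "hadoop")
hadoop_values = [docids for word, docids in pairs if word == "hadoop"]

# Combiner never ran: the reducer still removes the duplicate d1.
without_combiner = reduce_postings("hadoop", hadoop_values)

# Combiner ran on only part of the map output: same final posting list.
partly_combined = reduce_postings(
    "hadoop", [combine_postings(hadoop_values[:2]), *hadoop_values[2:]]
)
```

With this shape the combiner only reduces the amount of data shuffled; the posting list is the same whether or not it ran.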


Re: When exactly is combiner invoked?

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
To elaborate a little on Gang's point, the buffer threshold is set by io.sort.spill.percent; a spill is created each time the buffer fills to that threshold. If the number of spills is at least min.num.spills.for.combine, the combiner is invoked again on the spills while they are merged, before the merged output is written to disk.
I'm not sure what exactly you mean by "finish processing an input record". Typically, the processing (map) ends with an output.collect call.

Amogh
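
The interaction of the two settings can be sketched with a toy model in plain Python. This is not Hadoop code and makes several simplifying assumptions: the buffer is counted in records rather than bytes, there is a single partition, spilling is synchronous, and the names map_side, buffer_records, spill_percent and min_spills_for_combine are hypothetical stand-ins for the real mechanism and its configuration keys:

```python
from collections import defaultdict

def combine(pairs):
    """Sum the values of each key; stands in for a user-supplied combiner."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def map_side(records, buffer_records, spill_percent, min_spills_for_combine):
    """Toy model of one map task's collect, spill and merge path."""
    threshold = max(1, int(buffer_records * spill_percent))
    buffer, spills = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= threshold:
            spills.append(combine(buffer))  # sort and combine each spill
            buffer = []
    if buffer:
        spills.append(combine(buffer))  # final flush when the map ends
    merged = sorted(pair for spill in spills for pair in spill)
    if len(spills) >= min_spills_for_combine:
        merged = combine(merged)  # combiner runs again during the merge
    return len(spills), merged

records = [("x", 1), ("y", 1), ("x", 1), ("x", 1), ("y", 1), ("x", 1)]
few_spills = map_side(records, buffer_records=5, spill_percent=0.8,
                      min_spills_for_combine=3)
enough_spills = map_side(records, buffer_records=5, spill_percent=0.8,
                         min_spills_for_combine=2)
```

In this model both runs produce two spills, but only the second run combines across them, so the first run leaves two records for each key in the map output.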



Re: When exactly is combiner invoked?

Posted by Gang Luo <lg...@yahoo.com.cn>.
When the map function generates intermediate results and first sends them to the buffer, partitioning and sorting start and, if you specify a combiner, it is invoked at this time. This process runs in parallel with the map function. When the map function finishes, all the spills on disk are merged, and the combiner can be invoked again at this time.

-Gang


