Posted to mapreduce-user@hadoop.apache.org by "Felix.徐" <yg...@gmail.com> on 2013/07/16 09:52:03 UTC

Collect, Spill and Merge phases insight

Hi all,

I am trying to understand the Collect, Spill and Merge process on the map
side. I have read some of the documentation but still have a few questions.

Here is my understanding of the spill phase in map:

1. The Collect function adds a record to the buffer.
2. If the buffer exceeds a threshold (determined by parameters like
io.sort.mb), the spill phase begins.
3. The spill phase consists of three actions: sort, combine and compress.
4. A spill may be performed multiple times, so several spill files may be
generated.
5. If there is more than one spill file, the Merge phase begins and merges
these files into a single large file.
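
The five steps above can be pictured with a toy Python sketch. It only mirrors the description in this post and is not Hadoop's implementation. Everything in it is an assumption made for illustration: the class name ToyMapOutputBuffer, the spill_threshold parameter (a record count standing in for a byte-size limit such as io.sort.mb), a combiner that sums integer values per key, and zlib plus JSON standing in for the real compression and file format. Spills are kept in memory rather than written to disk.

```python
import heapq
import json
import zlib


class ToyMapOutputBuffer:
    """Toy model of the five steps listed above; not Hadoop's implementation."""

    def __init__(self, spill_threshold=4):
        # Hypothetical stand-in for a size limit such as io.sort.mb.
        self.spill_threshold = spill_threshold
        self.buffer = []
        # Each spill is a compressed, sorted run held in memory.
        self.spills = []

    def collect(self, key, value):
        # Step 1: append the record. Step 2: spill once the threshold is reached.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.spill_threshold:
            self.spill()

    def spill(self):
        # Step 3: combine (sum values per key), sort, then compress.
        if not self.buffer:
            return
        combined = {}
        for key, value in self.buffer:
            combined[key] = combined.get(key, 0) + value
        run = sorted(combined.items())
        self.spills.append(zlib.compress(json.dumps(run).encode("utf-8")))
        self.buffer = []

    def merge(self):
        # Step 5: flush what is left, then merge all sorted runs into one.
        self.spill()
        runs = [json.loads(zlib.decompress(s).decode("utf-8")) for s in self.spills]
        merged = []
        for key, value in heapq.merge(*runs):
            if merged and merged[-1][0] == key:
                # Same key seen in an earlier run: combine again.
                merged[-1] = (key, merged[-1][1] + value)
            else:
                merged.append((key, value))
        return merged
```

Feeding it a handful of (word, 1) records with a small spill_threshold produces several spills (step 4), and merge() returns one sorted list with the counts combined across spills.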

If I have misunderstood any of these phases, please correct me.
Thanks!
And my questions are:

1. Where is the partition calculated (in Collect or in Spill)? Does
Collect simply append a record to the buffer and check whether the buffer
should be spilled?

2. In the Merge phase, since the spill files are compressed, do they need to
be decompressed and then compressed again? Since Merge may take more than
one round, are the intermediate files compressed as well?

3. Is the Merge phase on the map side almost the same as on the reduce side
(external merge sort combined with a min-heap)?
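
To make clear what "external merge sort combined with a min-heap" means in question 3, here is a minimal Python sketch of a k-way merge over already-sorted runs. It illustrates the general technique only and makes no claim about how Hadoop's merger is written. The function name k_way_merge is made up for this example, and the runs are plain in-memory lists rather than files on disk.

```python
import heapq


def k_way_merge(runs):
    """Merge already-sorted lists into one sorted list using a min-heap."""
    heap = []
    for run_index, run in enumerate(runs):
        if run:
            # Entry: (record, run index, position in run). The run index
            # breaks ties between equal records.
            heapq.heappush(heap, (run[0], run_index, 0))
    merged = []
    while heap:
        # The heap always holds the next unread record of each run,
        # so its smallest entry is the next record of the output.
        record, run_index, pos = heapq.heappop(heap)
        merged.append(record)
        if pos + 1 < len(runs[run_index]):
            heapq.heappush(heap, (runs[run_index][pos + 1], run_index, pos + 1))
    return merged
```

The heap holds at most one record per run, so memory use depends on the number of runs rather than on their total size, which is why this pattern suits data that does not fit in memory.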

Re: Collect, Spill and Merge phases insight

Posted by Stephen Boesch <ja...@gmail.com>.
Great questions. I am also looking forward to answers from the experts here.


2013/7/16 Felix.徐 <yg...@gmail.com>

> Hi all,
>
> I am trying to understand the process of Collect, Spill and Merge in Map,
> I've referred to a few documentations but still have a few questions.
>
> Here is my understanding about the spill phase in map:
>
> 1.Collect function add a record into the buffer.
> 2.If the buffer exceeds a threshold (determined by parameters like
> io.sort.mb), spill phase begins.
> 3.Spill phase includes 3 actions : sort , combine and compression.
> 4.Spill may be performed multiple times thus a few spilled files will be
> generated.
> 5.If there are more than 1 spilled files, Merge phase begins and merge
> these files into a big one.
>
> If there is any miss understanding about these phases, please correct me
> ,thanks!
> And my questions are:
>
> 1.Where is the partition being calculated (in Collect or Spill) ?  Does
> Collect simply append a record into the buffer and check whether we should
> spill the buffer?
>
> 2.At Merge phase, since the spilled files are compressed, does it need to
> uncompressed these files and compress them again? Since Merge may be
> performed more than 1 round, does it compress intermediate files?
>
> 3.Does the Merge phase at Map and Reduce side almost the same (External
> merge-sort combined with Min-Heap) ?
>
>
