You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hadoop.apache.org by Martin Dobmeier <ma...@gmail.com> on 2012/09/13 16:04:36 UTC

How does map-merge work exactly?

Hi all,

I'm greatly confused about the spill/sort/merge thing going on during the
Map phase.

Here are some stats:
- io.sort.mb = 256 MB (80% spill threshold)
- io.sort.factor = 64
- spills performed during Map: 117
- number of reducers: 96

Now I'm having real trouble understanding the following log output.

...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Down to the last merge-pass, with 0 segments left of total
size: 0 bytes
...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Merging 54 intermediate segments out of a total of 56
mapred.Merger: Down to the last merge-pass, with 3 segments left of total
size: 67119046 bytes
...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Merging 54 intermediate segments out of a total of 117
mapred.Merger: Down to the last merge-pass, with 64 segments left of total
size: 1609011189 bytes
...

What exactly is a segment? Is it the number of spills?
What does "0 segments left" mean? Does it mean that the merge could be
performed on the first pass?
Why are only 54 segments merged instead of "io.sort.factor" segments?
(io.sort.factor determines the number of files to merge during a pass,
right?)
Why is the merge performed "number of reducers" times? (I'm counting the
phrase "Merging 117 segments" exactly 96 times)

Thanks a lot!
Martin

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
Thanks, Chris. This cleared a lot of things up for me.

Martin

On Thu, Sep 20, 2012 at 1:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > Ah, alright. But why is Hadoop telling me that there are 117 segments
> given
> > that only 96 reducers have been configured?
> > (btw, I'm using Hadoop 1.0.0)
>
> There were 117 spills, so the merger starts with 117 files, does an
> intermediate merge of 54 segments (#reducers = 96 times), then a final
> merge of 64 segments (96 times). All of those layers produce log
> statements.
>
> > So the merger is called "number of reducers" times because it combines
> the
> > data for a particular reducer which is spread over all spill files,
> right?
>
> Yup, you have it. -C
>

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
Thanks, Chris. This cleared a lot of things up for me.

Martin

On Thu, Sep 20, 2012 at 1:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > Ah, alright. But why is Hadoop telling me that there are 117 segments
> given
> > that only 96 reducers have been configured?
> > (btw, I'm using Hadoop 1.0.0)
>
> There were 117 spills, so the merger starts with 117 files, does an
> intermediate merge of 54 segments (#reducers = 96 times), then a final
> merge of 64 segments (96 times). All of those layers produce log
> statements.
>
> > So the merger is called "number of reducers" times because it combines
> the
> > data for a particular reducer which is spread over all spill files,
> right?
>
> Yup, you have it. -C
>

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
Thanks, Chris. This cleared a lot of things up for me.

Martin

On Thu, Sep 20, 2012 at 1:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > Ah, alright. But why is Hadoop telling me that there are 117 segments
> given
> > that only 96 reducers have been configured?
> > (btw, I'm using Hadoop 1.0.0)
>
> There were 117 spills, so the merger starts with 117 files, does an
> intermediate merge of 54 segments (#reducers = 96 times), then a final
> merge of 64 segments (96 times). All of those layers produce log
> statements.
>
> > So the merger is called "number of reducers" times because it combines
> the
> > data for a particular reducer which is spread over all spill files,
> right?
>
> Yup, you have it. -C
>

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
Thanks, Chris. This cleared a lot of things up for me.

Martin

On Thu, Sep 20, 2012 at 1:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > Ah, alright. But why is Hadoop telling me that there are 117 segments
> given
> > that only 96 reducers have been configured?
> > (btw, I'm using Hadoop 1.0.0)
>
> There were 117 spills, so the merger starts with 117 files, does an
> intermediate merge of 54 segments (#reducers = 96 times), then a final
> merge of 64 segments (96 times). All of those layers produce log
> statements.
>
> > So the merger is called "number of reducers" times because it combines
> the
> > data for a particular reducer which is spread over all spill files,
> right?
>
> Yup, you have it. -C
>

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> Ah, alright. But why is Hadoop telling me that there are 117 segments given
> that only 96 reducers have been configured?
> (btw, I'm using Hadoop 1.0.0)

There were 117 spills, so the merger starts with 117 files, does an
intermediate merge of 54 segments (#reducers = 96 times), then a final
merge of 64 segments (96 times). All of those layers produce log
statements.

> So the merger is called "number of reducers" times because it combines the
> data for a particular reducer which is spread over all spill files, right?

Yup, you have it. -C

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> Ah, alright. But why is Hadoop telling me that there are 117 segments given
> that only 96 reducers have been configured?
> (btw, I'm using Hadoop 1.0.0)

There were 117 spills, so the merger starts with 117 files, does an
intermediate merge of 54 segments (#reducers = 96 times), then a final
merge of 64 segments (96 times). All of those layers produce log
statements.

> So the merger is called "number of reducers" times because it combines the
> data for a particular reducer which is spread over all spill files, right?

Yup, you have it. -C

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> Ah, alright. But why is Hadoop telling me that there are 117 segments given
> that only 96 reducers have been configured?
> (btw, I'm using Hadoop 1.0.0)

There were 117 spills, so the merger starts with 117 files, does an
intermediate merge of 54 segments (#reducers = 96 times), then a final
merge of 64 segments (96 times). All of those layers produce log
statements.

> So the merger is called "number of reducers" times because it combines the
> data for a particular reducer which is spread over all spill files, right?

Yup, you have it. -C

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Tue, Sep 18, 2012 at 7:02 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> Ah, alright. But why is Hadoop telling me that there are 117 segments given
> that only 96 reducers have been configured?
> (btw, I'm using Hadoop 1.0.0)

There were 117 spills, so the merger starts with 117 files, does an
intermediate merge of 54 segments (#reducers = 96 times), then a final
merge of 64 segments (96 times). All of those layers produce log
statements.

> So the merger is called "number of reducers" times because it combines the
> data for a particular reducer which is spread over all spill files, right?

Yup, you have it. -C

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
>> What exactly is a segment? Is it the number of spills?
>A segment in this context is a fraction of spill output for a particular
reduce. Each spill contains a segment for every reduce.

Ah, alright. But why is Hadoop telling me that there are 117 segments given
that only 96 reducers have been configured?
(btw, I'm using Hadoop 1.0.0)

>> Why are only 54 segments merged instead of "io.sort.factor" segments?
(io.sort.factor determines the number of files to merge during a pass,
right?)
> The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

Ok, that makes sense.

>> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)
> Each invocation of the merger is combining all the output assigned to a
reduce by the partitioner.

So the merger is called "number of reducers" times because it combines the
data for a particular reducer which is spread over all spill files, right?

Martin

On Mon, Sep 17, 2012 at 10:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > What exactly is a segment? Is it the number of spills?
>
> A segment in this context is a fraction of spill output for a
> particular reduce. Each spill contains a segment for every reduce.
>
> > What does "0 segments left" mean? Does it mean that the merge could be
> > performed on the first pass?
> > Why are only 54 segments merged instead of "io.sort.factor" segments?
>
> The intermediate merge of 54 files to 1 reduces the number of files to
> 117 - 53 = 64 segments. The final merge is over 64 segments.
>
> > (io.sort.factor determines the number of files to merge during a pass,
> > right?)
> > Why is the merge performed "number of reducers" times? (I'm counting the
> > phrase "Merging 117 segments" exactly 96 times)
>
> Each invocation of the merger is combining all the output assigned to
> a reduce by the partitioner. -C
>

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
>> What exactly is a segment? Is it the number of spills?
>A segment in this context is a fraction of spill output for a particular
reduce. Each spill contains a segment for every reduce.

Ah, alright. But why is Hadoop telling me that there are 117 segments given
that only 96 reducers have been configured?
(btw, I'm using Hadoop 1.0.0)

>> Why are only 54 segments merged instead of "io.sort.factor" segments?
(io.sort.factor determines the number of files to merge during a pass,
right?)
> The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

Ok, that makes sense.

>> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)
> Each invocation of the merger is combining all the output assigned to a
reduce by the partitioner.

So the merger is called "number of reducers" times because it combines the
data for a particular reducer which is spread over all spill files, right?

Martin

On Mon, Sep 17, 2012 at 10:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > What exactly is a segment? Is it the number of spills?
>
> A segment in this context is a fraction of spill output for a
> particular reduce. Each spill contains a segment for every reduce.
>
> > What does "0 segments left" mean? Does it mean that the merge could be
> > performed on the first pass?
> > Why are only 54 segments merged instead of "io.sort.factor" segments?
>
> The intermediate merge of 54 files to 1 reduces the number of files to
> 117 - 53 = 64 segments. The final merge is over 64 segments.
>
> > (io.sort.factor determines the number of files to merge during a pass,
> > right?)
> > Why is the merge performed "number of reducers" times? (I'm counting the
> > phrase "Merging 117 segments" exactly 96 times)
>
> Each invocation of the merger is combining all the output assigned to
> a reduce by the partitioner. -C
>

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
>> What exactly is a segment? Is it the number of spills?
>A segment in this context is a fraction of spill output for a particular
reduce. Each spill contains a segment for every reduce.

Ah, alright. But why is Hadoop telling me that there are 117 segments given
that only 96 reducers have been configured?
(btw, I'm using Hadoop 1.0.0)

>> Why are only 54 segments merged instead of "io.sort.factor" segments?
(io.sort.factor determines the number of files to merge during a pass,
right?)
> The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

Ok, that makes sense.

>> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)
> Each invocation of the merger is combining all the output assigned to a
reduce by the partitioner.

So the merger is called "number of reducers" times because it combines the
data for a particular reducer which is spread over all spill files, right?

Martin

On Mon, Sep 17, 2012 at 10:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > What exactly is a segment? Is it the number of spills?
>
> A segment in this context is a fraction of spill output for a
> particular reduce. Each spill contains a segment for every reduce.
>
> > What does "0 segments left" mean? Does it mean that the merge could be
> > performed on the first pass?
> > Why are only 54 segments merged instead of "io.sort.factor" segments?
>
> The intermediate merge of 54 files to 1 reduces the number of files to
> 117 - 53 = 64 segments. The final merge is over 64 segments.
>
> > (io.sort.factor determines the number of files to merge during a pass,
> > right?)
> > Why is the merge performed "number of reducers" times? (I'm counting the
> > phrase "Merging 117 segments" exactly 96 times)
>
> Each invocation of the merger is combining all the output assigned to
> a reduce by the partitioner. -C
>

Re: How does map-merge work exactly?

Posted by Martin Dobmeier <ma...@gmail.com>.
>> What exactly is a segment? Is it the number of spills?
>A segment in this context is a fraction of spill output for a particular
reduce. Each spill contains a segment for every reduce.

Ah, alright. But why is Hadoop telling me that there are 117 segments given
that only 96 reducers have been configured?
(btw, I'm using Hadoop 1.0.0)

>> Why are only 54 segments merged instead of "io.sort.factor" segments?
(io.sort.factor determines the number of files to merge during a pass,
right?)
> The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

Ok, that makes sense.

>> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)
> Each invocation of the merger is combining all the output assigned to a
reduce by the partitioner.

So the merger is called "number of reducers" times because it combines the
data for a particular reducer which is spread over all spill files, right?

Martin

On Mon, Sep 17, 2012 at 10:21 AM, Chris Douglas <cd...@apache.org> wrote:

> On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
> <ma...@gmail.com> wrote:
> > What exactly is a segment? Is it the number of spills?
>
> A segment in this context is a fraction of spill output for a
> particular reduce. Each spill contains a segment for every reduce.
>
> > What does "0 segments left" mean? Does it mean that the merge could be
> > performed on the first pass?
> > Why are only 54 segments merged instead of "io.sort.factor" segments?
>
> The intermediate merge of 54 files to 1 reduces the number of files to
> 117 - 53 = 64 segments. The final merge is over 64 segments.
>
> > (io.sort.factor determines the number of files to merge during a pass,
> > right?)
> > Why is the merge performed "number of reducers" times? (I'm counting the
> > phrase "Merging 117 segments" exactly 96 times)
>
> Each invocation of the merger is combining all the output assigned to
> a reduce by the partitioner. -C
>

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> What exactly is a segment? Is it the number of spills?

A segment in this context is a fraction of spill output for a
particular reduce. Each spill contains a segment for every reduce.

> What does "0 segments left" mean? Does it mean that the merge could be
> performed on the first pass?
> Why are only 54 segments merged instead of "io.sort.factor" segments?

The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

> (io.sort.factor determines the number of files to merge during a pass,
> right?)
> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)

Each invocation of the merger is combining all the output assigned to
a reduce by the partitioner. -C

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> What exactly is a segment? Is it the number of spills?

A segment in this context is a fraction of spill output for a
particular reduce. Each spill contains a segment for every reduce.

> What does "0 segments left" mean? Does it mean that the merge could be
> performed on the first pass?
> Why are only 54 segments merged instead of "io.sort.factor" segments?

The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

> (io.sort.factor determines the number of files to merge during a pass,
> right?)
> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)

Each invocation of the merger is combining all the output assigned to
a reduce by the partitioner. -C

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> What exactly is a segment? Is it the number of spills?

A segment in this context is a fraction of spill output for a
particular reduce. Each spill contains a segment for every reduce.

> What does "0 segments left" mean? Does it mean that the merge could be
> performed on the first pass?
> Why are only 54 segments merged instead of "io.sort.factor" segments?

The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

> (io.sort.factor determines the number of files to merge during a pass,
> right?)
> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)

Each invocation of the merger is combining all the output assigned to
a reduce by the partitioner. -C

Re: How does map-merge work exactly?

Posted by Chris Douglas <cd...@apache.org>.
On Thu, Sep 13, 2012 at 7:04 AM, Martin Dobmeier
<ma...@gmail.com> wrote:
> What exactly is a segment? Is it the number of spills?

A segment in this context is a fraction of spill output for a
particular reduce. Each spill contains a segment for every reduce.

> What does "0 segments left" mean? Does it mean that the merge could be
> performed on the first pass?
> Why are only 54 segments merged instead of "io.sort.factor" segments?

The intermediate merge of 54 files to 1 reduces the number of files to
117 - 53 = 64 segments. The final merge is over 64 segments.

> (io.sort.factor determines the number of files to merge during a pass,
> right?)
> Why is the merge performed "number of reducers" times? (I'm counting the
> phrase "Merging 117 segments" exactly 96 times)

Each invocation of the merger is combining all the output assigned to
a reduce by the partitioner. -C