Posted to user@hadoop.apache.org by Jason Yang <li...@gmail.com> on 2012/09/20 15:12:08 UTC

Will all the intermediate output with the same key go to the same reducer?

Hi, all

I would like to know whether all the intermediate output with the same
key goes to the same reducer.

If it does, what happens when the mappers generate only two keys but there
are 3 reducers running in the job?

If not, how could I do some processing over all the data, like counting? I
think some would suggest setting the number of reducers to 1, but I think
this would make the reducer a bottleneck when there is a large volume of
intermediate output, wouldn't it?

-- 
YANG, Lin

Re: Will all the intermediate output with the same key go to the same reducer?

Posted by feng lu <am...@gmail.com>.
Hi

>>
If not, how could I do some processing over all the data, like counting?
<<
Maybe you can refer to the TeraSort example in Hadoop. It uses a partitioner
that splits text keys into roughly equal partitions in globally sorted
order.
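
To illustrate the idea of partitioning by key range, here is a minimal standalone sketch in plain Java with no Hadoop dependency. It is not TeraSort's actual code: the class name RangePartitionSketch and the split points "h" and "p" are made up for illustration, and a real job would normally derive its split points from the input data rather than hard-code them.

```java
import java.util.Arrays;

/** Standalone sketch of range partitioning; hypothetical names, no Hadoop classes. */
final class RangePartitionSketch {

    /**
     * Returns the partition index for a key.
     * splitPoints must be sorted; n split points define n + 1 partitions.
     * A key equal to a split point goes to the partition on its right.
     */
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "p"}; // assumed split points, 3 partitions
        for (String key : new String[] {"apple", "hadoop", "zebra"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the reducers' sorted outputs in partition order gives a globally sorted result.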

On Thu, Sep 20, 2012 at 9:28 PM, Hemanth Yamijala <yhemanth@thoughtworks.com> wrote:

> Hi,
>
> Yes. By contract, all intermediate output with the same key goes to
> the same reducer.
>
> In your example, suppose that, of the two keys generated by the mapper,
> one goes to reducer 1 and the other goes to reducer 2. Reducer 3
> will then have no records to process and will end without producing any
> output.
>
> If the intermediate key space is very large, 1 reducer would certainly
> be a bottleneck, as you rightly note. Hence, configuring the right
> number of reducers is certainly important.
>
> Thanks
> hemanth
>
> On 9/20/12, Jason Yang <li...@gmail.com> wrote:
> > Hi, all
> >
> > I would like to know whether all the intermediate output with the same
> > key goes to the same reducer.
> >
> > If it does, what happens when the mappers generate only two keys but there
> > are 3 reducers running in the job?
> >
> > If not, how could I do some processing over all the data, like counting? I
> > think some would suggest setting the number of reducers to 1, but I think
> > this would make the reducer a bottleneck when there is a large volume of
> > intermediate output, wouldn't it?
> >
> > --
> > YANG, Lin
> >
>



-- 
Don't Grow Old, Grow Up... :-)

Re: Will all the intermediate output with the same key go to the same reducer?

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi,

Yes. By contract, all intermediate output with the same key goes to
the same reducer.

In your example, suppose that, of the two keys generated by the mapper,
one goes to reducer 1 and the other goes to reducer 2. Reducer 3
will then have no records to process and will end without producing any
output.
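
To make the routing concrete, here is a minimal standalone sketch in plain Java with no Hadoop dependency. It assumes the job uses the default hash partitioning, whose arithmetic is (hashCode & Integer.MAX_VALUE) % numReducers. The class name PartitionSketch and the keys "a" and "b" are made up for illustration, and reducers are numbered from 0 here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Standalone sketch of hash partitioning; hypothetical names, no Hadoop classes. */
final class PartitionSketch {

    /** Masks the sign bit of the hash, then takes the modulus over the reducer count. */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Groups the emitted keys by the reducer index they would be routed to. */
    static Map<Integer, List<String>> assign(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (int r = 0; r < numReducers; r++) {
            byReducer.put(r, new ArrayList<>());
        }
        for (String key : keys) {
            byReducer.get(partitionFor(key, numReducers)).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Two distinct keys, emitted several times, and three reducers.
        List<String> keys = Arrays.asList("a", "b", "a", "b", "a");
        // Prints {0=[], 1=[a, a, a], 2=[b, b]}
        System.out.println(assign(keys, 3));
    }
}
```

Every occurrence of a key lands on the same reducer because the index depends only on the key and the reducer count. With these particular keys, reducer 0 is the one left without input. Two distinct keys can also hash to the same index, in which case two of the three reducers would be idle.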

If the intermediate key space is very large, 1 reducer would certainly
be a bottleneck, as you rightly note. Hence, configuring the right
number of reducers is certainly important.

Thanks
hemanth

On 9/20/12, Jason Yang <li...@gmail.com> wrote:
> Hi, all
>
> I would like to know whether all the intermediate output with the same
> key goes to the same reducer.
>
> If it does, what happens when the mappers generate only two keys but there
> are 3 reducers running in the job?
>
> If not, how could I do some processing over all the data, like counting? I
> think some would suggest setting the number of reducers to 1, but I think
> this would make the reducer a bottleneck when there is a large volume of
> intermediate output, wouldn't it?
>
> --
> YANG, Lin
>

Re: Will all the intermediate output with the same key go to the same reducer?

Posted by Sambit Tripathy <sa...@gmail.com>.
Hi,

Have you considered using the in-mapper combining pattern? That is, inside your
Mapper object you can create a Map object holding the intermediate
key-value pairs, whose state is preserved across multiple calls of the map method.
The values are emitted only periodically, when a certain threshold is
reached (threshold = ratio between block size and memory consumed). You can
use a counter to check the number of key-value pairs that have been
processed. This substantially reduces the problem of the "reducer to be the
bottleneck when there are large volume of intermediate output", because the
intermediate key-value pairs are already aggregated in memory and flushed at
a specific bucket size, so far fewer of them reach the reducer.
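
The pattern described above can be sketched as follows, in plain Java with no Hadoop dependency. This is only an illustration for a counting job: the class name InMapperCombinerSketch is made up, and the flush threshold is simplified to a maximum number of distinct buffered keys, which stands in for the memory-based threshold mentioned above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Standalone sketch of in-mapper combining for counting; hypothetical names. */
final class InMapperCombinerSketch {
    private final int maxBufferedKeys; // assumed threshold: distinct keys held in memory
    private final Map<String, Integer> buffer = new HashMap<>();
    private final List<Map.Entry<String, Integer>> emitted = new ArrayList<>();

    InMapperCombinerSketch(int maxBufferedKeys) {
        this.maxBufferedKeys = maxBufferedKeys;
    }

    /** Called once per input record, in the role of a mapper's map method. */
    void map(String key) {
        buffer.merge(key, 1, Integer::sum); // aggregate in memory instead of emitting (key, 1)
        if (buffer.size() >= maxBufferedKeys) {
            flush();
        }
    }

    /** Called once at the end of the input, in the role of a mapper's cleanup method. */
    void cleanup() {
        flush();
    }

    /** Emits one partial count per buffered key, then empties the buffer. */
    private void flush() {
        for (Map.Entry<String, Integer> e : buffer.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        }
        buffer.clear();
    }

    /** Number of intermediate records that would be sent to the reducers. */
    int emittedCount() {
        return emitted.size();
    }

    /** Sum of the partial counts emitted for one key. */
    int emittedTotal(String key) {
        int total = 0;
        for (Map.Entry<String, Integer> e : emitted) {
            if (e.getKey().equals(key)) {
                total += e.getValue();
            }
        }
        return total;
    }
}
```

For example, feeding 1000 records that alternate between two keys into an instance with a threshold of 100 yields 2 intermediate records instead of 1000, while the per-key totals stay the same. The reducer still has to sum the partial counts, since a key may be flushed more than once.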


Thanks
Sambit Tripathy



On Thu, Sep 20, 2012 at 6:42 PM, Jason Yang <li...@gmail.com> wrote:

> Hi, all
>
> I would like to know whether all the intermediate output with the same
> key goes to the same reducer.
>
> If it does, what happens when the mappers generate only two keys but there
> are 3 reducers running in the job?
>
> If not, how could I do some processing over all the data, like counting? I
> think some would suggest setting the number of reducers to 1, but I think
> this would make the reducer a bottleneck when there is a large volume of
> intermediate output, wouldn't it?
>
> --
> YANG, Lin
>
>
