Posted to mapreduce-user@hadoop.apache.org by "Clements, Michael" <Mi...@disney.com> on 2010/01/19 21:59:18 UTC

chained mappers & reducers

These two classes are not as symmetric as their names suggest. ChainMapper
does what I expected: it chains multiple map steps. But ChainReducer does
not chain reduce steps; it chains map steps that follow a single reduce
step. At least, that is my understanding from the API docs and examples
I've read.

Is there a way to chain multiple reduce steps? I've got a job that needs
an M1-R1-R2 pipeline. It currently runs as two jobs: M1-R1 followed by
M2-R2, where M2 is an identity pass-through mapper. If there were a way to
chain two reduce steps the way ChainMapper chains map steps, I could make
this a single job, eliminating the overhead of the second job and all the
unnecessary intermediate I/O.
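
For reference, here is roughly how the current two-pass version is wired
up (a trimmed sketch against the old mapred API; M1, R1 and R2 stand in
for my real classes, and the paths and key/value types are made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        // Pass 1: M1 -> shuffle -> R1, writing (K2, count) pairs
        JobConf pass1 = new JobConf(TwoPassDriver.class);
        pass1.setMapperClass(M1.class);                // real mapper
        pass1.setReducerClass(R1.class);               // real reducer
        pass1.setOutputKeyClass(Text.class);           // K2
        pass1.setOutputValueClass(LongWritable.class); // count
        // SequenceFile output so pass 2 gets typed (K2, count) records
        pass1.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(pass1, new Path(args[0]));
        FileOutputFormat.setOutputPath(pass1, new Path("tmp/pass1"));
        JobClient.runJob(pass1); // blocks until pass 1 completes

        // Pass 2: identity map -> shuffle -> R2 sums counts per K2
        JobConf pass2 = new JobConf(TwoPassDriver.class);
        pass2.setMapperClass(IdentityMapper.class);    // the wasted map phase
        pass2.setReducerClass(R2.class);
        pass2.setOutputKeyClass(Text.class);
        pass2.setOutputValueClass(LongWritable.class);
        pass2.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(pass2, new Path("tmp/pass1"));
        FileOutputFormat.setOutputPath(pass2, new Path(args[1]));
        JobClient.runJob(pass2);
      }
    }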

Thanks

Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile



RE: chained mappers & reducers

Posted by "Clements, Michael" <Mi...@disney.com>.
That is what I figured. I saw the ChainReducer class and got all excited
about chaining multiple reducers, but that is not what the class does.
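
For the archives, the usage pattern makes the contract obvious (a
fragment against the old mapred API; R1, PostMap and the key/value types
are placeholders, not my real job):

    // ChainReducer sets ONE reducer, then appends mappers AFTER it:
    ChainReducer.setReducer(job, R1.class,
        Text.class, LongWritable.class,   // reduce input K/V
        Text.class, LongWritable.class,   // reduce output K/V
        true, new JobConf(false));
    // addMapper adds a post-reduce MAP step, not a second reduce
    ChainReducer.addMapper(job, PostMap.class,
        Text.class, LongWritable.class,
        Text.class, LongWritable.class,
        true, new JobConf(false));

So the "chain" after the reducer is map-only.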

 

I see your point about it being impossible in general to chain reducers.
Chained mappers are fine: each mapper emits key-value pairs independently,
so no mapper needs a global view of all keys or values. But chaining
reducers is much harder, because the second reducer needs the full set of
values for each key, and those values are scattered across the outputs of
many first-stage reducer tasks. For example, if R1 task 0 emits (K2, 3)
and R1 task 1 emits (K2, 5), R2 must see both values, which only another
shuffle can guarantee.

 

Thanks




Re: chained mappers & reducers

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Unless you can somehow guarantee that any given output key comes from only one R1 task (which seems very unlikely, and not useful in your case), I'm afraid you'll need a subsequent MR job. The thing is, Hadoop has no built-in mechanism for reducer tasks to exchange data :)
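
To make that concrete: only if every K2 came out of a single R1 task
could you fold the summing into R1 itself, along these lines (a rough
sketch against the old mapred API; deriveK2 is a hypothetical stand-in
for however K2 is derived, and emitting from close() is the usual trick
for per-task output):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class R1WithLocalSums extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      // String keys: Hadoop reuses the Text key object between calls,
      // so it must be copied before being stored in the map.
      private final Map<String, Long> totals = new HashMap<String, Long>();
      private OutputCollector<Text, LongWritable> out;

      public void reduce(Text k1, Iterator<LongWritable> values,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        out = output;               // kept so close() can emit
        String k2 = deriveK2(k1);   // hypothetical K2 derivation
        long count = 0;
        while (values.hasNext()) {  // "a count" from the group; here
          values.next();            // simply the group size
          count++;
        }
        Long prev = totals.get(k2);
        totals.put(k2, prev == null ? count : prev + count);
      }

      public void close() throws IOException {
        if (out == null) return;    // this task saw no groups
        for (Map.Entry<String, Long> e : totals.entrySet()) {
          out.collect(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
      }

      private String deriveK2(Text k1) { // placeholder logic
        return "k2:" + k1.toString();
      }
    }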

Amogh






RE: chained mappers & reducers

Posted by "Clements, Michael" <Mi...@disney.com>.
The use case is this: M1-R1-R2

M1: generate K1-V1 pairs from the input
R1: group by K1; from each group, generate a new key K2 with value V2, a count

M2: identity pass-through
R2: sum the counts for each K2

In short, R1 does this:
groups the data by the K1 defined by M1
emits new keys K2, derived from the groups it built
attaches a count to each key K2

R2 then sums the counts for each K2.

The output of R1 could be fed directly into R2, but I can't find a way to
do that in Hadoop. So I have to create a second job, and since every job
must have a map phase, I add a pass-through mapper. This works, but it
carries a lot of overhead. It would be faster and cleaner to run R1
straight into R2 within the same job, if that were possible.
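
For what it's worth, R2 itself is trivial. Assuming Text keys and
LongWritable counts (placeholders for my real types), it is essentially:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Second-pass reducer: sums the per-K2 counts emitted by R1.
    public class R2 extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text k2, Iterator<LongWritable> counts,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        long sum = 0;
        while (counts.hasNext()) {
          sum += counts.next().get();
        }
        output.collect(k2, new LongWritable(sum));
      }
    }

Since the sum is associative, the same class also serves as the combiner
for the second job, which at least trims the shuffle traffic.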

 

 

From:
mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apache.org
[mailto:mapreduce-user-return-302-Michael.Clements=disney.com@hadoop.apa
che.org] On Behalf Of Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

 

Hi,
Can you elaborate on your case a little?
If you need sort and shuffle ( ie outputs of different reducer tasks of
R1 to be aggregated in some way ) , you have to write another map-red
job. If you need to process only local reducer data ( ie your reducer
output key is same as input key ),  your job would be M1-R1-M2.
Essentially in Hadoop, you can have one sort and shuffle phase in one
job.
Note that chain APIs are for jobs of the form (M+RM*).  

Amogh


On 1/20/10 2:29 AM, "Clements, Michael" <Mi...@disney.com>
wrote:

These two classes are not really symmetric as the name suggests.
ChainedMapper does what I expected: chains multiple map steps. But
ChainedReducer does not chain reducer steps. It chains map steps to
follow a reduce step. At least, that is my understanding given the API
docs & examples I've read.

Is there a way to chain multiple reducer steps? I've got a job that
needs a M-R1-R2. It currently has 2 phases: M1-R1 followed by M2-R2,
where M2 is an identity pass-through mapper. If there were a way to
chain 2 reduce steps the way ChainedMapper chains map steps, I could
make this into a one-pass job, eliminating the overhead of a second job
and all the unnecessary I/O.

Thanks

Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile





Re: chained mappers & reducers

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
Can you elaborate on your case a little?
If you need another sort and shuffle (i.e., the outputs of different R1 reducer tasks have to be aggregated in some way), you have to write another map-reduce job. If you only need to process data local to each reducer task (i.e., your reducer's output key is the same as its input key), your job would be M1-R1-M2. Essentially, a Hadoop job can have only one sort and shuffle phase.
Note that the chain APIs are for jobs of the form (M+ R M*).
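
For example, the M1-R1-M2 shape would be configured roughly like this
(fragment against the old mapred chain API; the class names and key/value
types are placeholders for yours):

    JobConf job = new JobConf(MyDriver.class);

    // One or more map steps before the single shuffle: the M+ part
    ChainMapper.addMapper(job, M1.class,
        LongWritable.class, Text.class,     // M1 input K/V
        Text.class, LongWritable.class,     // M1 output K/V
        true, new JobConf(false));

    // Exactly one reduce step per job: this is the one sort and shuffle
    ChainReducer.setReducer(job, R1.class,
        Text.class, LongWritable.class,
        Text.class, LongWritable.class,
        true, new JobConf(false));

    // Zero or more map steps after the reduce: the M* part; each one
    // sees only the local output of the reduce task it runs inside
    ChainReducer.addMapper(job, M2.class,
        Text.class, LongWritable.class,
        Text.class, LongWritable.class,
        true, new JobConf(false));

The post-reduce mappers run inside the same reduce task, so there is no
second shuffle; that is exactly why a second reduce step cannot be
chained here.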

Amogh

