Posted to mapreduce-user@hadoop.apache.org by "Clements, Michael" <Mi...@disney.com> on 2010/01/19 21:59:18 UTC
chained mappers & reducers
These two classes are not as symmetric as their names suggest.
ChainMapper does what I expected: it chains multiple map steps. But
ChainReducer does not chain reduce steps; it chains map steps to
follow a single reduce step. At least, that is my understanding from
the API docs and examples I've read.
Is there a way to chain multiple reduce steps? I've got a job that
needs an M-R1-R2 structure. It currently runs as two phases, M1-R1
followed by M2-R2, where M2 is an identity pass-through mapper. If
there were a way to chain two reduce steps the way ChainMapper chains
map steps, I could make this a one-pass job, eliminating the overhead
of a second job and all the unnecessary I/O.
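For context, the two-job workaround looks roughly like this with the old org.apache.hadoop.mapred API. MyDriver, M1, R1, R2 and the paths are placeholders for my own classes; only IdentityMapper is a real library class:

```java
// Job 1: M1 -> R1, writing to an intermediate path on HDFS.
JobConf job1 = new JobConf(MyDriver.class);
job1.setMapperClass(M1.class);
job1.setReducerClass(R1.class);
FileInputFormat.setInputPaths(job1, new Path("input"));
FileOutputFormat.setOutputPath(job1, new Path("intermediate"));
JobClient.runJob(job1);

// Job 2: identity map -> R2, re-reading the intermediate data --
// the extra job and I/O I would like to eliminate.
JobConf job2 = new JobConf(MyDriver.class);
job2.setMapperClass(IdentityMapper.class);
job2.setReducerClass(R2.class);
FileInputFormat.setInputPaths(job2, new Path("intermediate"));
FileOutputFormat.setOutputPath(job2, new Path("output"));
JobClient.runJob(job2);
```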
Thanks
Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile
RE: chained mappers & reducers
Posted by "Clements, Michael" <Mi...@disney.com>.
That is what I figured. I saw the "ChainReducer" class and got all
excited about chaining multiple reducers, but that is not what the
class does.
I see your point about it being impossible in general to chain
reducers. Chained mappers are fine: each one just emits key-value
pairs, and none of them needs the full global set of keys or values.
Chaining reducers is much harder, because each reducer needs the
complete set of values for each of its keys, and no single upstream
reducer instance can hold that.
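The key-redistribution point can be seen with a toy sketch in plain Java (hypothetical data, no Hadoop involved): two independent R1 tasks each emit counts for overlapping output keys, so no single R1 task holds the complete value set for any key. Merging them is exactly the work of a second sort-and-shuffle phase.

```java
import java.util.*;

public class WhyReducersDontChain {
    // Merge per-task R1 outputs. This aggregation is precisely what a
    // second shuffle performs; no single R1 task can do it on its own.
    static Map<String, Integer> globalSums(List<Map<String, Integer>> r1Outputs) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> task : r1Outputs) {
            task.forEach((k2, c) -> merged.merge(k2, c, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two independent R1 tasks (partitioned by K1).
        Map<String, Integer> taskA = Map.of("apple", 3, "pear", 1);
        Map<String, Integer> taskB = Map.of("apple", 2, "plum", 4);
        // "apple" is emitted by both tasks, so neither holds its full value set.
        System.out.println(globalSums(List.of(taskA, taskB))); // prints {apple=5, pear=1, plum=4}
    }
}
```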
Thanks
Re: chained mappers & reducers
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Unless you can somehow guarantee that a given output key comes only from a single R1 task (which seems very unlikely, and somewhat useless in your case), I'm afraid you'll need a subsequent MR job. The thing is, Hadoop has no built-in mechanism for reducers to exchange data :)
Amogh
RE: chained mappers & reducers
Posted by "Clements, Michael" <Mi...@disney.com>.
The use case is this: M1-R1-R2
M1: generate K1-V1 pairs from the input
R1: group by K1; generate new keys K2 from each group, with a value V2, a count
M2: identity pass-through
R2: sum the counts by K2
In short, R1 does this:
- groups the data by the K1 defined by M1
- emits new keys K2, derived from the group it built
- attaches a count to each key K2
R2 then sums the counts for each K2.
The output of R1 could be fed directly into R2, but I can't find a way
to do that in Hadoop. So I have to create a second job, which must
have a map phase, so I add a pass-through mapper. This works, but it
has a lot of overhead. It would be faster and cleaner to run R1
directly into R2 within the same job, if that were possible.
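For concreteness, here is what the logical pipeline computes, sketched in plain Java with a made-up K2 derivation (no Hadoop involved). The grouping that feeds r2 is the step Hadoop performs only once per job, in its single sort-and-shuffle phase:

```java
import java.util.*;
import java.util.stream.*;

public class PipelineSketch {
    // R1: take values already grouped by K1, derive a K2 and a count
    // from each group. Here K2 is (hypothetically) the group's size,
    // so the pipeline answers "how many K1 groups have N members".
    static List<Map.Entry<String, Integer>> r1(Map<String, List<Integer>> byK1) {
        return byK1.values().stream()
                .map(vals -> Map.entry("size=" + vals.size(), 1))
                .collect(Collectors.toList());
    }

    // R2: sum the counts for each K2. It needs ALL pairs carrying a
    // given K2 -- which is why a second shuffle is required.
    static Map<String, Integer> r2(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs)
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        return sums;
    }

    public static void main(String[] args) {
        // M1 output, already grouped by K1 (the job's first shuffle does this).
        Map<String, List<Integer>> m1Grouped = Map.of(
                "k1a", List.of(10, 20),
                "k1b", List.of(30),
                "k1c", List.of(40, 50));
        System.out.println(r2(r1(m1Grouped))); // prints {size=1=1, size=2=2}
    }
}
```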
Re: chained mappers & reducers
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
Can you elaborate on your case a little?
If you need a sort and shuffle (i.e., the outputs of different R1 reducer tasks must be aggregated in some way), you have to write another MapReduce job. If you only need to process data local to each reducer task (i.e., your reducer's output key is the same as its input key), your job can be M1-R1-M2. Essentially, Hadoop allows exactly one sort-and-shuffle phase per job.
Note that the chain APIs are for jobs of the form (M+RM*).
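The (M+RM*) shape is configured roughly as below with the old org.apache.hadoop.mapred.lib chain classes (AMap, BMap, XReduce, CMap and MyDriver are placeholder class names). Note that every step chained after the reducer is a Mapper, never a second Reducer:

```java
JobConf job = new JobConf(MyDriver.class);
// M+ : one or more map steps before the job's single sort and shuffle.
ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
// R : exactly one reducer -- the only shuffle in the job feeds it.
ChainReducer.setReducer(job, XReduce.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
// M* : map steps chained AFTER the reduce (this is all that
// ChainReducer.addMapper chains -- maps, not reducers).
ChainReducer.addMapper(job, CMap.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
```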
Amogh