Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2011/06/30 03:05:25 UTC

Is there a way to ensure that different jobs have the same number of reducers

I am trying to run an application that generates the Cartesian product of two
potentially large data sets. In reality I only need the Cartesian product of
the values that share a particular integer key. I am considering a design
where, in the first job, the mappers run through the values of set A, emitting
that integer as the key and the item as the value. The reducers are simple
identity reducers.
In the second job the mappers run through set B, again emitting the integer as
the key and the item as the value. The reducers read the output of the first
job in order to run through the matching values of A.
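
Here is a rough sketch of the kind of driver I have in mind. The mapper
classes (SetAMapper, SetBMapper) and CrossProductReducer are placeholders for
my own code, and I am assuming the default TextOutputFormat for job 1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class CartesianDriver {
    static final int NUM_REDUCERS = 64;   // any fixed value, identical in both jobs

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path setAInput   = new Path(args[0]);
        Path setBInput   = new Path(args[1]);
        Path setAOutput  = new Path(args[2], "setA");
        Path crossOutput = new Path(args[2], "cross");

        // Job 1: pass set A through untouched, keyed by the join integer.
        Job jobA = new Job(conf, "set-a-by-key");
        jobA.setJarByClass(CartesianDriver.class);
        jobA.setMapperClass(SetAMapper.class);            // placeholder: emits (IntWritable, Text)
        jobA.setReducerClass(Reducer.class);              // base Reducer = identity
        jobA.setPartitionerClass(HashPartitioner.class);  // the default, stated explicitly
        jobA.setNumReduceTasks(NUM_REDUCERS);
        jobA.setOutputKeyClass(IntWritable.class);
        jobA.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(jobA, setAInput);
        FileOutputFormat.setOutputPath(jobA, setAOutput);
        if (!jobA.waitForCompletion(true)) System.exit(1);

        // Job 2: same partitioner and same reducer count, so key k goes to
        // the same partition number it went to in job 1.
        Job jobB = new Job(conf, "set-b-cross-a");
        jobB.getConfiguration().set("job1.output.dir", setAOutput.toString());
        jobB.setJarByClass(CartesianDriver.class);
        jobB.setMapperClass(SetBMapper.class);            // placeholder: emits (IntWritable, Text)
        jobB.setReducerClass(CrossProductReducer.class);  // placeholder: reads setA/part-r-NNNNN
        jobB.setPartitionerClass(HashPartitioner.class);
        jobB.setNumReduceTasks(NUM_REDUCERS);
        jobB.setMapOutputKeyClass(IntWritable.class);
        jobB.setMapOutputValueClass(Text.class);
        jobB.setOutputKeyClass(Text.class);
        jobB.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(jobB, setBInput);
        FileOutputFormat.setOutputPath(jobB, crossOutput);
        System.exit(jobB.waitForCompletion(true) ? 0 : 1);
    }
}

The point is that both jobs set the partitioner and the reducer count
explicitly rather than relying on whatever the cluster defaults happen to be.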
One issue: assuming the same hashing partitioner is used and there are the
same number of reducers, a specific reducer, say reducer 12, will receive the
same keys in both jobs, and thus part-r-00012 from the first job is the only
file reducer 12 will need to read.
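
For concreteness, this is roughly how I imagine the second job's reducer
finding its matching file. The property name job1.output.dir is just my own,
and I am assuming job 1 wrote tab-separated text:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossProductReducer extends Reducer<IntWritable, Text, Text, Text> {
    private Path myPartFromJob1;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // The id of this reduce task is its partition number.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        myPartFromJob1 = new Path(conf.get("job1.output.dir"),
                                  String.format("part-r-%05d", partition));
    }

    @Override
    protected void reduce(IntWritable key, Iterable<Text> valuesFromB, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = myPartFromJob1.getFileSystem(context.getConfiguration());
        for (Text b : valuesFromB) {
            // Scan the matching job-1 file for lines with the same key and
            // emit one (a, b) pair of the product for each match.
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(myPartFromJob1)));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\t", 2);
                    if (fields.length == 2 && fields[0].equals(key.toString())) {
                        context.write(new Text(fields[1]), b);
                    }
                }
            } finally {
                reader.close();
            }
        }
    }
}

Re-reading the file for every value of B is obviously wasteful; a real version
would cache or index it, but it shows the pairing I am after.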
Can I guarantee (without restricting the number of reducers to fewer than the
cluster will support) that this condition is met, namely that the keys in the
second job hit the same reducer number as in the first job? What about
restarts and failures?
BTW, is there any way to find out the size of a cluster?
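
I have been looking at the older mapred JobClient API for this, along these
lines (I have not verified it end to end):

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterSize {
    public static void main(String[] args) throws Exception {
        // Needs the cluster configuration on the classpath to reach the JobTracker.
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();
        System.out.println("Task trackers:    " + status.getTaskTrackers());
        System.out.println("Max map slots:    " + status.getMaxMapTasks());
        System.out.println("Max reduce slots: " + status.getMaxReduceTasks());
    }
}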

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Is there a way to ensure that different jobs have the same number of reducers

Posted by Trevor Adams <tr...@gmail.com>.
The exact same bucket is possible; the exact same machine (if that is what you
had in mind) probably is not. The partitioner breaks the data up for the
reducers, so keys that map to the same partition will be handled by the same
reducer. If you can partition the data so that the output of one reducer maps
to a single bucket and is not split, then you can get all of that data going
to one reducer. Doing it this way means there needs to be some property that
carries over from the step-1 reducer through the step-2 mapper; most cases, I
would assume, do not have that property.
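
For reference, the stock HashPartitioner computes the partition roughly like
this (paraphrasing; check your Hadoop version's source for the exact code),
so the bucket depends only on the key's hashCode and the number of reduce
tasks:

import org.apache.hadoop.mapreduce.Partitioner;

public class ExplicitHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Same key + same numReduceTasks => same partition index in both jobs.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

As long as both jobs use this partitioner and the same setNumReduceTasks
value, a given key lands in the same partition number in both jobs; a retried
reduce attempt keeps its partition number, so restarts do not change which
part-r-NNNNN file a key ends up in.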

-Trevor
