Posted to general@hadoop.apache.org by abc xyz <fa...@yahoo.com> on 2010/07/03 18:33:39 UTC

Partitioned Datasets Map/Reduce

Hello everyone,

I have written a custom partitioner for partitioning datasets. I want to
partition two datasets using the same partitioner and then, in the next
MapReduce job, have each mapper handle the same partition from both
sources and perform some operation on them, such as a join. How can I ensure
that one mapper gets the splits that correspond to the same partition from
both sources?


Any help would be highly appreciated.


      

Re: Partitioned Datasets Map/Reduce

Posted by abc xyz <fa...@yahoo.com>.

Well, I want to do some experimentation with Hadoop. I need to partition two
datasets using the same partitioning function and then, in the next job, take
the same partition from both datasets and apply some operation in the mapper.
But how do I ensure that one mapper gets the same partition from both sources?
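
To make the setup concrete, here is a minimal sketch of what I mean, in plain Java without any Hadoop classes. All names in it (CoPartitionedJoin, partitionFor, partition, joinPartition, join) are hypothetical and only for illustration, and the hash-modulo function is just a stand-in for my custom partitioner. It also assumes each key appears at most once per dataset. The point is only that when both datasets go through the same partitioning function with the same number of partitions, partition i of one dataset can only match keys in partition i of the other, so a mapper that receives both copies of partition i can join them locally. As far as I can tell, Hadoop's org.apache.hadoop.mapred.join package (CompositeInputFormat) targets this kind of map-side join, assuming both inputs are sorted by key and partitioned identically, but that should be checked against the documentation for the version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CoPartitionedJoin {

    // Hypothetical stand-in for the custom partitioner: the same key always
    // maps to the same partition index, whichever dataset it comes from.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Simulates the first job: split one dataset into numPartitions buckets.
    // TreeMap keeps each bucket sorted by key.
    static List<Map<String, String>> partition(Map<String, String> dataset, int numPartitions) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>());
        }
        for (Map.Entry<String, String> e : dataset.entrySet()) {
            parts.get(partitionFor(e.getKey(), numPartitions)).put(e.getKey(), e.getValue());
        }
        return parts;
    }

    // Simulates one mapper of the second job: it sees partition i of both
    // datasets and performs an inner join on the key.
    static Map<String, String> joinPartition(Map<String, String> left, Map<String, String> right) {
        Map<String, String> joined = new TreeMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            String other = right.get(e.getKey());
            if (other != null) {
                joined.put(e.getKey(), e.getValue() + "," + other);
            }
        }
        return joined;
    }

    // Partitions both datasets the same way, then joins matching partitions.
    static Map<String, String> join(Map<String, String> a, Map<String, String> b, int numPartitions) {
        List<Map<String, String>> partsA = partition(a, numPartitions);
        List<Map<String, String>> partsB = partition(b, numPartitions);
        Map<String, String> result = new TreeMap<>();
        for (int i = 0; i < numPartitions; i++) {
            result.putAll(joinPartition(partsA.get(i), partsB.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> users = new TreeMap<>();
        users.put("u1", "alice");
        users.put("u2", "bob");
        users.put("u3", "carol");

        Map<String, String> orders = new TreeMap<>();
        orders.put("u1", "book");
        orders.put("u3", "lamp");
        orders.put("u4", "pen");

        // Only keys present in both datasets survive the inner join.
        System.out.println(join(users, orders, 3));
    }
}
```

What I am missing is the Hadoop side of the join loop: how to make sure that the mapper handling partition i of the first dataset is also handed partition i of the second one.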



________________________________
From: Hemanth Yamijala <yh...@gmail.com>
To: general@hadoop.apache.org
Sent: Tue, July 6, 2010 5:40:49 AM
Subject: Re: Partitioned Datasets Map/Reduce

Hi,

> I have written my custom partitioner for partitioning datasets. I want to
> partition two datasets using the same partitioner and then in the next
> mapreduce job, I want each mapper to handle the same partition from the two
> sources and perform some function such as joining etc. How can I ensure that
> one mapper gets the split that corresponds to same partition from both the
> sources?
>

Not really an answer to your specific question, but have you taken a
look at Pig (http://hadoop.apache.org/pig), which is suitable for
operations like joining data sets?



      

Re: Partitioned Datasets Map/Reduce

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

> I have written my custom partitioner for partitioning datasets. I want to
> partition two datasets using the same partitioner and then in the next
> mapreduce job, I want each mapper to handle the same partition from the two
> sources and perform some function such as joining etc. How can I ensure that
> one mapper gets the split that corresponds to same partition from both the
> sources?
>

Not really an answer to your specific question, but have you taken a
look at Pig (http://hadoop.apache.org/pig), which is suitable for
operations like joining data sets?