Posted to common-user@hadoop.apache.org by Richa Khandelwal <ri...@gmail.com> on 2009/03/04 20:18:28 UTC

Repartitioned Joins

Hi All,
Does anyone know of a tweak to map-reduce joins that optimizes them
further by moving only those tuples to the reduce phase that actually join
across the two tables? There are replicated-join and semi-join strategies,
but they belong more to databases than to map-reduce.

Thanks,
Richa Khandelwal
University Of California,
Santa Cruz.
Ph:425-241-7763

Re: Repartitioned Joins

Posted by Aaron Kimball <aa...@cloudera.com>.

Richa,

Since the mappers run independently, you'd have a hard time
determining whether a record in mapper A would be joined by a record
in mapper B. The solution, as it were, would be to do this in two
separate MapReduce passes:

* Take an educated guess at which table is the smaller data set.
* Run a MapReduce over this dataset, building up a bloom filter for
the record ids. Set entries in the filter to 1 for each record id you
see; leave the rest as 0.
* The bloom filter now has 1 meaning "maybe joinable" and 0 meaning
"definitely not joinable."
* Run a second MapReduce job over both datasets. Use the distributed
cache to send the filter to all mappers. Mappers emit only the records
where filter[hash(record_id)] == 1, so records that cannot join never
reach the reduce phase.
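
The steps above can be sketched in plain Python as a rough, single-process
simulation. This is not Hadoop code: the BloomFilter class, build_filter,
filtered_join, the filter size (1024 bits, 3 hashes) and the sample tables
are all hypothetical names and values chosen for illustration. It also
assumes the partial filters built by independent mappers are OR-ed together
before the second pass.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash functions (illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting an MD5 digest with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # All probed bits set: "maybe joinable". Any bit 0: "definitely not".
        return all(self.bits[pos] for pos in self._positions(key))

    def union(self, other):
        # OR together partial filters built by independent mappers.
        self.bits = [a | b for a, b in zip(self.bits, other.bits)]


def build_filter(splits):
    """Pass 1: each simulated mapper builds a partial filter; merge them all."""
    merged = BloomFilter()
    for split in splits:
        partial = BloomFilter()
        for record_id, _ in split:
            partial.add(record_id)
        merged.union(partial)
    return merged


def filtered_join(small, large, bloom):
    """Pass 2: drop records the filter rules out, then join by record id."""
    shuffled = {}  # record_id -> (values from small, values from large)
    emitted = 0    # number of records that would reach the reduce phase
    for tag, table in ((0, small), (1, large)):
        for record_id, value in table:
            if bloom.might_contain(record_id):
                shuffled.setdefault(record_id, ([], []))[tag].append(value)
                emitted += 1
    joined = [
        (rid, s, l)
        for rid, (small_vals, large_vals) in sorted(shuffled.items())
        for s in small_vals
        for l in large_vals
    ]
    return joined, emitted


# Hypothetical sample data: a small "users" table and a larger "orders" table.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (2, "pen"), (1, "lamp")] + [(i, "x") for i in range(100, 200)]

bloom = build_filter([users[:1], users[1:]])  # two simulated mapper splits
joined, emitted = filtered_join(users, orders, bloom)
print(joined)
print(emitted, "of", len(users) + len(orders), "records shuffled")
```

Any false positives from the filter still travel to the reduce phase, but
they find no partner there, so the join result is unaffected; only the
amount of shuffled data changes.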

- Aaron
