You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/10/08 01:35:05 UTC

[Pig Wiki] Update of "JoinFramework" by OlgaN

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/JoinFramework

New page:
= Join Framework =

== Objective == 

This document provides a comprehensive view of performing joins in Pig. By =JOIN= here we mean traditional inner/outer =SQL= joins which in Pig are realized via =COGROUP= followed by =flatten= of the relations.

Some of the approaches described in this document can also be applied to =CROSS= and =GROUP= as well. 

== Joins ==

Currently, Pig running on top of Hadoop executes all joins in the same way. During the map stage, the data from each relation is annotated with the index of that relation. Then, the data is sorted and partitioned by the join key and provided to the reducer. This is similar to SQL's =hash join=. In the next generation Pig (currently on types branch), the data from the same relation is guaranteed to be continuous for the same key. This is to allow optimization that only keep =N-1= relations in memory. (Unfortunately, we did not see the expected speedup when this optimization was tried - investigation is still in progress.)

In some situations, more efficient join implementations can be constructed if more is known about the data of the relations. They are described in the section.

=== Pre-partitioned Join (PPJ) === 

This join type takes advantage of the fact that the data of all relations is already partition by the join key or its prefix which means that the join can be done completely independently on separate nodes. It further helps if the data is sorted on the key; otherwise it might have to get sorted before the join.

In the case of =Hadoop=, this means that the join can be done in a =Map= avoiding =SORT/SHUFFLE/REDUCE= stages. The performance would be even better if the partitions for the same key ranges were collocated on the same nodes and if the computation was scheduled to run on this nodes. However, for now this is outside of Pig's control.

Note that GROUP can take advantage of this knowledge as well.

[Discussion of different data layout options.]

=== Fragment Replicate Join (FRJ) === 

This join type takes advantage of the fact that N-1 relations in the join are very small and can fit into main memory of each node. In this case, the small tables can be copied onto all the nodes and be joined with the data from the larger table. This saves the cost of sorting and partitioning the large table. 
For Hadoop this means that the join can happen on the map side. 

The data coming out of the join is not guaranteed to be sorted on the join key which could cause problems for queries that follow join by =GROUP= or =ORDER BY= on the prefix of the join key. This should be taken into account when choosing join type.

If you have several larger tables in the join that can't fit into memory, it might be beneficial to split the join to fit FRJ pattern since it would significantly reduce the size of the data going into the next join and might even allow to use FRJ again.

Note that CROSS can take advantage of this approach as well.

=== Indexed Join (IJ) ===

This join type takes advantage of the fact that one or more tables participating in the join have index on the group by key or its prefix. This is similar in structure to FRJ join but could be even more efficient since processing time can be proportional to the size of the non-indexed and hopefully smaller table.

In Hadoop, this will also result in a map side join. 

Currently neither Pig nor Hadoop have indexing structure. So getting to this point might take some time and needs some compelling use cases to make the investment.