Posted to user@pig.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/09/04 04:26:41 UTC

caching maps locally and joining against them.

I've been thinking more about the GROUP BY problem I talked about the other
day, specifically the fact that Pig does a bad job dealing with relations
that were already GROUPed and reGROUPS them.

I was thinking that perhaps one way to deal with this would be to support
the concept of joins against local code which implements the Map<K,V>
interface.

The way replicated joins work right now is similar to this:

> Fragment replicate join is a special type of join that works well if one or
> more relations are small enough to fit into main memory. In such cases, Pig
> can perform a very efficient join because all of the hadoop work is done on
> the map side. In this type of join the large relation is followed by one or
> more small relations. The small relations must be small enough to fit into
> main memory; if they don't, the process fails and an error is generated.


This could work much like a replicated join, except that you just page through
one side of the relation, since it is already sorted/grouped and set up for
that partition.

Any pre-sorted/grouped dataset which needs to be continually joined against
would benefit from this optimization.
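
Just to make that concrete, here is roughly what a lookup against such a
locally cached map could look like if you wrapped it in a UDF. This is only a
sketch: the class name LocalMapLookup and the loadLocalMap() helper are made
up, the value type is a String purely for illustration, and a real version
would presumably live inside Pig's join machinery rather than in user code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class LocalMapLookup extends EvalFunc<String> {
    // Cached once per task; ideally populated from the pre-grouped/sorted
    // data that belongs to this partition.
    private Map<String, String> cache;

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        if (cache == null) {
            cache = loadLocalMap();   // hand-waved: page in the local side
        }
        String key = (String) input.get(0);
        return cache.get(key);        // null when there is no match
    }

    // Placeholder for however the local, already-grouped data gets loaded.
    private Map<String, String> loadLocalMap() {
        return new HashMap<String, String>();
    }
}

You would call it from a FOREACH over the big relation, so the lookup happens
map-side, much like a fragment-replicated join, but against data that is
already laid out for the partition.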

Kevin

-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: caching maps locally and joining against them.

Posted by Daniel Dai <da...@hortonworks.com>.
The easiest way to achieve this is to write a UDF and do the second
cogroup yourself. Change your script:

foo_grouped = GROUP foo BY col_a;
both_grouped = GROUP foo_grouped BY $0, bar BY col_a;

=>

store foo_grouped into 'foo_grouped';
bar_grouped = GROUP bar BY col_a;
both_grouped = FOREACH bar_grouped GENERATE MERGE_UDF(*);

In MERGE_UDF, you will need to:
1. Find the index of the reducer. One way is through
PigMapReduce.sJobConf.get("mapred.task.id"); there might be a better way.
2. Find the part file of 'foo_grouped' corresponding to that index.
3. If the part file is small enough, you can load it into memory and
simulate a fragment-replicated cogroup; otherwise, you can do a merge
cogroup.
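
A rough skeleton of those three steps, just to show the shape. MergeUdf and
loadPartFile() are placeholder names, the part-file naming ("part-r-00003" vs
"part-00003") depends on how 'foo_grouped' was stored, and the way you get at
the job configuration (PigMapReduce.sJobConf here) varies between Pig versions:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.mapred.TaskAttemptID;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MergeUdf extends EvalFunc<Tuple> {
    // Cached side: group key -> bag of foo tuples for this reducer's partition.
    private Map<Object, DataBag> fooGrouped;

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (fooGrouped == null) {
            // 1. Recover this reducer's index from the task attempt id,
            //    e.g. "attempt_201109040426_0001_r_000003_0" -> 3.
            String attempt = PigMapReduce.sJobConf.get("mapred.task.id");
            int reducer = TaskAttemptID.forName(attempt).getTaskID().getId();

            // 2. The matching part file of 'foo_grouped'; the exact name
            //    depends on how it was stored.
            String part = String.format("foo_grouped/part-r-%05d", reducer);

            // 3. Small enough -> load it into memory; otherwise stream both
            //    sides in key order instead (a merge-style cogroup).
            fooGrouped = loadPartFile(part);
        }

        Object key = input.get(0);                 // group key from bar_grouped
        DataBag barBag = (DataBag) input.get(1);   // bag built by GROUP bar BY col_a

        Tuple out = TupleFactory.getInstance().newTuple(3);
        out.set(0, key);
        out.set(1, fooGrouped.get(key));           // may be null if no foo match
        out.set(2, barBag);
        return out;
    }

    // Placeholder: read the part file and build {col_a -> bag of foo tuples}.
    private Map<Object, DataBag> loadPartFile(String path) throws IOException {
        return new HashMap<Object, DataBag>();
    }
}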

Daniel
