Posted to user@pig.apache.org by Thomas Edison <ju...@gmail.com> on 2013/05/22 02:28:50 UTC

Is there a way to do replicated inner join and filter on another column at once?

Here is a code sample:

a = load 'fact' as (dim_key:chararray, fact_value:int);
b = load 'dim';

c = join a by dim_key, b by dim_key using 'replicated';
d = filter c by fact_value > 10;

dump d;

Let's assume both c and d will filter out a lot of records.  Is there a way
these two steps can be done in one scan of the fact data, rather than two?
Or is the optimizer smart enough to figure this out and do only one scan?

Thanks.

T.E.

Re: Is there a way to do replicated inner join and filter on another column at once?

Posted by John Meagher <jo...@gmail.com>.
Filter before the join. Pig will then do both steps in a single scan, and
the join will be faster because fewer records reach it.
http://pig.apache.org/docs/r0.11.1/perf.html#filter
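
A minimal sketch of that reordering, reusing the aliases from the question.
The schema declared on 'dim' and the alias a_filtered are assumptions added
for illustration; the original sample loads 'dim' without a schema.

```pig
-- Load the large fact relation (schema as given in the question).
a = load 'fact' as (dim_key:chararray, fact_value:int);
-- Assumed schema: the original post does not declare one for 'dim'.
b = load 'dim' as (dim_key:chararray);

-- Filter the fact data before the join so fewer records reach it.
a_filtered = filter a by fact_value > 10;

-- In a replicated join the relations after the first are loaded into memory,
-- so the small relation (b) goes last.
c = join a_filtered by dim_key, b by dim_key using 'replicated';

dump c;
```

Both the filter and the replicated join run on the map side, so this should
need only one pass over 'fact'. Running "explain c;" in place of the dump
is one way to check the plan Pig actually generates.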
