Posted to user@pig.apache.org by Aniket Mokashi <am...@andrew.cmu.edu> on 2011/04/15 05:21:28 UTC

Filter on contents of other dataset

Hi,

What would be the best way to write this script?
I have two datasets: huge (hkey, hdata) and small (skey). I want to keep
only the rows of the huge dataset for which F(hdata, skey) is true.
Please advise.

For example,
huge = load 'mydata' as (key:chararray, value:chararray);
small = load 'smalldata' as (skey:chararray);
h_s_cross = cross huge, small;
-- CONTAINS plays the role of F here
filtered = filter h_s_cross by CONTAINS(value, skey);

Thanks,
Aniket


Re: Filter on contents of other dataset

Posted by Alan Gates <ga...@yahoo-inc.com>.
Is your comparison function an equality test, or is there some
transformation that could be applied to hdata and skey to turn it into
one? If so, you could use a semi-join instead, which should be much more
efficient.

Alan.
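
To illustrate why an equality test helps, here is a minimal sketch in plain
Java, outside Pig. It assumes a hypothetical transformation (normalize:
trim plus lower-casing) that turns F into an equality test; the class and
method names are made up for illustration. Once F is an equality, each huge
record needs one hash lookup instead of a comparison against every skey. In
Pig itself the same idea would be written as a join on the transformed key
followed by a projection of the huge-side fields.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SemiJoinSketch {
    // Hypothetical transformation that turns the comparison into an equality test.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Keeps each element of `huge` whose normalized form equals some normalized key.
    static List<String> semiJoin(List<String> huge, List<String> small) {
        Set<String> keys = new HashSet<>();
        for (String skey : small) {
            keys.add(normalize(skey));
        }
        List<String> kept = new ArrayList<>();
        for (String hdata : huge) {
            // One hash lookup per record instead of a scan over all keys.
            if (keys.contains(normalize(hdata))) {
                kept.add(hdata);
            }
        }
        return kept;
    }
}
```

For example, semiJoin applied to ["Apple ", "pear", "FIG"] and ["apple", "fig"]
keeps "Apple " and "FIG".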



Re: Filter on contents of other dataset

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
You could either distribute the small file using the distributed cache,
in which case you can use the ordinary file API to load its content, or
use the HDFS APIs directly to load it from each task. The distributed
cache usually works better, but your mileage may vary!


Regards,
Mridul
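
One common pattern for the "when does the UDF load the file" question is to
load lazily: read the file the first time a record is processed and reuse the
in-memory copy afterwards. The sketch below shows only that part in plain
Java, using the ordinary file API. It assumes the small file is already
available at a local path inside the task (for example, after being shipped
through the distributed cache) and holds one skey per line; the class name
LazySmallKeys is hypothetical. In a real Pig UDF this logic would sit inside
a class extending FilterFunc, with the path passed to the constructor.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class LazySmallKeys {
    private final String path;   // local path of the small file (assumed to exist in the task)
    private List<String> keys;   // stays null until the first call to get()

    LazySmallKeys(String path) {
        this.path = path;
    }

    // Reads the file on first use; later calls return the cached list.
    List<String> get() throws IOException {
        if (keys == null) {
            List<String> loaded = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {   // skip blank lines
                    loaded.add(trimmed);
                }
            }
            keys = loaded;
        }
        return keys;
    }
}
```

Because the list is cached in a field, the file is read once per UDF instance
rather than once per record.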

On Friday 15 April 2011 09:10 AM, Aniket Mokashi wrote:
> Thanks Mridul,
>
> For instance, let's take small to be small enough to fit in memory and
> stored in a local file (although it might grow bigger).
>
> When does my UDF load the data from the file? Earlier, I wrote a bag
> loader that returns a bag of the small data (e.g. load 'smalldata' using
> BagLoader() as (smallbag)), but then I had to write CONTAINSBAG(hdata,
> smallbag) to make this work.
>
> I think your solution would solve my problem, but how do I make my UDF
> read a file? Can you give me some pointers?
>
> Thanks,
> Aniket


Re: Filter on contents of other dataset

Posted by Aniket Mokashi <am...@andrew.cmu.edu>.
Thanks Mridul,

For instance, let's take small to be small enough to fit in memory and
stored in a local file (although it might grow bigger).

When does my UDF load the data from the file? Earlier, I wrote a bag
loader that returns a bag of the small data (e.g. load 'smalldata' using
BagLoader() as (smallbag)), but then I had to write CONTAINSBAG(hdata,
smallbag) to make this work.

I think your solution would solve my problem, but how do I make my UDF
read a file? Can you give me some pointers?

Thanks,
Aniket





Re: Filter on contents of other dataset

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
The way you described it, this does look like an application of cross.

How small is 'small'?
If it is small enough to fit in memory, you can avoid the shuffle/reduce
phase and stream huge directly through a UDF that does a task-local cross
with 'small'.


-- my_udf is a filter UDF whose constructor is given the path of the small file
define my_udf MYUDF('smalldata');

huge = load 'mydata' as (hkey:chararray, hdata:chararray);
filtered = FILTER huge BY my_udf(hkey, hdata);



Where my_udf returns true if there exists some skey in smalldata for 
which F(hdata, skey) is true - as you defined.


Regards,
Mridul
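
The check described above ("true if there exists some skey for which
F(hdata, skey) is true") can be sketched in plain Java as follows. This is
not a Pig UDF: the class name TaskLocalCross is hypothetical, F is passed in
as a BiPredicate, and the small dataset is assumed to be in memory already.
A real implementation would wrap the same loop in the exec method of a class
extending Pig's FilterFunc.

```java
import java.util.List;
import java.util.function.BiPredicate;

final class TaskLocalCross {
    private final List<String> smallKeys;            // the small dataset, held in memory
    private final BiPredicate<String, String> f;     // stands in for F(hdata, skey)

    TaskLocalCross(List<String> smallKeys, BiPredicate<String, String> f) {
        this.smallKeys = smallKeys;
        this.f = f;
    }

    // True if F(hdata, skey) holds for at least one skey; stops at the first match.
    boolean accept(String hdata) {
        for (String skey : smallKeys) {
            if (f.test(hdata, skey)) {
                return true;
            }
        }
        return false;
    }
}
```

For instance, with the small keys "cat" and "dog" and F taken to be substring
containment, accept("hotdog stand") is true and accept("bird bath") is false.
Each record costs at most one pass over the small dataset, which is why this
only pays off while 'small' stays small.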
