You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by "Ankur C. Goel" <ga...@yahoo-inc.com> on 2010/06/01 09:09:56 UTC

Re: Pig facility analogous to SQL's IN?

If data represented by relation B can fit in memory than you can simply use a "replicated" join which is inexpensive and is a map-side join.

 C = JOIN A by a2, B by b1 USING "replicated";

-@nkur


On 5/31/10 3:32 PM, "BalaSundaraRaman" <su...@yahoo.com> wrote:

Hi,

Is there any operator or UDF in Pig similar to the IN operator of SQL?
Specifically, given a large bag A and a very small single-column bag B, I want to select tuples in A with a field a1 that has its value in B.
My current method of doing it using a JOIN (below) seems very expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by b1;

It'll be very useful if such an operator is available for use in FILTER and SPLIT as well.
For example, if I need to substitute '0' when a2 is NOT IN B::b1, currently, there's no easy way, I guess.


Thanks,
Sundar (a Pig n00b)

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture

Re: Pig facility analogous to SQL's IN?

Posted by BalaSundaraRaman <su...@yahoo.com>.

Thanks for the explanation, Alan. Got it.

- Sundar

 "That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
> From: Alan Gates <ga...@yahoo-inc.com>
> To: pig-user@hadoop.apache.org
> Sent: Wed, June 2, 2010 9:35:24 PM
> Subject: Re: Pig facility analogous to SQL's IN?
> 
> The semantic of join is that all records from input 1 with a key value k will be 
> joined with all records from input 2 with that same key value.  With one 
> large input and one small input, this can be accomplished by loading the small 
> input into memory on every mapper regardless of how the large input is split 
> into maps by Map Reduce.  That is, for all the keys with value k in the 
> large input, some may be assigned to map 1 and some to map 2, and join will 
> still work.

The semantic of cogroup is that at the end of the cogroup 
> statement all keys from both inputs will be collected together into bags (one 
> for each input).  The only way to do this is in the map is to guarantee 
> that all keys with value k are in the same map.  That means that the 
> InputFormat used to split the data across maps must be aware of the values of 
> the keys and produce splits accordingly.  Zebra is the only storage format 
> I'm aware of that can do this.

All this said it would obviously be nice 
> if Pig could analyze the script and figure out whether the user truly needs this 
> stronger semantic of cogroup or whether he is just using cogroup as a join, and 
> where possible rewrite it.  But Pig's optimizer isn't there 
> yet.

Alan.


On Jun 1, 2010, at 11:13 PM, BalaSundaraRaman 
> wrote:

> Thanks Alan. I'm definitely interested in knowing why it 
> won't work in cogroup the same way.
> 
> Will try to implement the 
> IN UDF, though, I've only written simple eval udf's only so far.
> 
> 
> - Sundar
> 
> "That language is an instrument of human 
> reason, and not merely a medium for the expression of thought, is a truth 
> generally admitted."
> - George Boole, quoted in Iverson's Turing Award 
> Lecture
> 
> 
> 
> ----- Original Message 
> ----
>> From: Alan Gates <
> href="mailto:gates@yahoo-inc.com">gates@yahoo-inc.com>
>> To: 
> ymailto="mailto:pig-user@hadoop.apache.org" 
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org
>> 
> Sent: Tue, June 1, 2010 11:02:31 PM
>> Subject: Re: Pig facility 
> analogous to SQL's IN?
>> 
>> In general mapside cogroups are 
> not possible unless the underlying storage
>> mechanism can guarantee 
> that all instances of a the key you are cogrouping on
>> are in a 
> single map instance.  At this point only Zebra can guarantee
>> 
> that.  If you're interested I can give more details on why join works 
> and
>> cogroup doesn't.
> 
> You can do IN for filter 
> without needing a full mapside
>> cogroup.  You could implement 
> this via a UDF that loads the small bag into
>> a hash table and probes 
> the table for each record it is
>> passed.
> 
> 
> Alan.
> 
> On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman
>> 
> wrote:
> 
>> Thanks Ankur. But, in my actual case, it's a COGROUP 
> and not
>> a join.
>> "replicated" can't be used with COGROUP, 
> no?
>> Any work
>> around?
>> 
>> - 
> Sundar
>> 
>> "That language is an
>> instrument of 
> human reason, and not merely a medium for the expression of
>> thought, 
> is a truth generally admitted."
>> - George Boole, quoted 
> in
>> Iverson's Turing Award Lecture
>> 
>> 
> 
>> 
>> ----- Original
>> Message 
> ----
>>> From: Ankur C. Goel <
>> ymailto="mailto:
> ymailto="mailto:gankur@yahoo-inc.com" 
> href="mailto:gankur@yahoo-inc.com">gankur@yahoo-inc.com"
>> 
> href="mailto:
> href="mailto:gankur@yahoo-inc.com">gankur@yahoo-inc.com">
> ymailto="mailto:gankur@yahoo-inc.com" 
> href="mailto:gankur@yahoo-inc.com">gankur@yahoo-inc.com>
>>> 
> To:
>> "
>> href="mailto:
> ymailto="mailto:pig-user@hadoop.apache.org" 
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org">
> ymailto="mailto:pig-user@hadoop.apache.org" 
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org" 
> <
>> ymailto="mailto:
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org"
>> 
> href="mailto:
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org">
> ymailto="mailto:pig-user@hadoop.apache.org" 
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org>
>>> 
> 
>> Sent: Tue, June 1, 2010 12:39:56 PM
>>> Subject: Re: 
> Pig facility
>> analogous to SQL's IN?
>>> 
>>> 
> If data represented by relation
>> B can fit in memory than you can 
> simply use a
>>> "replicated" join
>> which is inexpensive 
> and is a map-side join.
>> 
>> C =
>> 
> JOIN
>>> A by a2, B by b1 USING "replicated";
>> 
> 
>> 
>> -@nkur
>> 
>> 
>> On 
> 5/31/10 3:32
>>> PM,
>> "BalaSundaraRaman" 
> <
>>> href="mailto:
>> ymailto="mailto:
> ymailto="mailto:sundarbecse@yahoo.com" 
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com"
>> 
> href="mailto:
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com">
> ymailto="mailto:sundarbecse@yahoo.com" 
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com">
>> 
> ymailto="mailto:
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com"
>> 
> href="mailto:
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com">
> ymailto="mailto:sundarbecse@yahoo.com" 
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com>
>>> 
> 
>> wrote:
>> 
>> Hi,
>> 
>> Is 
> there any operator or UDF in Pig
>> similar to the IN
>>> 
> operator of SQL?
>> Specifically, given a
>> large bag A and a 
> very small
>>> single-column bag B, I want to select
>> 
> tuples in A with a field a1 that has its
>>> value in B.
>> 
> My
>> current method of doing it using a JOIN (below) seems 
> very
>>> 
>> expensive.
>> grunt> A = LOAD 
> '/tmp/a.txt' USING PigStorage(',')
>> AS
>>> 
> (a1:chararray,a2:chararray);
>> grunt> B = LOAD
>> 
> '/tmp/b.txt' USING
>>> PigStorage(',') AS 
> (b1:chararray);
>> 
>> grunt> C = JOIN A by a2, B 
> by
>>> b1;
>> 
>> It'll be very
>> useful 
> if such an operator is available for use in
>>> FILTER and 
> SPLIT
>> as well.
>> For example, if I need to substitute '0' 
> when a2 is
>>> 
>> NOT IN B::b1, currently, there's no easy 
> way, I
>>> guess.
>> 
>> 
>> 
>> 
> Thanks,
>> Sundar (a Pig n00b)
>> 
>> 
> "That
>> language is an
>>> instrument of human reason, and 
> not merely a medium
>> for the expression of
>>> thought, 
> is a truth generally
>> admitted."
>> - George Boole, quoted 
> in Iverson's
>>> Turing Award
>> Lecture

Re: Pig facility analogous to SQL's IN?

Posted by Alan Gates <ga...@yahoo-inc.com>.

The semantic of join is that all records from input 1 with a key value  
k will be joined with all records from input 2 with that same key  
value.  With one large input and one small input, this can be  
accomplished by loading the small input into memory on every mapper  
regardless of how the large input is split into maps by Map Reduce.   
That is, for all the keys with value k in the large input, some may be  
assigned to map 1 and some to map 2, and join will still work.

The semantic of cogroup is that at the end of the cogroup statement  
all keys from both inputs will be collected together into bags (one  
for each input).  The only way to do this is in the map is to  
guarantee that all keys with value k are in the same map.  That means  
that the InputFormat used to split the data across maps must be aware  
of the values of the keys and produce splits accordingly.  Zebra is  
the only storage format I'm aware of that can do this.

All this said it would obviously be nice if Pig could analyze the  
script and figure out whether the user truly needs this stronger  
semantic of cogroup or whether he is just using cogroup as a join, and  
where possible rewrite it.  But Pig's optimizer isn't there yet.

Alan.


On Jun 1, 2010, at 11:13 PM, BalaSundaraRaman wrote:

> Thanks Alan. I'm definitely interested in knowing why it won't work  
> in cogroup the same way.
>
> Will try to implement the IN UDF, though, I've only written simple  
> eval udf's only so far.
>
> - Sundar
>
> "That language is an instrument of human reason, and not merely a  
> medium for the expression of thought, is a truth generally admitted."
> - George Boole, quoted in Iverson's Turing Award Lecture
>
>
>
> ----- Original Message ----
>> From: Alan Gates <ga...@yahoo-inc.com>
>> To: pig-user@hadoop.apache.org
>> Sent: Tue, June 1, 2010 11:02:31 PM
>> Subject: Re: Pig facility analogous to SQL's IN?
>>
>> In general mapside cogroups are not possible unless the underlying  
>> storage
>> mechanism can guarantee that all instances of a the key you are  
>> cogrouping on
>> are in a single map instance.  At this point only Zebra can guarantee
>> that.  If you're interested I can give more details on why join  
>> works and
>> cogroup doesn't.
>
> You can do IN for filter without needing a full mapside
>> cogroup.  You could implement this via a UDF that loads the small  
>> bag into
>> a hash table and probes the table for each record it is
>> passed.
>
> Alan.
>
> On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman
>> wrote:
>
>> Thanks Ankur. But, in my actual case, it's a COGROUP and not
>> a join.
>> "replicated" can't be used with COGROUP, no?
>> Any work
>> around?
>>
>> - Sundar
>>
>> "That language is an
>> instrument of human reason, and not merely a medium for the  
>> expression of
>> thought, is a truth generally admitted."
>> - George Boole, quoted in
>> Iverson's Turing Award Lecture
>>
>>
>>
>> ----- Original
>> Message ----
>>> From: Ankur C. Goel <
>> ymailto="mailto:gankur@yahoo-inc.com"
>> href="mailto:gankur@yahoo-inc.com">gankur@yahoo-inc.com>
>>> To:
>> "
>> href="mailto:pig-user@hadoop.apache.org">pig- 
>> user@hadoop.apache.org" <
>> ymailto="mailto:pig-user@hadoop.apache.org"
>> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org>
>>>
>> Sent: Tue, June 1, 2010 12:39:56 PM
>>> Subject: Re: Pig facility
>> analogous to SQL's IN?
>>>
>>> If data represented by relation
>> B can fit in memory than you can simply use a
>>> "replicated" join
>> which is inexpensive and is a map-side join.
>>
>> C =
>> JOIN
>>> A by a2, B by b1 USING "replicated";
>>
>>
>> -@nkur
>>
>>
>> On 5/31/10 3:32
>>> PM,
>> "BalaSundaraRaman" <
>>> href="mailto:
>> ymailto="mailto:sundarbecse@yahoo.com"
>> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com">
>> ymailto="mailto:sundarbecse@yahoo.com"
>> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com>
>>>
>> wrote:
>>
>> Hi,
>>
>> Is there any operator or UDF in Pig
>> similar to the IN
>>> operator of SQL?
>> Specifically, given a
>> large bag A and a very small
>>> single-column bag B, I want to select
>> tuples in A with a field a1 that has its
>>> value in B.
>> My
>> current method of doing it using a JOIN (below) seems very
>>>
>> expensive.
>> grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',')
>> AS
>>> (a1:chararray,a2:chararray);
>> grunt> B = LOAD
>> '/tmp/b.txt' USING
>>> PigStorage(',') AS (b1:chararray);
>>
>> grunt> C = JOIN A by a2, B by
>>> b1;
>>
>> It'll be very
>> useful if such an operator is available for use in
>>> FILTER and SPLIT
>> as well.
>> For example, if I need to substitute '0' when a2 is
>>>
>> NOT IN B::b1, currently, there's no easy way, I
>>> guess.
>>
>>
>>
>> Thanks,
>> Sundar (a Pig n00b)
>>
>> "That
>> language is an
>>> instrument of human reason, and not merely a medium
>> for the expression of
>>> thought, is a truth generally
>> admitted."
>> - George Boole, quoted in Iverson's
>>> Turing Award
>> Lecture

Re: Pig facility analogous to SQL's IN?

Posted by BalaSundaraRaman <su...@yahoo.com>.

Thanks Alan. I'm definitely interested in knowing why it won't work in cogroup the same way.

Will try to implement the IN UDF, though, I've only written simple eval udf's only so far.

- Sundar

 "That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
> From: Alan Gates <ga...@yahoo-inc.com>
> To: pig-user@hadoop.apache.org
> Sent: Tue, June 1, 2010 11:02:31 PM
> Subject: Re: Pig facility analogous to SQL's IN?
> 
> In general mapside cogroups are not possible unless the underlying storage 
> mechanism can guarantee that all instances of a the key you are cogrouping on 
> are in a single map instance.  At this point only Zebra can guarantee 
> that.  If you're interested I can give more details on why join works and 
> cogroup doesn't.

You can do IN for filter without needing a full mapside 
> cogroup.  You could implement this via a UDF that loads the small bag into 
> a hash table and probes the table for each record it is 
> passed.

Alan.

On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman 
> wrote:

> Thanks Ankur. But, in my actual case, it's a COGROUP and not 
> a join.
> "replicated" can't be used with COGROUP, no?
> Any work 
> around?
> 
> - Sundar
> 
> "That language is an 
> instrument of human reason, and not merely a medium for the expression of 
> thought, is a truth generally admitted."
> - George Boole, quoted in 
> Iverson's Turing Award Lecture
> 
> 
> 
> ----- Original 
> Message ----
>> From: Ankur C. Goel <
> ymailto="mailto:gankur@yahoo-inc.com" 
> href="mailto:gankur@yahoo-inc.com">gankur@yahoo-inc.com>
>> To: 
> "
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org" <
> ymailto="mailto:pig-user@hadoop.apache.org" 
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org>
>> 
> Sent: Tue, June 1, 2010 12:39:56 PM
>> Subject: Re: Pig facility 
> analogous to SQL's IN?
>> 
>> If data represented by relation 
> B can fit in memory than you can simply use a
>> "replicated" join 
> which is inexpensive and is a map-side join.
> 
> C = 
> JOIN
>> A by a2, B by b1 USING "replicated";
> 
> 
> -@nkur
> 
> 
> On 5/31/10 3:32
>> PM, 
> "BalaSundaraRaman" <
>> href="mailto:
> ymailto="mailto:sundarbecse@yahoo.com" 
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com">
> ymailto="mailto:sundarbecse@yahoo.com" 
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com>
>> 
> wrote:
> 
> Hi,
> 
> Is there any operator or UDF in Pig 
> similar to the IN
>> operator of SQL?
> Specifically, given a 
> large bag A and a very small
>> single-column bag B, I want to select 
> tuples in A with a field a1 that has its
>> value in B.
> My 
> current method of doing it using a JOIN (below) seems very
>> 
> expensive.
> grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') 
> AS
>> (a1:chararray,a2:chararray);
> grunt> B = LOAD 
> '/tmp/b.txt' USING
>> PigStorage(',') AS (b1:chararray);
> 
> grunt> C = JOIN A by a2, B by
>> b1;
> 
> It'll be very 
> useful if such an operator is available for use in
>> FILTER and SPLIT 
> as well.
> For example, if I need to substitute '0' when a2 is
>> 
> NOT IN B::b1, currently, there's no easy way, I
>> guess.
> 
> 
> 
> Thanks,
> Sundar (a Pig n00b)
> 
> "That 
> language is an
>> instrument of human reason, and not merely a medium 
> for the expression of
>> thought, is a truth generally 
> admitted."
> - George Boole, quoted in Iverson's
>> Turing Award 
> Lecture

Re: Pig facility analogous to SQL's IN?

Posted by Alan Gates <ga...@yahoo-inc.com>.

In general mapside cogroups are not possible unless the underlying  
storage mechanism can guarantee that all instances of a the key you  
are cogrouping on are in a single map instance.  At this point only  
Zebra can guarantee that.  If you're interested I can give more  
details on why join works and cogroup doesn't.

You can do IN for filter without needing a full mapside cogroup.  You  
could implement this via a UDF that loads the small bag into a hash  
table and probes the table for each record it is passed.

Alan.

On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman wrote:

> Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
> "replicated" can't be used with COGROUP, no?
> Any work around?
>
> - Sundar
>
> "That language is an instrument of human reason, and not merely a  
> medium for the expression of thought, is a truth generally admitted."
> - George Boole, quoted in Iverson's Turing Award Lecture
>
>
>
> ----- Original Message ----
>> From: Ankur C. Goel <ga...@yahoo-inc.com>
>> To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
>> Sent: Tue, June 1, 2010 12:39:56 PM
>> Subject: Re: Pig facility analogous to SQL's IN?
>>
>> If data represented by relation B can fit in memory than you can  
>> simply use a
>> "replicated" join which is inexpensive and is a map-side join.
>
> C = JOIN
>> A by a2, B by b1 USING "replicated";
>
> -@nkur
>
>
> On 5/31/10 3:32
>> PM, "BalaSundaraRaman" <
>> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com>
>> wrote:
>
> Hi,
>
> Is there any operator or UDF in Pig similar to the IN
>> operator of SQL?
> Specifically, given a large bag A and a very small
>> single-column bag B, I want to select tuples in A with a field a1  
>> that has its
>> value in B.
> My current method of doing it using a JOIN (below) seems very
>> expensive.
> grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS
>> (a1:chararray,a2:chararray);
> grunt> B = LOAD '/tmp/b.txt' USING
>> PigStorage(',') AS (b1:chararray);
> grunt> C = JOIN A by a2, B by
>> b1;
>
> It'll be very useful if such an operator is available for use in
>> FILTER and SPLIT as well.
> For example, if I need to substitute '0' when a2 is
>> NOT IN B::b1, currently, there's no easy way, I
>> guess.
>
>
> Thanks,
> Sundar (a Pig n00b)
>
> "That language is an
>> instrument of human reason, and not merely a medium for the  
>> expression of
>> thought, is a truth generally admitted."
> - George Boole, quoted in Iverson's
>> Turing Award Lecture

Re: Pig facility analogous to SQL's IN?

Posted by BalaSundaraRaman <su...@yahoo.com>.

Will try, Ankur. Thanks.

- Sundar

 "That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
> From: Ankur C. Goel <ga...@yahoo-inc.com>
> To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
> Sent: Wed, June 2, 2010 4:58:26 PM
> Subject: Re: Pig facility analogous to SQL's IN?
> 
> For the case you described, you can do a right outer replicated join followed by 
> a projection to substitute '0' for missing values.

-@nkur


On 
> 6/1/10 1:15 PM, "BalaSundaraRaman" <
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com> 
> wrote:

Thanks Ankur. But, in my actual case, it's a COGROUP and not a 
> join.
"replicated" can't be used with COGROUP, no?
Any work 
> around?

- Sundar

"That language is an instrument of human reason, 
> and not merely a medium for the expression of thought, is a truth generally 
> admitted."
- George Boole, quoted in Iverson's Turing Award 
> Lecture



----- Original Message ----
> From: Ankur C. Goel 
> <
> href="mailto:gankur@yahoo-inc.com">gankur@yahoo-inc.com>
> To: "
> ymailto="mailto:pig-user@hadoop.apache.org" 
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org" <
> ymailto="mailto:pig-user@hadoop.apache.org" 
> href="mailto:pig-user@hadoop.apache.org">pig-user@hadoop.apache.org>
> 
> Sent: Tue, June 1, 2010 12:39:56 PM
> Subject: Re: Pig facility analogous 
> to SQL's IN?
>
> If data represented by relation B can fit in memory 
> than you can simply use a
> "replicated" join which is inexpensive and is 
> a map-side join.

C = JOIN
> A by a2, B by b1 USING 
> "replicated";

-@nkur


On 5/31/10 3:32
> PM, 
> "BalaSundaraRaman" <
> href="mailto:
> ymailto="mailto:sundarbecse@yahoo.com" 
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com">
> ymailto="mailto:sundarbecse@yahoo.com" 
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com>
> 
> wrote:

Hi,

Is there any operator or UDF in Pig similar to the 
> IN
> operator of SQL?
Specifically, given a large bag A and a very 
> small
> single-column bag B, I want to select tuples in A with a field a1 
> that has its
> value in B.
My current method of doing it using a JOIN 
> (below) seems very
> expensive.
grunt> A = LOAD '/tmp/a.txt' USING 
> PigStorage(',') AS
> (a1:chararray,a2:chararray);
grunt> B = LOAD 
> '/tmp/b.txt' USING
> PigStorage(',') AS (b1:chararray);
grunt> C = 
> JOIN A by a2, B by
> b1;

It'll be very useful if such an operator 
> is available for use in
> FILTER and SPLIT as well.
For example, if I 
> need to substitute '0' when a2 is
> NOT IN B::b1, currently, there's no 
> easy way, I
> guess.


Thanks,
Sundar (a Pig 
> n00b)

"That language is an
> instrument of human reason, and not 
> merely a medium for the expression of
> thought, is a truth generally 
> admitted."
- George Boole, quoted in Iverson's
> Turing Award 
> Lecture

Re: Pig facility analogous to SQL's IN?

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.

For the case you described, you can do a right outer replicated join followed by a projection to substitute '0' for missing values.

-@nkur


On 6/1/10 1:15 PM, "BalaSundaraRaman" <su...@yahoo.com> wrote:

Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
"replicated" can't be used with COGROUP, no?
Any work around?

- Sundar

 "That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
> From: Ankur C. Goel <ga...@yahoo-inc.com>
> To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
> Sent: Tue, June 1, 2010 12:39:56 PM
> Subject: Re: Pig facility analogous to SQL's IN?
>
> If data represented by relation B can fit in memory than you can simply use a
> "replicated" join which is inexpensive and is a map-side join.

C = JOIN
> A by a2, B by b1 USING "replicated";

-@nkur


On 5/31/10 3:32
> PM, "BalaSundaraRaman" <
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com>
> wrote:

Hi,

Is there any operator or UDF in Pig similar to the IN
> operator of SQL?
Specifically, given a large bag A and a very small
> single-column bag B, I want to select tuples in A with a field a1 that has its
> value in B.
My current method of doing it using a JOIN (below) seems very
> expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS
> (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING
> PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by
> b1;

It'll be very useful if such an operator is available for use in
> FILTER and SPLIT as well.
For example, if I need to substitute '0' when a2 is
> NOT IN B::b1, currently, there's no easy way, I
> guess.


Thanks,
Sundar (a Pig n00b)

"That language is an
> instrument of human reason, and not merely a medium for the expression of
> thought, is a truth generally admitted."
- George Boole, quoted in Iverson's
> Turing Award Lecture

Re: Pig facility analogous to SQL's IN?

Posted by BalaSundaraRaman <su...@yahoo.com>.

Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
"replicated" can't be used with COGROUP, no?
Any work around?

- Sundar

 "That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
> From: Ankur C. Goel <ga...@yahoo-inc.com>
> To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
> Sent: Tue, June 1, 2010 12:39:56 PM
> Subject: Re: Pig facility analogous to SQL's IN?
> 
> If data represented by relation B can fit in memory than you can simply use a 
> "replicated" join which is inexpensive and is a map-side join.

C = JOIN 
> A by a2, B by b1 USING "replicated";

-@nkur


On 5/31/10 3:32 
> PM, "BalaSundaraRaman" <
> href="mailto:sundarbecse@yahoo.com">sundarbecse@yahoo.com> 
> wrote:

Hi,

Is there any operator or UDF in Pig similar to the IN 
> operator of SQL?
Specifically, given a large bag A and a very small 
> single-column bag B, I want to select tuples in A with a field a1 that has its 
> value in B.
My current method of doing it using a JOIN (below) seems very 
> expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS 
> (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING 
> PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by 
> b1;

It'll be very useful if such an operator is available for use in 
> FILTER and SPLIT as well.
For example, if I need to substitute '0' when a2 is 
> NOT IN B::b1, currently, there's no easy way, I 
> guess.


Thanks,
Sundar (a Pig n00b)

"That language is an 
> instrument of human reason, and not merely a medium for the expression of 
> thought, is a truth generally admitted."
- George Boole, quoted in Iverson's 
> Turing Award Lecture