Posted to user@pig.apache.org by "Ankur C. Goel" <ga...@yahoo-inc.com> on 2010/06/01 09:09:56 UTC
Re: Pig facility analogous to SQL's IN?
If the data represented by relation B can fit in memory, then you can simply use a "replicated" join, which is an inexpensive map-side join.
C = JOIN A by a2, B by b1 USING "replicated";
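To illustrate what a map-side "replicated" join does, here is a minimal plain-Python model. It is not Pig code and not Pig's actual implementation: the small relation is loaded into an in-memory hash table once, and each tuple of the large relation probes it. The function name `replicated_join` and the sample data are hypothetical.

```python
from collections import defaultdict

def replicated_join(large, small, large_key, small_key):
    """Model of a map-side hash join: index the small input, stream the large one."""
    # Build the in-memory index from the small relation
    # (this is why the small relation must fit in memory).
    index = defaultdict(list)
    for row in small:
        index[row[small_key]].append(row)
    # Stream the large relation and emit one joined tuple per match.
    joined = []
    for row in large:
        for match in index.get(row[large_key], []):
            joined.append(row + match)
    return joined

# Hypothetical sample data shaped like A (a1, a2) and B (b1).
A = [("x", "k1"), ("y", "k2"), ("z", "k3")]
B = [("k1",), ("k3",)]
C = replicated_join(A, B, large_key=1, small_key=0)
print(C)
```

Note that a join emits one output tuple per matching pair, so duplicate values in B would duplicate tuples of A, which differs from SQL's IN.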
-@nkur
On 5/31/10 3:32 PM, "BalaSundaraRaman" <su...@yahoo.com> wrote:
Hi,
Is there any operator or UDF in Pig similar to the IN operator of SQL?
Specifically, given a large bag A and a very small single-column bag B, I want to select tuples in A with a field a1 that has its value in B.
My current method of doing it using a JOIN (below) seems very expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by b1;
It'll be very useful if such an operator is available for use in FILTER and SPLIT as well.
For example, if I need to substitute '0' when a2 is NOT IN B::b1, currently, there's no easy way, I guess.
Thanks,
Sundar (a Pig n00b)
"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture
Re: Pig facility analogous to SQL's IN?
Posted by BalaSundaraRaman <su...@yahoo.com>.
Thanks for the explanation, Alan. Got it.
- Sundar
----- Original Message ----
> From: Alan Gates <ga...@yahoo-inc.com>
> To: pig-user@hadoop.apache.org
> Sent: Wed, June 2, 2010 9:35:24 PM
> Subject: Re: Pig facility analogous to SQL's IN?
>
> The semantic of join is that all records from input 1 with a key value k will be
> joined with all records from input 2 with that same key value. With one
> large input and one small input, this can be accomplished by loading the small
> input into memory on every mapper regardless of how the large input is split
> into maps by Map Reduce. That is, for all the keys with value k in the
> large input, some may be assigned to map 1 and some to map 2, and join will
> still work.
>
> The semantic of cogroup is that at the end of the cogroup
> statement all keys from both inputs will be collected together into bags (one
> for each input). The only way to do this in the map is to guarantee
> that all keys with value k are in the same map. That means that the
> InputFormat used to split the data across maps must be aware of the values of
> the keys and produce splits accordingly. Zebra is the only storage format
> I'm aware of that can do this.
>
> All this said, it would obviously be nice
> if Pig could analyze the script and figure out whether the user truly needs this
> stronger semantic of cogroup or whether he is just using cogroup as a join, and
> where possible rewrite it. But Pig's optimizer isn't there yet.
>
> Alan.
Re: Pig facility analogous to SQL's IN?
Posted by Alan Gates <ga...@yahoo-inc.com>.
The semantic of join is that all records from input 1 with a key value
k will be joined with all records from input 2 with that same key
value. With one large input and one small input, this can be
accomplished by loading the small input into memory on every mapper
regardless of how the large input is split into maps by Map Reduce.
That is, for all the keys with value k in the large input, some may be
assigned to map 1 and some to map 2, and join will still work.
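A small plain-Python model of this point follows. It is a sketch, not Pig's implementation, and `join_split` and the data are made up for illustration. Each mapper joins only its own split of the large input against a full in-memory copy of the small input, and concatenating the per-split outputs gives the same result as joining the whole input at once.

```python
def join_split(split, small_index):
    """Join one split of the large input against the in-memory small input."""
    return [(key, value, other)
            for key, value in split
            for other in small_index.get(key, [])]

# Small input, loaded into memory on every mapper: key -> list of values.
small_index = {"k": ["s1", "s2"], "j": ["s3"]}

# Large input; records with key "k" end up in two different splits.
large = [("k", "a"), ("j", "b"), ("k", "c"), ("m", "d")]
split_1, split_2 = large[:2], large[2:]

per_split = join_split(split_1, small_index) + join_split(split_2, small_index)
whole = join_split(large, small_index)
print(sorted(per_split) == sorted(whole))
```

Because every output tuple depends on exactly one record of the large input, it does not matter which mapper that record lands on.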
The semantic of cogroup is that at the end of the cogroup statement
all keys from both inputs will be collected together into bags (one
for each input). The only way to do this in the map is to
guarantee that all keys with value k are in the same map. That means
that the InputFormat used to split the data across maps must be aware
of the values of the keys and produce splits accordingly. Zebra is
the only storage format I'm aware of that can do this.
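A plain-Python model shows why cogroup is different. This is a sketch with hypothetical names and data, not Pig's implementation. If records with the same key sit in different splits, each mapper can only build a partial bag, so no single mapper produces the complete group.

```python
from collections import defaultdict

def cogroup_split(split, small):
    """Cogroup one split of the large input with the small input, inside one mapper."""
    groups = defaultdict(lambda: ([], []))
    for key, value in split:
        groups[key][0].append(value)   # bag for the large input
    for key, value in small:
        groups[key][1].append(value)   # bag for the small input
    return dict(groups)

small = [("k", "s1")]
large = [("k", "a"), ("j", "b"), ("k", "c")]
split_1, split_2 = large[:2], large[2:]

# Expected semantics: one group for "k" holding both "a" and "c".
print(cogroup_split(large, small)["k"])
# Per-split results: key "k" shows up in both splits, each time with an incomplete bag.
print(cogroup_split(split_1, small)["k"])
print(cogroup_split(split_2, small)["k"])
```

Getting the complete bag in one place requires either a shuffle to a reducer or splits that already keep all records for a key together.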
All this said, it would obviously be nice if Pig could analyze the
script and figure out whether the user truly needs this stronger
semantic of cogroup or whether he is just using cogroup as a join, and
where possible rewrite it. But Pig's optimizer isn't there yet.
Alan.
On Jun 1, 2010, at 11:13 PM, BalaSundaraRaman wrote:
> Thanks Alan. I'm definitely interested in knowing why it won't work
> in cogroup the same way.
>
> Will try to implement the IN UDF, though I've only written simple
> eval UDFs so far.
Re: Pig facility analogous to SQL's IN?
Posted by BalaSundaraRaman <su...@yahoo.com>.
Thanks Alan. I'm definitely interested in knowing why it won't work in cogroup the same way.
Will try to implement the IN UDF, though I've only written simple eval UDFs so far.
- Sundar
----- Original Message ----
> From: Alan Gates <ga...@yahoo-inc.com>
> To: pig-user@hadoop.apache.org
> Sent: Tue, June 1, 2010 11:02:31 PM
> Subject: Re: Pig facility analogous to SQL's IN?
>
> In general map-side cogroups are not possible unless the underlying storage
> mechanism can guarantee that all instances of the key you are cogrouping on
> are in a single map instance. At this point only Zebra can guarantee
> that. If you're interested I can give more details on why join works and
> cogroup doesn't.
>
> You can do IN for filter without needing a full map-side
> cogroup. You could implement this via a UDF that loads the small bag into
> a hash table and probes the table for each record it is
> passed.
>
> Alan.
Re: Pig facility analogous to SQL's IN?
Posted by Alan Gates <ga...@yahoo-inc.com>.
In general map-side cogroups are not possible unless the underlying
storage mechanism can guarantee that all instances of the key you
are cogrouping on are in a single map instance. At this point only
Zebra can guarantee that. If you're interested I can give more
details on why join works and cogroup doesn't.
You can do IN for filter without needing a full map-side cogroup. You
could implement this via a UDF that loads the small bag into a hash
table and probes the table for each record it is passed.
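The hash-table idea can be modelled in a few lines of plain Python. This is only a sketch of the logic, not a Pig UDF: a real implementation would presumably be a Java filter function written against Pig's UDF interfaces, and it would need some way to read the small relation on each task, which is not shown here. The class name `InFilter` and the sample data are hypothetical.

```python
class InFilter:
    """Model of an IN-style filter: load the small bag once, then probe per record."""

    def __init__(self, small_values):
        # Built once; a set gives constant-time membership tests on average.
        self.lookup = set(small_values)

    def __call__(self, value):
        return value in self.lookup

# Hypothetical data shaped like A (a1, a2) and the single column of B.
A = [("x", "k1"), ("y", "k2"), ("z", "k3")]
B = ["k1", "k3", "k1"]   # duplicates in B are harmless for a membership test

is_in_b = InFilter(B)
kept = [row for row in A if is_in_b(row[1])]          # a2 IN B
dropped = [row for row in A if not is_in_b(row[1])]   # a2 NOT IN B
print(kept, dropped)
```

Unlike a join, a membership probe passes each tuple of A through at most once, which matches the semantics of SQL's IN and also gives NOT IN for free.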
Alan.
On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman wrote:
> Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
> "replicated" can't be used with COGROUP, no?
> Any workaround?
Re: Pig facility analogous to SQL's IN?
Posted by BalaSundaraRaman <su...@yahoo.com>.
Will try, Ankur. Thanks.
- Sundar
----- Original Message ----
> From: Ankur C. Goel <ga...@yahoo-inc.com>
> To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
> Sent: Wed, June 2, 2010 4:58:26 PM
> Subject: Re: Pig facility analogous to SQL's IN?
>
> For the case you described, you can do a right outer replicated join followed by
> a projection to substitute '0' for missing values.
>
> -@nkur
Re: Pig facility analogous to SQL's IN?
Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
For the case you described, you can do a right outer replicated join followed by a projection to substitute '0' for missing values.
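As a rough plain-Python model of this suggestion (not Pig code; the function name and data are hypothetical), the outer join keeps every tuple of A whether or not its a2 has a match in B, and the projection then replaces a2 with '0' wherever the B side came back empty. The sketch assumes the values in B are distinct, so no tuple of A is duplicated by the join.

```python
def outer_join_then_project(A, B):
    """Keep every A tuple; substitute '0' for a2 when it has no match in B."""
    b_values = {b1 for (b1,) in B}   # small relation held in memory
    result = []
    for a1, a2 in A:
        # None models the null that an outer join yields for an unmatched tuple.
        matched_b1 = a2 if a2 in b_values else None
        # Projection step: keep a2 when matched, emit '0' otherwise.
        result.append((a1, a2 if matched_b1 is not None else "0"))
    return result

A = [("x", "k1"), ("y", "k2"), ("z", "k3")]
B = [("k1",), ("k3",)]
print(outer_join_then_project(A, B))
```

Whether the join is written as left or right outer depends on which side the large relation is listed on; the point of the model is only that the outer side must be A.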
-@nkur
On 6/1/10 1:15 PM, "BalaSundaraRaman" <su...@yahoo.com> wrote:
Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
"replicated" can't be used with COGROUP, no?
Any workaround?
Re: Pig facility analogous to SQL's IN?
Posted by BalaSundaraRaman <su...@yahoo.com>.
Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
"replicated" can't be used with COGROUP, no?
Any workaround?
- Sundar
----- Original Message ----
> From: Ankur C. Goel <ga...@yahoo-inc.com>
> To: "pig-user@hadoop.apache.org" <pi...@hadoop.apache.org>
> Sent: Tue, June 1, 2010 12:39:56 PM
> Subject: Re: Pig facility analogous to SQL's IN?
>
> If the data represented by relation B can fit in memory, then you can simply use a
> "replicated" join, which is an inexpensive map-side join.
>
> C = JOIN A by a2, B by b1 USING "replicated";
>
> -@nkur