Posted to user@pig.apache.org by Alexander Schätzle <al...@yahoo.com> on 2010/06/08 13:45:53 UTC

Behavior of JOIN

Hi all,

the JOIN operator of Pig produces duplicate columns in its output.
Let's say the statement is like this:

C = JOIN A BY (var1, var2), B BY (var1, var2);

Then C contains var1 and var2 twice (once for each input relation), with the same content in both copies.
This is not what a user usually expects from a join.
Why does Pig produce such redundant entries?
To get rid of them you have to add a FOREACH projection.
Otherwise you shuffle unnecessary data through the MR phases.
In my opinion this is really unnecessary.
I just wonder why Pig produces the output of a join the way it does.

Cheers,
Alex

Re: Behavior of JOIN

Posted by hc busy <hc...@gmail.com>.

Oh, I see what my confusion is... It's on nulls that JOIN behaves
differently in Pig than in SQL, right? That's where things differ.


On Thu, Jun 10, 2010 at 12:48 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> That's already what happens, because flattening a bag that is empty results
> in 0 rows, regardless of how many rows came out of the other bag.
>
> Alan.

Re: Behavior of JOIN

Posted by Alan Gates <ga...@yahoo-inc.com>.

That's already what happens, because flattening a bag that is empty  
results in 0 rows, regardless of how many rows came out of the other  
bag.

Alan.
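
The behavior described above can be sketched as follows, assuming hypothetical input files and schemas. Both variants are expected to yield the same rows: the FILTER only removes groups that FLATTEN would have turned into zero rows anyway.

```
-- Hypothetical inputs sharing the key name id.
A = LOAD 'a.txt' AS (id:int, x:chararray);
B = LOAD 'b.txt' AS (id:int, y:chararray);

C1 = COGROUP A BY id, B BY id;

-- Variant 1: no filter. A group whose A or B bag is empty contributes
-- no rows, because flattening an empty bag produces nothing.
J1 = FOREACH C1 GENERATE FLATTEN(A), FLATTEN(B);

-- Variant 2: drop such groups explicitly before flattening.
C2 = FILTER C1 BY NOT (IsEmpty(A) OR IsEmpty(B));
J2 = FOREACH C2 GENERATE FLATTEN(A), FLATTEN(B);
```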

On Jun 10, 2010, at 11:09 AM, hc busy wrote:

> Isn't that kind of annoying? Since JOIN in sql implicitly is an  
> inner join.
> Would have been great if
>
> C = JOIN A by id, B by id;
>
> is alias for
> C1 = COGROUP A by id, B by id;
> C2 = filter C1 by NOT (IsEmpty(A) OR IsEmpty(B));
> C = foreach C2 generate FLATTEN(A), FLATTEN(B);

Re: pig configuration

Posted by Ashutosh Chauhan <as...@gmail.com>.

ls $PIG_HOME/conf

log4j.properties: put your log4j settings here, then pass the file to Pig
with the -4 option on the pig command line.
pig-default.properties: properties used by Pig. Modifying properties
in this file will have no effect.
pig.properties: all user properties go in here.

Hope it helps,
Ashutosh

On Thu, Jun 10, 2010 at 14:37, Gang Luo <lg...@yahoo.com.cn> wrote:
> Hi all,
> I want to know whether there are configuration files that control the temporary files and log files generated by Pig.
>
> Thanks,
> -Gang

pig configuration

Posted by Gang Luo <lg...@yahoo.com.cn>.

Hi all,
I want to know whether there are configuration files that control the temporary files and log files generated by Pig.

Thanks,
-Gang

Re: Behavior of JOIN

Posted by hc busy <hc...@gmail.com>.

Isn't that kind of annoying? Since JOIN in SQL is implicitly an inner join,
it would have been great if

C = JOIN A by id, B by id;

were an alias for
C1 = COGROUP A by id, B by id;
C2 = filter C1 by NOT (IsEmpty(A) OR IsEmpty(B));
C = foreach C2 generate FLATTEN(A), FLATTEN(B);


On Tue, Jun 8, 2010 at 12:03 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Historically
>
> C = JOIN A by a, B by a;
>
> was defined in Pig Latin as shorthand for:
>
> C1 = COGROUP A by a, B by a;
> C = FOREACH C1 GENERATE flatten(A), flatten(B);
>
> which produces the doubling of keys.
>
> Also, given that Pig Latin does not require that key names be the same (as
> USING or NATURAL do in SQL) there would be issues if it did not have both
> keys in the output.  (For the same reason ON in SQL duplicates the keys in
> the results.)
>
> Alan.

Re: Behavior of JOIN

Posted by Alan Gates <ga...@yahoo-inc.com>.

Historically

C = JOIN A by a, B by a;

was defined in Pig Latin as shorthand for:

C1 = COGROUP A by a, B by a;
C = FOREACH C1 GENERATE flatten(A), flatten(B);

which produces the doubling of keys.

Also, given that Pig Latin does not require that key names be the same  
(as USING or NATURAL do in SQL) there would be issues if it did not  
have both keys in the output.  (For the same reason ON in SQL  
duplicates the keys in the results.)

Alan.
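
To illustrate the point about key names, here is a minimal sketch with hypothetical relations whose join keys are named differently (uid and customer are made-up names, as are the file names). Dropping either key from the JOIN output would mean picking one name arbitrarily, so both stay in C and the script projects the one it wants.

```
-- Hypothetical inputs: the join keys have different names.
A = LOAD 'users.txt'  AS (uid:int, name:chararray);
B = LOAD 'orders.txt' AS (customer:int, total:double);

-- C holds A's fields followed by B's fields, so both uid and customer appear.
C = JOIN A BY uid, B BY customer;

-- Keep a single copy of the key with an explicit projection.
D = FOREACH C GENERATE uid, name, total;
```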


Re: Behavior of JOIN

Posted by Syed Wasti <md...@hotmail.com>.

Curious to know the answer too.
To add to the duplicate-column issue: when I do the FOREACH projection after
the join, it errors out if the join fields have the same name in both
relations, because Pig doesn't know which field to pick.

E.g.  C = JOIN A BY (var1), B BY (var1);
      D = FOREACH C GENERATE var1, var2, var3;
You get the error below:
2010-06-08 11:19:49,396 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1025: Found more than one match: A::var1, B::var1

The workaround is to give the key a different name in B (var4 here):
C = JOIN A BY (var1), B BY (var4);
D = FOREACH C GENERATE var1, var2, var3;
And it works fine.
It just doesn't seem like an efficient approach.
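
A possible alternative to renaming the key, sketched under assumptions: the error message above already shows the qualified names A::var1 and B::var1, and a projection can refer to one of them directly. The file names and schemas below are hypothetical and only serve to make the script self-contained.

```
-- Hypothetical inputs: both relations call the join key var1.
A = LOAD 'a.txt' AS (var1:int, var2:chararray);
B = LOAD 'b.txt' AS (var1:int, var3:chararray);

C = JOIN A BY var1, B BY var1;

-- Qualify the ambiguous key with its source relation; var2 and var3
-- are unique in C and need no prefix.
D = FOREACH C GENERATE A::var1 AS var1, var2, var3;
```

Since both copies of the key hold the same value after an inner join, it should not matter which one is kept.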
