Posted to user@pig.apache.org by zhang jianfeng <zj...@gmail.com> on 2009/04/16 05:41:10 UTC

Does Pig ensure that the two bags are in the same order?

My script:

B = FOREACH A GENERATE f1,f2,f3;
C = GROUP B BY f1;
D = FOREACH C GENERATE group, myudf(C.f2,C.f3);

My question is: Are C.f2 and C.f3 in the same order?

I mean I want to iterate over C.f2 and C.f3 together, so I want to make sure
that the n-th item in C.f2 and the n-th item in C.f3 come from the same tuple.


Does anyone know?

Thank you.

Jeff Zhang

Re: Does Pig ensure that the two bags are in the same order?

Posted by zhang jianfeng <zj...@gmail.com>.
Thank you for your reply.

This is very helpful to me.



On Thu, Apr 16, 2009 at 4:01 PM, Mridul Muralidharan
<mr...@yahoo-inc.com> wrote:

> There are a couple of errors in the script.
> After
> C = GROUP B BY f1;
>
> the schema would be: group, B:{(f1, f2, f3)}
>
> That is, each tuple has a first field ($0) called group (which holds your
> group key, f1) and a second field, a bag called B, which contains all the
> tuples of B that matched the group key.
>
> So to access f1, f2, etc., you have to go through B (B.f1), not reference
> them directly as f1, etc. (or through C, as in your script).
>
> Secondly, B.f1 gives you a bag that projects out all the f1 values from the
> B bag; similarly, B.f2 projects out all the f2 values.
>
> So modifying your last statement to:
> D = FOREACH C GENERATE group, myudf(B.f2, B.f3);
>
> would give your udf two bags as input, one holding the f2 values and one
> holding the f3 values, with no correlation between the two bags (as in, you
> cannot tell which field in the f2 bag corresponds to which field in the f3
> bag).
>
> You could instead use:
>
> D = FOREACH C GENERATE group, myudf(B.(f2,f3));
>
> This will give the udf a single bag whose tuples each contain two fields,
> f2 and f3, taken from the same tuple in B (so the fields correspond).
>
> Hope this helps.
> Regards,
> Mridul

Re: Does Pig ensure that the two bags are in the same order?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

There are a couple of errors in the script.
After
C = GROUP B by f1;

the schema would be: group, B:{(f1, f2, f3)}

That is, each tuple has a first field ($0) called group (which holds your
group key, f1) and a second field, a bag called B, which contains all the
tuples of B that matched the group key.


So to access f1, f2, etc., you have to go through B (B.f1), not reference
them directly as f1, etc. (or through C, as in your script).
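
To make the shape of C concrete, here is a rough plain-Python model of the
grouping step. It is only a sketch of the semantics described above, not Pig
code: the sample rows and the helper name group_by_first_field are made up
for illustration, and Python lists stand in for bags even though a real bag
carries no order.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for relation B: (f1, f2, f3).
rows = [("a", 1, 10), ("b", 2, 20), ("a", 3, 30)]


def group_by_first_field(relation):
    """Model of GROUP B BY f1: map each distinct f1 to a bag of whole tuples."""
    bags = defaultdict(list)
    for row in relation:
        # The bag keeps the complete tuple, including the f1 key itself.
        bags[row[0]].append(row)
    return dict(bags)


grouped = group_by_first_field(rows)
for key, bag in grouped.items():
    # Each entry plays the role of one (group, B) tuple in C.
    print(key, bag)
```

Note that f2 and f3 are only reachable through the bag in each entry, which
is why the script has to say B.f2 rather than f2 or C.f2.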



Secondly, B.f1 gives you a bag that projects out all the f1 values from the
B bag; similarly, B.f2 projects out all the f2 values.

So modifying your last statement to:
D = FOREACH C GENERATE group, myudf(B.f2, B.f3);

would give your udf two bags as input, one holding the f2 values and one
holding the f3 values, with no correlation between the two bags (as in,
you cannot tell which field in the f2 bag corresponds to which field in
the f3 bag).
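
The risk of pairing two separately projected bags by position can be
illustrated with a small plain-Python model. This is a sketch under
assumptions, not Pig's actual behaviour: the sample data and the project
helper are hypothetical, and the reversed f3 bag simply represents one
ordering that is equally valid for an unordered bag.

```python
# Hypothetical contents of bag B for one group: tuples of (f1, f2, f3).
bag_b = [("a", 1, 10), ("a", 3, 30), ("a", 5, 50)]


def project(bag, index):
    """Model of B.fN: a new bag of one-field tuples."""
    return [(row[index],) for row in bag]


f2_bag = project(bag_b, 1)
# A bag has no defined order, so this reversed list is the same bag of f3
# values as the unreversed one. It is an assumed ordering for illustration.
f3_bag = list(reversed(project(bag_b, 2)))

true_pairs = {(row[1], row[2]) for row in bag_b}
zipped_pairs = {(f2[0], f3[0]) for f2, f3 in zip(f2_bag, f3_bag)}

# Pairs produced by positional matching that never occurred in B.
wrong_pairs = sorted(zipped_pairs - true_pairs)
print(wrong_pairs)
```

Both bags still hold exactly the right values; what is lost is the link
between a given f2 and the f3 that came from the same tuple.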

You could instead use:

D = FOREACH C GENERATE group, myudf(B.(f2,f3));

This will give the udf a single bag whose tuples each contain two fields,
f2 and f3, taken from the same tuple in B (so the fields correspond).
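
As a rough picture of what the udf sees in that case, the following
plain-Python sketch models the B.(f2,f3) projection. The data, the
project_fields helper, and the body of my_udf (a sum of f2 * f3) are all
assumptions chosen for illustration; the original myudf is not shown in this
thread.

```python
# Hypothetical contents of bag B for one group: tuples of (f1, f2, f3).
bag_b = [("a", 1, 10), ("a", 3, 30), ("a", 5, 50)]


def project_fields(bag, indexes):
    """Model of B.(f2,f3): one bag whose tuples keep the chosen fields together."""
    return [tuple(row[i] for i in indexes) for row in bag]


def my_udf(pairs):
    """Hypothetical stand-in for myudf: sums f2 * f3 over the paired fields."""
    return sum(f2 * f3 for f2, f3 in pairs)


pairs = project_fields(bag_b, (1, 2))
result = my_udf(pairs)
print(result)
```

Because each tuple carries its own f2 and f3, the udf gets the same answer
whatever order the bag is iterated in.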


Hope this helps.
Regards,
Mridul



