Posted to dev@pig.apache.org by "Zhang, Liyun" <li...@intel.com> on 2014/12/18 07:38:17 UTC

Is there any way to guarantee the sequence of “group” field as the input when using “group” operator in pig

Hi all,
   I ran into a problem: the “group” operator produces results in a different order on different engines such as "spark" and "mapreduce" (PIG-4282 <https://issues.apache.org/jira/browse/PIG-4282>).

groupdistinct.pig:
A = load 'input1.txt' as (age:int,gpa:int);
B = group A by age;
C = foreach B {
    D = A.gpa;
    E = distinct D;
    generate group, MIN(E);
};
dump C;

input1.txt is:
10 89
20 78
10 68
10 89
20 92
The mapreduce output is:
(10,68),(20,78)
The spark output is:
(20,78),(10,68)
These two results differ only because the order of the ‘group’ keys is not the same.

Is there any way to guarantee that the “group” operator returns the groups in the same order as the input?


Best regards
Zhang,Liyun
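
To make the comparison concrete, here is a minimal Java sketch (plain Java, not Pig code) that reproduces what the script computes on input1.txt and then compares the two engine outputs. The class and method names are hypothetical and chosen only for illustration. Under these assumptions, the two outputs contain the same tuples and differ only in their order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Pig.
class GroupDistinctMinSketch {

    // Groups (age, gpa) rows by age and returns the minimum distinct gpa per age.
    static Map<Integer, Integer> minDistinctGpaByAge(int[][] rows) {
        // A TreeSet per age keeps the distinct gpa values in sorted order.
        Map<Integer, TreeSet<Integer>> distinctByAge = new HashMap<>();
        for (int[] row : rows) {
            distinctByAge.computeIfAbsent(row[0], k -> new TreeSet<>()).add(row[1]);
        }
        Map<Integer, Integer> result = new HashMap<>();
        for (Map.Entry<Integer, TreeSet<Integer>> e : distinctByAge.entrySet()) {
            result.put(e.getKey(), e.getValue().first()); // smallest distinct gpa
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] rows = {{10, 89}, {20, 78}, {10, 68}, {10, 89}, {20, 92}};
        Map<Integer, Integer> result = minDistinctGpaByAge(rows);
        System.out.println(result.get(10)); // 68
        System.out.println(result.get(20)); // 78

        // The two engine outputs quoted in the message above.
        List<String> mapreduce = Arrays.asList("(10,68)", "(20,78)");
        List<String> spark = Arrays.asList("(20,78)", "(10,68)");
        System.out.println(mapreduce.equals(spark)); // false: list equality is order-sensitive
        System.out.println(new HashSet<>(mapreduce).equals(new HashSet<>(spark))); // true: same tuples
    }
}
```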

RE: Is there any way to guarantee the sequence of "group" field as the input when using "group" operator in pig

Posted by "Zhang, Liyun" <li...@intel.com>.

Hi Remi:
  Thanks for your reply. I agree that "group makes no guarantee by contract": the order of the results is not necessarily the same as the order of the input. So we need to make some changes in org.apache.pig.test.TestForEachNestedPlan.testInnerDistinct() and org.apache.pig.test.TestForEachNestedPlan.testInnerOrderByAliasReuse(), because those two tests check the grouped results against the input order. I have submitted PIG-4282_1.patch. Can anyone help review it? Many thanks.

TestForEachNestedPlan.testInnerDistinct(), line 219:

            List<Tuple> expectedResults =
                Util.getTuplesFromConstantTupleStrings(
                        new String[] {"(10,68)", "(20,78)"});

            int counter = 0;
            while (iter.hasNext()) {   // compares position by position, so it assumes the groups arrive in input order
                assertEquals(expectedResults.get(counter++).toString(),
                        iter.next().toString());
            }

            assertEquals(expectedResults.size(), counter);




Best Regards
Zhang,Liyun



-----Original Message-----
From: remi.catherinot@orange.com [mailto:remi.catherinot@orange.com] 
Sent: Thursday, December 18, 2014 10:56 PM
To: dev@pig.apache.org
Subject: RE: Is there any way to guarantee the sequence of "group" field as the input when using "group" operator in pig

Hi all,

If you need any kind of ordering in the output, use the "sort" operator (ORDER BY in Pig Latin). It was designed for exactly this need. Different engines produce differently ordered groups because of engine-specific optimizations. If you ask Pig to re-order the groups, you lose the benefit of those optimizations. I would rather keep group the way it is: I know I can rely on sort and pay its price when I need an ordering, and I get the best speed when I don't.

My conclusion: group makes no ordering guarantee by contract, so this is neither a problem nor a bug. It is a misuse of "group" where "sort" should be used.

Regards,
Remi


RE: Is there any way to guarantee the sequence of "group" field as the input when using "group" operator in pig

Posted by re...@orange.com.

Hi all,

If you need any kind of ordering in the output, use the "sort" operator (ORDER BY in Pig Latin). It was designed for exactly this need. Different engines produce differently ordered groups because of engine-specific optimizations. If you ask Pig to re-order the groups, you lose the benefit of those optimizations. I would rather keep group the way it is: I know I can rely on sort and pay its price when I need an ordering, and I get the best speed when I don't.

My conclusion: group makes no ordering guarantee by contract, so this is neither a problem nor a bug. It is a misuse of "group" where "sort" should be used.

Regards,
Remi
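
The trade-off described above can be sketched in plain Java, as an analogy only and not as a description of Pig internals. A hash-based grouping (here a HashMap) makes no promise about iteration order, and an explicit sort step is paid for only when a specific order is needed. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Pig.
class ExplicitOrderSketch {

    // Returns "(key,value)" strings sorted by key. This is the extra step
    // that only callers who need a deterministic order have to pay for.
    static List<String> sortedByKey(Map<Integer, Integer> groups) {
        List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(groups.entrySet());
        entries.sort(Map.Entry.comparingByKey());
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : entries) {
            out.add("(" + e.getKey() + "," + e.getValue() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // HashMap does not guarantee any iteration order by contract,
        // so code that depends on the order of its entries is fragile.
        Map<Integer, Integer> groups = new HashMap<>();
        groups.put(20, 78);
        groups.put(10, 68);

        // After the explicit sort, the order no longer depends on the map.
        System.out.println(sortedByKey(groups)); // [(10,68), (20,78)]
    }
}
```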


Re: Is there any way to guarantee the sequence of “group” field as the input when using “group” operator in pig

Posted by Rohini Palaniswamy <ro...@gmail.com>.

I see that the jira is for unit tests and not e2e tests. Please use

Util.checkQueryOutputsAfterSort(iter, expectedResults);

-Rohini
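
For readers without the Pig source at hand, the idea behind a sort-then-compare check can be sketched in plain Java as follows. This is not the implementation of Util.checkQueryOutputsAfterSort; the helper name sameAfterSort is hypothetical, and tuples are represented as strings for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration class; not part of Pig's test utilities.
class SortThenCompareSketch {

    // Drains the iterator, sorts both sides, and reports whether they match.
    // Sorting (rather than converting to sets) keeps duplicates significant.
    static boolean sameAfterSort(Iterator<String> actual, List<String> expected) {
        List<String> actualSorted = new ArrayList<>();
        while (actual.hasNext()) {
            actualSorted.add(actual.next());
        }
        Collections.sort(actualSorted);

        List<String> expectedSorted = new ArrayList<>(expected);
        Collections.sort(expectedSorted);

        return actualSorted.equals(expectedSorted);
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("(10,68)", "(20,78)");

        // Same tuples in a different order: the check passes.
        System.out.println(sameAfterSort(
                Arrays.asList("(20,78)", "(10,68)").iterator(), expected)); // true

        // A missing tuple is still detected.
        System.out.println(sameAfterSort(
                Arrays.asList("(20,78)").iterator(), expected)); // false
    }
}
```

Compared with the position-by-position loop in the existing test, a check of this shape does not depend on the order in which an engine emits the groups.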

On Mon, Dec 22, 2014 at 6:39 PM, Rohini Palaniswamy <rohini.aditya@gmail.com> wrote:
>
> Usually I have been fixing these kinds of tests by adding an order by when
> I added new tests for Union for Tez. In this case you can add order by
> after the distinct in the nested foreach.
>
> Daniel,
>     Any better suggestions?
>
> Regards,
> Rohini

Re: Is there any way to guarantee the sequence of “group” field as the input when using “group” operator in pig

Posted by Rohini Palaniswamy <ro...@gmail.com>.

Usually I fix these kinds of tests by adding an order by, as I did when I
added new tests for Union for Tez. In this case you can add order by
after the distinct in the nested foreach.

Daniel,
    Any better suggestions?

Regards,
Rohini
