
Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2015/02/04 22:34:34 UTC

[jira] [Updated] (PIG-4392) RANK BY fails when default_parallel is greater than cardinality of field being ranked by

     [ https://issues.apache.org/jira/browse/PIG-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-4392:
----------------------------
    Attachment: PIG-4392-2.patch

Thanks for the review. I found another issue while working on the patch: the sort job does not pick up the parallelism of the rank operator. The new patch also fixes this issue. The reversed order is caused by the logic Pig uses to combine small splits: small splits are sorted in descending order of part file size. In this case, part-00003 is the largest and part-00001 is the smallest. That's why the tuple in part-00003 (7 8 9) appears first.
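
The ordering effect described above can be modelled with a rough sketch. This is not Pig's actual split-combination code, only a simplified Python illustration of "order splits largest-first, then read them in that order". The byte sizes, the part-00002 entry, and the function name combine_small_splits are assumptions made up for the example; only the relative sizes of part-00001 and part-00003 come from the comment above.

```python
# Simplified model of combining small splits, NOT Pig's real implementation.
# Sizes (in bytes) are hypothetical; only "part-00003 is the largest and
# part-00001 is the smallest" is taken from the JIRA comment.
parts = [
    ("part-00001", 10, [(1, 2, 3)]),
    ("part-00002", 20, [(4, 5, 6)]),  # assumed middle-sized part file
    ("part-00003", 30, [(7, 8, 9)]),
]


def combine_small_splits(part_files):
    """Order part files by size, largest first, and concatenate their tuples."""
    ordered = sorted(part_files, key=lambda part: part[1], reverse=True)
    return [row for _name, _size, rows in ordered for row in rows]


combined = combine_small_splits(parts)
print(combined)
```

Under these assumed sizes, the tuple from part-00003 is read first even though its file name sorts last, which matches the reversed output seen in the review.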

I also found an issue in Tez mode. I will create a separate ticket for it.

> RANK BY fails when default_parallel is greater than cardinality of field being ranked by
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-4392
>                 URL: https://issues.apache.org/jira/browse/PIG-4392
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11.1
>            Reporter: Anthony Hsu
>            Assignee: Daniel Dai
>             Fix For: 0.15.0
>
>         Attachments: PIG-4392-1.patch, PIG-4392-2.patch
>
>
> To reproduce:
> {code:title=input.txt}
> 1 2 3
> 4 5 6
> 7 8 9
> {code}
> {code:title=rank.pig}
> set default_parallel 4;
> d = load 'input.txt' using PigStorage(' ') as (a:int, b:int, c:int);
> e = rank d by a;
> dump e;
> {code}
> If {{default_parallel}} is set to {{3}}, the script succeeds. So I'm guessing RANK BY has issues if the {{default_parallel}} exceeds the cardinality of the field being ranked by.
> I'm seeing this issue with Pig 0.11.1 (which has the PIG-2932 patch applied) and Hadoop 2.3.0.


