Posted to dev@pig.apache.org by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org> on 2008/09/18 22:12:44 UTC

[jira] Issue Comment Edited: (PIG-364) Limit return incorrect records when we use multiple reducer

    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632365#action_12632365 ] 

shravanmn edited comment on PIG-364 at 9/18/08 1:11 PM:
-----------------------------------------------------------------------------

Consider the following script:
{noformat}
a = load 'file:/etc/passwd';
b = limit a 10;
c = filter b by 2>1 parallel 10;
split c into c1 if 2>1, c2 if 2>1;
d = group c1 by $0;
e = group c2 by $0;
f = group d by $0, e by $0;
dump f;
{noformat}

This is a case where multiple MROps are generated at the split, as shown in the figure below, if I understand the code correctly.

!https://issues.apache.org/jira/secure/attachment/12390412/limitsplit.png!

When the job controller sees this graph of MROps, it first schedules the LD MROp. Recall that the limitadjuster has by now changed the output of this op to some temporary file. Once LD finishes, the controller is free to schedule either the Lim Adj Op or the 2-LRs Op, since the dependency of both has just been resolved. If it chooses the 2-LRs Op, that op tries to read the original output of the split, which does not exist yet because the Lim Adj Op has not run, and it fails. If it chooses the Lim Adj Op first, things go fine.

To avoid this, we need to disconnect all the successors of the LD MROp, make the Lim Adj Op their predecessor, and connect the Lim Adj Op to LD, as indicated in the figure.
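
The rewiring can be illustrated with a small sketch. This is a simplified model, not Pig's actual plan classes: the dictionary-of-predecessors graph, the op names, and the helpers `ready_ops` and `insert_after` are all hypothetical, and the graph is assumed to contain only the three ops discussed above.

```python
# Hypothetical model of the MROp graph: each op maps to the set of ops it depends on.
# Names and helpers are illustrative only; they are not part of Pig's API.

def ready_ops(preds, done):
    """Return the ops that have not run yet and whose predecessors have all finished."""
    return sorted(op for op, ps in preds.items() if op not in done and ps <= done)

def insert_after(preds, op, new_op):
    """Rewire so that every old successor of `op` depends on `new_op`,
    and `new_op` depends only on `op`."""
    rewired = {}
    for node, ps in preds.items():
        if node == new_op:
            continue
        rewired[node] = {new_op if p == op else p for p in ps}
    rewired[new_op] = {op}
    return rewired

# Assumed shape before the fix: both ops become schedulable once LD is done.
before = {"LD": set(), "LimAdj": {"LD"}, "2-LRs": {"LD"}}
after = insert_after(before, "LD", "LimAdj")

print(ready_ops(before, {"LD"}))  # ['2-LRs', 'LimAdj']  (controller may pick either)
print(ready_ops(after, {"LD"}))   # ['LimAdj']           (only safe choice remains)
```

With the rewired graph, the 2-LRs op only becomes ready after the Lim Adj Op has finished, so the controller can no longer pick the failing order.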

Let me know if my understanding is wrong.

> Limit return incorrect records when we use multiple reducer
> -----------------------------------------------------------
>
>                 Key: PIG-364
>                 URL: https://issues.apache.org/jira/browse/PIG-364
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: types_branch
>
>         Attachments: limitsplit.png, PIG-364-2.patch, PIG-364.patch
>
>
> Currently we put the Limit(k) operator in the reducer plan. However, with n reducers, we will get up to n*k output records.
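
The n*k behaviour in the description can be reproduced with a small sketch. It is a plain-Python model of the arithmetic, not Pig code: `limit_per_reducer`, the partition contents, and the final single-pass truncation are assumptions made for illustration (the last step is presumably what the Lim Adj Op in the comment above is for).

```python
def limit_per_reducer(partitions, k):
    """Apply Limit(k) independently inside each reducer's partition."""
    return [part[:k] for part in partitions]

k = 10
n = 4
# Hypothetical input: each of the n reducers receives 25 records.
partitions = [list(range(i * 100, i * 100 + 25)) for i in range(n)]

limited = limit_per_reducer(partitions, k)
total = sum(len(part) for part in limited)
print(total)  # 40, i.e. n*k records instead of k

# Assumed remedy: one more pass over the combined output, truncated to k.
final = [rec for part in limited for rec in part][:k]
print(len(final))  # 10
```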
