Posted to dev@pig.apache.org by "Shravan Matthur Narayanamurthy (JIRA)" <ji...@apache.org> on 2008/09/12 22:29:44 UTC

[jira] Commented: (PIG-364) Limit return incorrect records when we use multiple reducer

    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630673#action_12630673 ] 

Shravan Matthur Narayanamurthy commented on PIG-364:
----------------------------------------------------

The patch follows the approach mentioned, but I think the approach itself has a problem. I tested both with and without the patch, and the problem exists in both cases. Consider the following script:

A = load 'st10M';
B = limit A 10;
C = filter B by 2>1 parallel 10;
dump C;

I would expect to see only 10 tuples here, but I see 40. The reason is that 4 mappers produced 40 tuples in all, and the filter requested 10 reducers, since limit crosses the map-reduce boundary. No limiting was done on the reduce side because the 10 reducers together have the capacity to pass 10*10=100 tuples.
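A back-of-the-envelope sketch (plain Python, not Pig code; the mapper/reducer counts are just the numbers from the example above) of why 40 tuples survive instead of 10:

```python
# Each mapper applies LIMIT 10 locally, so 4 mappers emit 40 tuples.
# Each of the 10 reducers also enforces LIMIT 10 independently, so the
# reduce side can pass up to 100 tuples and never trims the 40.

MAPPERS = 4       # mappers reading 'st10M' in the example
REDUCERS = 10     # "parallel 10" requested by the filter
LIMIT = 10        # "limit A 10"

map_output = MAPPERS * LIMIT         # 40 tuples reach the shuffle
reduce_capacity = REDUCERS * LIMIT   # up to 100 tuples can pass

tuples_returned = min(map_output, reduce_capacity)
print(tuples_returned)  # 40, not the expected 10
```

With a single reducer the same arithmetic gives min(40, 10) = 10, which is why limit is correct only when the reduce parallelism is actually 1.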

The problem is that we cannot guarantee the parallelism by setting the requested parallelism to some value while visiting limit, since it can be modified further down the line, as the example shows. The same thing happens even with the limiter in the reducer, as in the following script, where I see 97 tuples instead of the expected 10. If everything went right (that is, no tuples were clipped) it would have been 100, and the actual output is pretty close:

A = load 'st10M';
B1 = group A by $0 parallel 10;
B = limit B1 10;
C = filter B by 2>1 parallel 10;
dump C;

One way I could think of is to terminate the new reduce phase with a store, thereby ensuring that the parallelism is the one set during limit, and then start a new MapReduceOper by loading this file. Another way is to disable further changes to the reduce parallelism by maintaining a flag that controls changes to the parallelism of the map-reduce operator. But this would mean disobeying the user's request, which might be meaningless in some places. Also, the semantics of parallel would be distorted when limit is in the picture.
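A minimal sketch of the second option (the class and field names here are invented for illustration and are not Pig's actual compiler classes): once the limit visitor pins a MapReduceOper's reduce parallelism, a flag blocks later PARALLEL requests from overriding it.

```python
# Hypothetical model of freezing reduce parallelism after LIMIT.
# In real Pig this would live in the MR-plan compiler; names are made up.

class MapReduceOper:
    def __init__(self):
        self.requested_parallelism = -1   # -1 = not yet set
        self.parallelism_frozen = False   # set once LIMIT is visited

    def set_parallelism(self, n):
        """Called for PARALLEL hints from later operators (e.g. filter)."""
        if self.parallelism_frozen:
            # Silently ignore the user's PARALLEL request; this is the
            # "disobey the user's request" downside noted above.
            return
        self.requested_parallelism = n

    def visit_limit(self):
        """LIMIT needs a single reducer so only k tuples survive."""
        self.requested_parallelism = 1
        self.parallelism_frozen = True


op = MapReduceOper()
op.visit_limit()          # limit pins parallelism to 1 and freezes it
op.set_parallelism(10)    # the later "parallel 10" is now a no-op
print(op.requested_parallelism)  # 1
```

The first option (store, then reload in a fresh MapReduceOper) avoids distorting the semantics of parallel, at the cost of an extra map-reduce job and an intermediate file.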

> Limit return incorrect records when we use multiple reducer
> -----------------------------------------------------------
>
>                 Key: PIG-364
>                 URL: https://issues.apache.org/jira/browse/PIG-364
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Daniel Dai
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>
>         Attachments: PIG-364.patch
>
>
> Currently we put Limit(k) operator in the reducer plan. However, in the case of n reducer, we will get up to n*k output. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.