Posted to dev@pig.apache.org by "Mohit Sabharwal (JIRA)" <ji...@apache.org> on 2015/05/09 00:31:01 UTC

[jira] [Commented] (PIG-4542) OutputConsumerIterator should flush buffered records

    [ https://issues.apache.org/jira/browse/PIG-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535769#comment-14535769 ] 

Mohit Sabharwal commented on PIG-4542:
--------------------------------------

FYI [~kellyzly], [~praveenr019], [~xuefuz], this patch:

- Gets rid of the use of RDD.count() in CollectedGroupConverter and StreamConverter.
- Adds an abstract method endInput() which is executed before we call getNextTuple() for the last time.
- Deletes POStreamSpark since it was just handling the last record.
- Removes special code in POCollectedGroup to handle the last record.
- While I was here, renamed POOutputConsumerIterator to OutputConsumerIterator. The "PO" prefix should only be used for physical operators.
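
To illustrate the endInput() idea, here is a minimal, self-contained sketch. It is not the code in the patch: FlushSketch, BufferingIterator, RunGrouper, attach(), poll() and groupRuns() are hypothetical names made up for this example, and plain integers stand in for tuples. The point it shows is that the wrapper can detect the end of input from the input iterator's own hasNext(), so no record count is needed to know when to flush.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class FlushSketch {

    /** Hypothetical stand-in for an output-consuming iterator; not Pig's actual class. */
    abstract static class BufferingIterator<I, O> implements Iterator<O> {
        private final Iterator<I> input;
        private boolean endSignalled = false;
        private O pending; // next output to hand out, or null if none is ready

        BufferingIterator(Iterator<I> input) {
            this.input = input;
        }

        /** Feed one input record to the (possibly buffering) operator. */
        protected abstract void attach(I record);

        /** Called exactly once, after the last input record has been attached. */
        protected abstract void endInput();

        /** Return the next ready output, or null if the operator needs more input. */
        protected abstract O poll();

        @Override
        public boolean hasNext() {
            while (pending == null) {
                pending = poll();
                if (pending != null) {
                    return true;
                }
                if (input.hasNext()) {
                    attach(input.next());
                } else if (!endSignalled) {
                    endSignalled = true;
                    endInput(); // let the operator flush whatever it still buffers
                } else {
                    return false; // input exhausted and nothing left to flush
                }
            }
            return true;
        }

        @Override
        public O next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            O out = pending;
            pending = null;
            return out;
        }
    }

    /** Example operator: groups runs of equal consecutive values, buffering the current run. */
    static final class RunGrouper extends BufferingIterator<Integer, List<Integer>> {
        private final Deque<List<Integer>> ready = new ArrayDeque<>();
        private List<Integer> current = new ArrayList<>();

        RunGrouper(Iterator<Integer> input) {
            super(input);
        }

        @Override
        protected void attach(Integer record) {
            if (!current.isEmpty() && !current.get(0).equals(record)) {
                ready.add(current); // the previous run is complete
                current = new ArrayList<>();
            }
            current.add(record);
        }

        @Override
        protected void endInput() {
            if (!current.isEmpty()) {
                ready.add(current); // flush the last, still-buffered run
                current = new ArrayList<>();
            }
        }

        @Override
        protected List<Integer> poll() {
            return ready.poll();
        }
    }

    /** Convenience driver: collects every group the iterator produces. */
    static List<List<Integer>> groupRuns(List<Integer> records) {
        List<List<Integer>> out = new ArrayList<>();
        Iterator<List<Integer>> it = new RunGrouper(records.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Without the endInput() call, the final run would stay in the operator's buffer and never be emitted. That is the case the RDD.count() comparison was previously used to catch.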

> OutputConsumerIterator should flush buffered records
> ----------------------------------------------------
>
>                 Key: PIG-4542
>                 URL: https://issues.apache.org/jira/browse/PIG-4542
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: Mohit Sabharwal
>            Assignee: Mohit Sabharwal
>             Fix For: spark-branch
>
>         Attachments: PIG-4542.patch
>
>
> Certain operators may buffer the output. We need to flush the last set of records from such operators, when we encounter the last input record, before calling getNextTuple() for the last time.
> Currently, to flush the last set of records, we compute RDD.count() and compare the count with a running counter to determine if we have reached the last record. This is unnecessary and inefficient.


