Posted to dev@pig.apache.org by "Dick King (JIRA)" <ji...@apache.org> on 2009/05/01 00:21:30 UTC

[jira] Commented: (PIG-791) reducing number of MR stages with ORDER BY

    [ https://issues.apache.org/jira/browse/PIG-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704815#action_12704815 ] 

Dick King commented on PIG-791:
-------------------------------

I am considering a modification to Hadoop that would let users declare that a map/reduce output is (a) sorted, (b) likely to be the input to another map/reduce job whose mapper re-emits selected keys unchanged, with a selection probability that is not correlated with the ordering, and (c) sorted in the same order that the second map/reduce job uses.

It would work by writing a sample file as a secondary output of the mapper in the first map/reduce.
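
A minimal sketch of the idea, not an actual Hadoop API: records pass through to the primary output unchanged while a fixed-size uniform sample of their keys is collected for a secondary output. Reservoir sampling is an assumption here (the comment does not say how the sample would be drawn), and every name below (`write_with_sample`, `sample_size`, the in-memory lists standing in for files) is hypothetical.

```python
import random


def write_with_sample(records, sample_size, seed=0):
    """Pass (key, value) records through while reservoir-sampling their keys.

    Hypothetical illustration: `primary` stands in for the main output
    file and `sample` for the secondary sample file.
    """
    rng = random.Random(seed)
    primary = []
    sample = []
    for i, (key, value) in enumerate(records):
        primary.append((key, value))
        if len(sample) < sample_size:
            # Fill the reservoir with the first `sample_size` keys.
            sample.append(key)
        else:
            # Keep the new key with probability sample_size / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                sample[j] = key
    return primary, sample
```

A later job that uses the same sort order could then read the sample to choose partition boundaries instead of running its own sampling pass.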

This proposal is on my back burner, but could potentially be moved up.

Would that functionality be generally useful here?




> reducing number of MR stages with ORDER BY
> ------------------------------------------
>
>                 Key: PIG-791
>                 URL: https://issues.apache.org/jira/browse/PIG-791
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>
> When an order by is not the only operation in a Pig script, it is done in two additional MR jobs. The first job samples using a sampling loader; the second does the sort. The sample is used to construct a partitioner that evenly balances the data in the sort. The sampler needs to be changed to be an EvalFunc instead of a loader. This way a split can be put in the preceding MR job, with the main data being written out and the other part flowing to the sampler func, which can then write out the sample. The final MR job can then be the sort.
> This change depends on multiquery code.
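
For readers unfamiliar with how a sample can drive a balanced sort, the following is a rough sketch under stated assumptions, not Pig's actual partitioner code. It takes evenly spaced quantiles of the sampled keys as cut points and routes each key to a reducer by binary search. The function names `cut_points` and `partition_for` are hypothetical.

```python
from bisect import bisect_right


def cut_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys at evenly spaced quantiles."""
    ordered = sorted(sample)
    if not ordered:
        return []
    n = len(ordered)
    return [ordered[(i * n) // num_partitions] for i in range(1, num_partitions)]


def partition_for(key, cuts):
    """Return the index of the partition (reducer) that should receive key.

    Keys below the first cut go to partition 0; keys at or above the
    last cut go to the final partition.
    """
    return bisect_right(cuts, key)
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the sorted reducer outputs in partition order yields a total order, and the quantile-based cuts keep the partitions roughly equal in size when the sample is representative.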

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.