Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2011/03/04 22:51:46 UTC
[jira] Resolved: (PIG-841) PERFORMANCE: The sample MR job in order by (or joins which require sampling) implementation can use Hadoop sorting instead of doing a POSort
[ https://issues.apache.org/jira/browse/PIG-841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich resolved PIG-841.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.8.0
> PERFORMANCE: The sample MR job in order by (or joins which require sampling) implementation can use Hadoop sorting instead of doing a POSort
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PIG-841
> URL: https://issues.apache.org/jira/browse/PIG-841
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.3.0
> Reporter: Pradeep Kamath
> Fix For: 0.8.0
>
>
> Currently, the sample MapReduce job in the order by implementation does the following:
> 1. samples 100 records from each map
> 2. does a group all on the above output
> 3. sorts the output bag from the above grouping on the keys of the order by
> 4. gives the sorted bag to the FindQuantiles UDF
> Steps 2 and 3 above can be replaced by a single step:
> - group the sample output by the order by key and set the parallelism of the group to 1, so that the output of the group goes to one reducer. Since Hadoop delivers the group's keys to the reducer in sorted order, we get the sorting for free without using POSort.
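The difference between the two plans can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not Pig's actual implementation: the function names (`sample_per_map`, `find_quantiles`, `current_plan`, `proposed_plan`) are hypothetical, the quantile selection is a simplified stand-in for the FindQuantiles UDF, and `sorted()` stands in for the sort that Hadoop performs during the shuffle to a single reducer.

```python
import random
from itertools import groupby


def sample_per_map(splits, n=100, seed=0):
    """Simulate each map task emitting up to n sampled keys (hypothetical helper)."""
    rng = random.Random(seed)
    samples = []
    for split in splits:
        samples.extend(rng.sample(split, min(n, len(split))))
    return samples


def find_quantiles(sorted_keys, num_partitions):
    """Simplified stand-in for FindQuantiles: pick num_partitions - 1 boundary keys."""
    if not sorted_keys or num_partitions < 2:
        return []
    step = len(sorted_keys) / num_partitions
    return [sorted_keys[int(i * step)] for i in range(1, num_partitions)]


def current_plan(samples, num_partitions):
    """Group all, then sort the resulting bag explicitly (the POSort step)."""
    bag = list(samples)   # GROUP ALL: every sample lands in one bag
    bag.sort()            # explicit sort of the bag inside the reducer
    return find_quantiles(bag, num_partitions)


def proposed_plan(samples, num_partitions):
    """Group by the order by key with parallelism 1; no explicit bag sort."""
    # Assumption: sorted() models the framework's shuffle sort, which hands
    # keys to the single reducer in key order.
    shuffled = sorted(samples)
    keys_in_order = []
    for _key, group in groupby(shuffled):
        # The reducer sees one group per key, already in key order.
        keys_in_order.extend(group)
    return find_quantiles(keys_in_order, num_partitions)
```

Under these assumptions both plans hand the same sorted sequence to the quantile step, so they yield the same partition boundaries; the proposed plan simply relies on the framework's sort rather than sorting a bag in the reducer.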