Posted to dev@crunch.apache.org by "Josh Wills (JIRA)" <ji...@apache.org> on 2013/03/14 06:16:13 UTC

[jira] [Updated] (CRUNCH-51) PCollection#sort relies on using a single reducer for total order sorting

     [ https://issues.apache.org/jira/browse/CRUNCH-51?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-51:
-----------------------------

    Attachment: CRUNCH-51.patch

Here's my (still incomplete and ugly) take on this. It builds on the reservoir sampling code that was just added and on the notion of dependencies across Crunch jobs that we introduced for map-side joins. I'm not sure it's ready for a review yet, but I wanted to get it posted in case I get hit by a bus.
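
For readers unfamiliar with the sampling step, here is a rough sketch of how a reservoir sample of keys can be turned into partition boundaries. It is plain Java and is not taken from the attached patch or from the Crunch API; the class name SplitPointSampler and both method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of Crunch or the attached patch.
final class SplitPointSampler {

    // Reservoir sampling (Algorithm R): one pass over the keys, keeping a
    // uniform random sample of at most sampleSize elements.
    static <K> List<K> reservoirSample(Iterable<K> keys, int sampleSize, Random rng) {
        List<K> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        for (K key : keys) {
            seen++;
            if (reservoir.size() < sampleSize) {
                reservoir.add(key);
            } else {
                // Pick a slot in [0, seen); the key replaces an existing sample
                // only if the slot falls inside the reservoir.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, key);
                }
            }
        }
        return reservoir;
    }

    // Sort the sample and take numPartitions - 1 evenly spaced keys as boundaries.
    static <K extends Comparable<K>> List<K> splitPoints(List<K> sample, int numPartitions) {
        List<K> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<K> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions && !sorted.isEmpty(); i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            splits.add(sorted.get(idx));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            keys.add(i);
        }
        Collections.shuffle(keys, new Random(42));

        List<Integer> sample = reservoirSample(keys, 200, new Random(7));
        // Three boundaries for four reducers; exact values depend on the sample.
        System.out.println(splitPoints(sample, 4));
    }
}
```

In a multi-job pipeline the sampling pass has to finish before the sort job can be configured with the boundaries, which is presumably where the cross-job dependencies come in.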
                
> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-51
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-51
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>         Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch, CRUNCH-51.patch, CRUNCH-51.patch, SortTest.java
>
>
> The total-order sorting provided by the Sort class (and therefore PCollection#sort) relies on a single reducer. This is very inefficient for large datasets; it should use a total order partitioner instead.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).
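
As a minimal illustration of the total order partitioning idea described in the issue: if each key is routed to a reducer by range rather than by hash, every reducer can sort its own range and the outputs, concatenated in reducer order, are globally sorted. The sketch below is plain Java, not Crunch or Hadoop code; the class RangePartitionSketch, the method partitionFor, and the split points are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch for illustration; not part of Crunch or Hadoop.
final class RangePartitionSketch {

    // Returns the index of the partition whose key range contains the key.
    // splitPoints must be sorted ascending; n split points define n + 1 partitions.
    static <K extends Comparable<K>> int partitionFor(K key, List<K> splitPoints) {
        int pos = Collections.binarySearch(splitPoints, key);
        // A key equal to a split point goes to the partition on its right;
        // otherwise binarySearch encodes the insertion point as -(point) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        List<Integer> splits = Arrays.asList(10, 20, 30); // illustrative boundaries
        List<Integer> keys = Arrays.asList(42, 7, 19, 30, 10, 3, 25);

        // One bucket per simulated reducer.
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= splits.size(); i++) {
            partitions.add(new ArrayList<>());
        }
        for (Integer key : keys) {
            partitions.get(partitionFor(key, splits)).add(key);
        }

        // Each "reducer" sorts only its own range; concatenating the
        // partitions in order yields a total order.
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition);
            merged.addAll(partition);
        }
        System.out.println(merged); // prints [3, 7, 10, 19, 25, 30, 42]
    }
}
```

How evenly the work is spread depends entirely on how well the split points match the real key distribution, which is why the boundaries are usually derived from a sample of the input.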
