Posted to dev@crunch.apache.org by "Josh Wills (JIRA)" <ji...@apache.org> on 2012/07/21 19:10:33 UTC

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419880#comment-13419880 ] 

Josh Wills commented on CRUNCH-23:
----------------------------------

We're going to need something along the lines of this:

http://chasebradford.wordpress.com/2010/12/12/reusable-total-order-sorting-in-hadoop/
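
As a rough illustration of the idea behind total-order sorting, here is a plain-Java sketch. It is not Crunch or Hadoop code, and the class and method names (RangePartitionSketch, partitionFor, totalSort) are made up for this example. It assumes the split points are already known and strictly increasing; in a real job they would typically be estimated by sampling the input keys. Each key is routed to the partition whose range contains it, each partition is sorted on its own (as a reducer would do), and the partitions are concatenated in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RangePartitionSketch {

    // Returns the partition index for a key, given strictly increasing split points.
    // Partition 0 holds keys < splitPoints[0]; partition i holds keys in
    // [splitPoints[i-1], splitPoints[i]); the last partition holds the rest.
    static int partitionFor(int key, int[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // Found: the key equals a split point and belongs to the partition to its right.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    // Simulates a range-partitioned sort: distribute, sort each partition, concatenate.
    static List<Integer> totalSort(List<Integer> values, int[] splitPoints) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int v : values) {
            partitions.get(partitionFor(v, splitPoints)).add(v);
        }
        List<Integer> result = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition); // the per-reducer sort
            result.addAll(partition);    // safe to concatenate: key ranges do not overlap
        }
        return result;
    }

    public static void main(String[] args) {
        int[] splitPoints = {10, 20}; // hypothetical split points, giving 3 partitions
        List<Integer> input = Arrays.asList(25, 3, 10, 17, 8, 42);
        System.out.println(totalSort(input, splitPoints)); // prints [3, 8, 10, 17, 25, 42]
    }
}
```

With hash partitioning, the per-partition outputs would each be sorted but their key ranges would overlap, which is the behaviour described in this issue. Range partitioning is what makes the concatenated output globally sorted.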
                
> PCollection#sort doesn't do a full sort on values
> -------------------------------------------------
>
>                 Key: CRUNCH-23
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-23
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>         Attachments: SortTest.java
>
>
> When a PCollection is sorted (using PCollection#sort), the sort is performed only within each reducer; it is not a total sort over all values. As a result, the values are not in sorted order when a materialized collection is iterated over, and the sorted files output by a sort operation cannot simply be concatenated into a single sorted file.
