Posted to dev@crunch.apache.org by "Gabriel Reid (JIRA)" <ji...@apache.org> on 2012/07/21 13:39:33 UTC

[jira] [Created] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Gabriel Reid created CRUNCH-23:
----------------------------------

             Summary: PCollection#sort doesn't do a full sort on values
                 Key: CRUNCH-23
                 URL: https://issues.apache.org/jira/browse/CRUNCH-23
             Project: Crunch
          Issue Type: Bug
            Reporter: Gabriel Reid


When a PCollection is sorted (using PCollection#sort), the sort is performed only per reducer and is not a total sort over all values. This means that the values are not in sorted order when a materialized collection is iterated over. It also means that the sorted files output by a sort operation cannot simply be concatenated to produce a single sorted file.
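
The behaviour described in the issue can be reproduced without a cluster. The following is only a sketch in plain Java, not Crunch or Hadoop code: the class and method names are made up for illustration, and hash-based assignment of values to reducers is an assumption about the default partitioning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PerReducerSortDemo {

    // Assigns each value to a reducer by hash code (a stand-in for a hash
    // partitioner), sorts every reducer's values independently, and then
    // concatenates the reducer outputs in reducer order.
    static List<String> sortPerReducer(List<String> values, int numReducers) {
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (String value : values) {
            int reducer = (value.hashCode() & Integer.MAX_VALUE) % numReducers;
            reducers.get(reducer).add(value);
        }
        List<String> concatenated = new ArrayList<>();
        for (List<String> reducerOutput : reducers) {
            Collections.sort(reducerOutput); // each output is sorted on its own
            concatenated.addAll(reducerOutput);
        }
        return concatenated;
    }

    // True if the list is in non-decreasing order.
    static boolean isSorted(List<String> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i - 1).compareTo(values.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("d", "a", "c", "b");
        System.out.println(sortPerReducer(words, 1)); // prints [a, b, c, d]
        System.out.println(sortPerReducer(words, 2)); // prints [b, d, a, c]
    }
}
```

With a single reducer the concatenated output is totally ordered; with two reducers each part is sorted but the concatenation is not.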

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436546#comment-13436546 ] 

Rahul Sharma commented on CRUNCH-23:
------------------------------------

Yes, the patch is the full cumulative patch. It works on SequenceFiles only. Ultimately, the TotalOrderPartitioner requires a SequenceFile with sorted keys and null values. We need keys for this file, for which we need to materialize an intermediate collection, and I think that when we do so Crunch creates intermediate output files that are always SequenceFiles. I am not completely sure about this, but so far I haven't seen a case where it creates a different type of intermediate output file. Also, I haven't tested it with Avro files.
                

[jira] [Assigned] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Sharma reassigned CRUNCH-23:
----------------------------------

    Assignee: Rahul Sharma
    

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440956#comment-13440956 ] 

Rahul Sharma commented on CRUNCH-23:
------------------------------------

TotalOrderPartitioner in its current form is not usable with Avro; MAPREDUCE-4574 states the same. We will need to reimplement the TotalOrderPartitioner if we want to use it.

But on second thought, do we want this to work with Avro data? In Avro the sort order is imposed by the schema. So if the user specifies some order in the schema, then Avro will make sure it loads all data using that order. If none is specified, Avro uses ascending order by default on each of the fields of the record. It feels like Avro data is sorted out of the box.
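
To make the point about schema-imposed ordering concrete, here is a rough sketch of comparing records field by field with a per-field order, where ascending is the default and a field can also be descending or ignored. It deliberately does not use the Avro library; FieldOrderSketch, the Order enum and the use of integer lists as records are all stand-ins invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FieldOrderSketch {

    // Per-field ordering choices, modelled on the idea of a schema-declared order.
    enum Order { ASCENDING, DESCENDING, IGNORE }

    // Compares two records (lists of field values) field by field in
    // declaration order, applying the order declared for each field.
    static Comparator<List<Integer>> comparator(List<Order> fieldOrders) {
        return (left, right) -> {
            for (int i = 0; i < fieldOrders.size(); i++) {
                Order order = fieldOrders.get(i);
                if (order == Order.IGNORE) {
                    continue; // ignored fields never influence the result
                }
                int cmp = left.get(i).compareTo(right.get(i));
                if (cmp != 0) {
                    return order == Order.ASCENDING ? cmp : -cmp;
                }
            }
            return 0;
        };
    }

    public static void main(String[] args) {
        List<List<Integer>> records = new ArrayList<>();
        records.add(Arrays.asList(1, 2));
        records.add(Arrays.asList(1, 5));
        records.add(Arrays.asList(0, 9));
        records.sort(comparator(Arrays.asList(Order.ASCENDING, Order.DESCENDING)));
        System.out.println(records); // prints [[0, 9], [1, 5], [1, 2]]
    }
}
```

A comparator like this only defines how two records compare. Whether the records across several reducer outputs end up in that order still depends on how keys are partitioned, which is the subject of this issue.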
                

[jira] [Updated] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Gabriel Reid (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriel Reid updated CRUNCH-23:
-------------------------------

    Attachment: SortTest.java

Example to demonstrate the issue. Note that this issue will not be visible when run in a local jobrunner, as the local jobrunner only uses a single reducer.
                

[jira] [Updated] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Sharma updated CRUNCH-23:
-------------------------------

    Attachment: CRUNCH-23-sorting-issue.patch

Changes have been made to the Pipeline and PCollection APIs to incorporate the fix.
Some test cases have been written for the partition file, but I cannot write a test for the Sort API; it has to be tested on a cluster.
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Gabriel Reid (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440615#comment-13440615 ] 

Gabriel Reid commented on CRUNCH-23:
------------------------------------

Hmm, no definite suggestions for now. I'll have to take a closer look at it, but that won't be for a couple of days unfortunately.
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Gabriel Reid (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436211#comment-13436211 ] 

Gabriel Reid commented on CRUNCH-23:
------------------------------------

I was just going to take a look into this as well -- and I've got a couple of questions. Is the patch CRUNCH-23-sorting-issue.patch the full cumulative patch? Also, I just took a quick look at it, and it appears that it might be reliant on using SequenceFiles (and therefore it wouldn't work with Avro) -- any idea if this is the case?
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Josh Wills (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428185#comment-13428185 ] 

Josh Wills commented on CRUNCH-23:
----------------------------------

Thanks man, I'll take a look at it this weekend. I think the reservoir sampling approach should work-- either a) the data isn't very large, and so the quality of the sample won't matter that much, or b) the data is large and the sample will be good w/high probability.
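
For readers unfamiliar with the technique, reservoir sampling can be sketched as below. This is a generic textbook version (often called Algorithm R) in plain Java, not the code from the attached patch; the class name and signature are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirSampler {

    // Keeps a uniform random sample of at most sampleSize items from a
    // stream whose length is not known in advance, in a single pass.
    static <T> List<T> sample(Iterable<T> items, int sampleSize, Random random) {
        List<T> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < sampleSize) {
                reservoir.add(item); // fill the reservoir first
            } else {
                // Pick a slot in [0, seen); the item replaces an existing
                // entry only if the slot falls inside the reservoir, which
                // happens with probability sampleSize / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i);
        }
        // A fixed seed keeps the run repeatable; the seed value is arbitrary.
        System.out.println(sample(keys, 10, new Random(42)));
    }
}
```

If the input has fewer items than the requested sample size, the whole input is returned, which matches case a) above: for small data the sample quality hardly matters.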
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428993#comment-13428993 ] 

Rahul Sharma commented on CRUNCH-23:
------------------------------------

Josh, I don't know how you can test this in the build. I tested the solution by running the attached SortTest on a cluster.

The solution would fail a job if there is a single node and we try to run it with multiple reducers (determined by looking at bytes.per.reduce.task). In such a case it throws an exception saying:

Caused by: java.io.IOException: Wrong number of partitions in keyset
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:82)
    ... 6 more 

Also, there is code that we can move into MRPipeline, such as the creation of the sample. I can make some improvements on that side.
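
As background to the exception above: to the best of my understanding, TotalOrderPartitioner expects the partition file to hold exactly one fewer key than the number of reduce tasks, so a file written for one reducer count fails when the job runs with another. The sketch below paraphrases that condition in plain Java; it is not Hadoop's source, and PartitionCountCheck and its methods are hypothetical names.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class PartitionCountCheck {

    // N reduce tasks need exactly N - 1 split points to divide the key space.
    static boolean isConsistent(int numSplitPoints, int numReduceTasks) {
        return numSplitPoints == numReduceTasks - 1;
    }

    // Hypothetical stand-in for the validation, reusing the message from the trace.
    static void check(List<String> splitPoints, int numReduceTasks) throws IOException {
        if (!isConsistent(splitPoints.size(), numReduceTasks)) {
            throw new IOException("Wrong number of partitions in keyset");
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> splitPoints = Arrays.asList("g", "n", "t"); // written for 4 reducers
        check(splitPoints, 4); // passes: 3 split points, 4 reducers
        try {
            check(splitPoints, 2); // the job actually runs with 2 reducers
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints Wrong number of partitions in keyset
        }
    }
}
```

One way to avoid the mismatch is to fix the number of reducers before the partition file is written and to reuse the same number when configuring the job.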
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Josh Wills (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425820#comment-13425820 ] 

Josh Wills commented on CRUNCH-23:
----------------------------------

Hey Rahul-- take another look at the link I posted, it describes a strategy for sampling from the input keys and balancing the distribution in the reducer.
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430959#comment-13430959 ] 

Rahul Sharma commented on CRUNCH-23:
------------------------------------

The patch has been tested using the attached SortTest. The solution works fine, but I feel that with the words.txt file the distribution is not perfect: it pushes a relatively larger amount of data to the first node.
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Josh Wills (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428965#comment-13428965 ] 

Josh Wills commented on CRUNCH-23:
----------------------------------

Hey Rahul, two thoughts:

1) Do you have a testing strategy in mind to verify that this really works?
2) Is there a way we can move the MRPipeline-specific logic into the MRPipeline class and have the o.a.crunch.lib.Sort stuff do all of the work via calls to methods on the Pipeline interface?
                

[jira] [Comment Edited] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424718#comment-13424718 ] 

Rahul Sharma edited comment on CRUNCH-23 at 7/30/12 6:38 AM:
-------------------------------------------------------------

This is a first-cut solution to this issue, but it suffers from a drawback: the keys in the partition file are not evenly distributed. In the worst case, i.e. if the file is already sorted like words.txt, most of the work is done by the last reducer. Is there a way of improving this?

Also, I don't know whether the same problem exists in the other sorts of PTable/Pairs etc. I could not create a test case for those; all the tests eventually ran on the PCollection sort API.
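
The skew described above can be illustrated with a small simulation. It assumes, purely for illustration, that the split points end up clustered at the start of the key range (for example because they were taken from the first few keys of an already sorted input); the thread does not say how the first patch picked them. The names below are made up, and integer keys stand in for the real key type.

```java
import java.util.Arrays;

class PartitionSkewDemo {

    // Returns the partition for a key given sorted split points: keys below
    // splitPoints[0] go to partition 0, and keys at or above the last split
    // point go to the last partition.
    static int partitionFor(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Counts how many keys land in each partition.
    static int[] partitionSizes(int[] keys, int[] splitPoints) {
        int[] sizes = new int[splitPoints.length + 1];
        for (int key : keys) {
            sizes[partitionFor(key, splitPoints)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] keys = new int[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i; // already sorted input
        }
        // Split points clustered at the start: the last reducer gets almost everything.
        System.out.println(Arrays.toString(partitionSizes(keys, new int[] {1, 2, 3})));
        // prints [1, 1, 1, 997]
        // Split points spread over the key range: the load is balanced.
        System.out.println(Arrays.toString(partitionSizes(keys, new int[] {250, 500, 750})));
        // prints [250, 250, 250, 250]
    }
}
```

The output is totally ordered in both cases; only the balance of work across reducers differs, which is why the choice of split points matters.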
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440217#comment-13440217 ] 

Rahul Sharma commented on CRUNCH-23:
------------------------------------

I am able to create sequence files for Avro data, with AvroKey as the key class. When the file is read back in TotalOrderPartitioner, it throws an exception because it expects the key to be of type WritableComparable:

java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast to org.apache.hadoop.io.WritableComparable
	at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:295)
	at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:80)

Any suggestions?
	
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Josh Wills (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419880#comment-13419880 ] 

Josh Wills commented on CRUNCH-23:
----------------------------------

We're going to need something along the lines of this:

http://chasebradford.wordpress.com/2010/12/12/reusable-total-order-sorting-in-hadoop/
                

[jira] [Commented] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440170#comment-13440170 ] 

Rahul Sharma commented on CRUNCH-23:
------------------------------------

Gabriel, I checked the patch with Avro files. It does not work. My bad, I should have verified it earlier. Also, while fixing it I am getting stuck at one point: in the end the TotalOrderPartitioner requires a SequenceFile. How can I make one using the keys from Avro data? I am still trying out a few options, e.g. configuring AvroSequenceFileOutputFormat.

                

[jira] [Updated] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Sharma updated CRUNCH-23:
-------------------------------

    Attachment: CRUNCH-23-used-TotalOrderpartioner-for-sorting-keys.patch

This is a first-cut solution to this issue, but it suffers from a drawback: the keys in the partition file are not evenly distributed. In the worst case, i.e. if the file is already sorted, most of the work is done by the last reducer. Is there a way of improving this?

Also, I don't know whether the same problem exists in the other sorts of PTable/Pairs etc. I could not create a test case for those; all the tests eventually ran on the PCollection sort API.
                

[jira] [Updated] (CRUNCH-23) PCollection#sort doesn't do a full sort on values

Posted by "Rahul Sharma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Sharma updated CRUNCH-23:
-------------------------------

    Attachment: 0001-CRUNCH-23-fix-sorting.patch

Josh, I have implemented the same solution, but I feel that the data distribution is not perfect there.

The solution is based on reservoir sampling, so the keys that are used in the Partitioner are selected from a subset of the data. It depends solely on how good the sample is.
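
The step from a sample to partition keys might look roughly like the following: sort the sample and take keys at evenly spaced positions, one fewer than the number of partitions. This is a sketch under those assumptions rather than the code in the attached patch, and SplitPointChooser is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SplitPointChooser {

    // Sorts the sample and picks numPartitions - 1 keys at evenly spaced
    // positions, so each partition covers a similar share of the sample.
    static List<String> chooseSplitPoints(List<String> sample, int numPartitions) {
        if (sample.isEmpty() || numPartitions < 1) {
            throw new IllegalArgumentException("need a non-empty sample and at least one partition");
        }
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splitPoints = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            // Position of the i-th quantile boundary within the sorted sample.
            int index = (int) ((long) i * sorted.size() / numPartitions);
            splitPoints.add(sorted.get(index));
        }
        return splitPoints;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("h", "b", "f", "d", "a", "g", "c", "e");
        System.out.println(chooseSplitPoints(sample, 4)); // prints [c, e, g]
    }
}
```

The balance across reducers can only be as good as the sample: if the sample over-represents one region of the key space, the split points crowd there and the remaining reducers receive larger ranges. A sample with many duplicate keys can also yield repeated split points, which this sketch does not guard against.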
                