Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/09/16 00:39:44 UTC

[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Attachment: 3019-0.patch

Adapted TotalOrderPartitioner and input sampler from HADOOP-3402, with the following changes:
* Adds two other kinds of samplers
* Uses the trie for memcmp-able types (Text, BytesWritable); other types fall back to a binary search over the partition keyset
* Adds a unit test for the partitioner
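
To illustrate the binary-search path mentioned above, here is a minimal sketch. It is not the code in 3019-0.patch: the class name SplitPointPartitioner and the use of String keys are assumptions made for the example, and the trie path for memcmp-able types is not shown. Given the sorted split points that form the partition keyset, it maps a key to a partition index so that partition numbers follow key order.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and key type are assumptions,
// not taken from the attached patch.
final class SplitPointPartitioner {
    // Sorted ascending; length is (number of partitions - 1).
    private final String[] splitPoints;

    SplitPointPartitioner(String[] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints.clone();
    }

    // Returns a partition index in [0, splitPoints.length].
    int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found at index i: the key equals split point i and goes to partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the partition index.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With split points {"g", "p"}, keys below "g" land in partition 0, keys from "g" up to but excluding "p" in partition 1, and everything else in partition 2.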

> want input sampler & sorted partitioner
> ---------------------------------------
>
>                 Key: HADOOP-3019
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3019
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>         Attachments: 3019-0.patch
>
>
> The input sampler should generate a small, random sample of the input, saved to a file.
> The partitioner should read the sample file and partition keys into relatively even-sized key-ranges, where the partition numbers correspond to key order.
> Note that when the sampler is used for partitioning, the number of samples required is proportional to the number of reduce partitions.  A sample of 10x the intended reducer count should give good results.
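
The description above can be sketched as follows. This is an illustration under stated assumptions, not the sampler in the attached patch: the class SampleSketch, its method names, and the String key type are all hypothetical, and reservoir sampling is just one way to draw a uniform random sample. The sketch draws a fixed-size sample, then picks evenly spaced keys from the sorted sample as partition boundaries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the API of the attached patch.
final class SampleSketch {
    // Suggested sample size from the issue text: 10x the reducer count.
    static int suggestedSampleSize(int numReducers) {
        return 10 * numReducers;
    }

    // Reservoir sampling: keeps a uniform random sample of up to n keys.
    static List<String> sample(Iterable<String> keys, int n, Random rng) {
        List<String> reservoir = new ArrayList<>(n);
        int seen = 0;
        for (String key : keys) {
            if (seen < n) {
                reservoir.add(key);
            } else {
                int j = rng.nextInt(seen + 1);
                if (j < n) {
                    reservoir.set(j, key);
                }
            }
            seen++;
        }
        return reservoir;
    }

    // Picks (numPartitions - 1) evenly spaced keys from the sorted sample.
    // Duplicate split points are not handled in this sketch.
    static List<String> chooseSplitPoints(List<String> sample, int numPartitions) {
        if (numPartitions < 1 || sample.size() < numPartitions) {
            throw new IllegalArgumentException("sample too small for partition count");
        }
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>(numPartitions - 1);
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            splits.add(sorted.get(idx));
        }
        return splits;
    }
}
```

In a real job the split points would be written to the sample file that the partitioner reads; that step is omitted here.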
