Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2008/03/14 17:24:25 UTC

[jira] Created: (HADOOP-3019) want input sampler & sorted partitioner

want input sampler & sorted partitioner
---------------------------------------

                 Key: HADOOP-3019
                 URL: https://issues.apache.org/jira/browse/HADOOP-3019
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
            Reporter: Doug Cutting


The input sampler should generate a small, random sample of the input, saved to a file.

The partitioner should read the sample file and partition keys into relatively even-sized key-ranges, where the partition numbers correspond to key order.

Note that when the sampler is used for partitioning, the number of samples required is proportional to the number of reduce partitions.  10x the intended reducer count should give good results.
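
As a rough illustration of the idea above (not the eventual patch), the sketch below sorts an in-memory sample and takes evenly spaced keys as range boundaries. The class and method names are hypothetical, String keys stand in for real key types, and the factor of 10 is the rule of thumb from the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: picks reducer key-range boundaries from a small key sample. */
final class SplitPointChooser {

    /** Rule of thumb from the issue text: about 10 samples per reduce partition. */
    static int suggestedSampleSize(int numPartitions) {
        return 10 * numPartitions;
    }

    /**
     * Sorts a copy of the sample and returns numPartitions - 1 boundary keys
     * taken at evenly spaced positions, so that each key range covers roughly
     * the same number of sampled keys.
     */
    static List<String> choose(List<String> sample, int numPartitions) {
        if (numPartitions < 1 || sample.size() < numPartitions) {
            throw new IllegalArgumentException("need at least one sample per partition");
        }
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splitPoints = new ArrayList<>();
        double step = sorted.size() / (double) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            // step * i is at most size - 1 because size >= numPartitions
            splitPoints.add(sorted.get((int) Math.round(step * i)));
        }
        return splitPoints;
    }
}
```

With 3 partitions and the six sampled keys d, a, c, b, f, e, this returns the boundaries c and e.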

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Fix Version/s: 0.19.0
         Assignee: Chris Douglas
           Status: Patch Available  (was: Open)

This won't compile until HADOOP-4151 is in, but marking it PA for review.



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3019:
------------------------------------

    Release Note: Added a partitioner that effects a total order of output data, and an input sampler for generating the partition keyset for TotalOrderPartitioner for when the map's input keytype and distribution approximates its output.  (was: Adds a partitioner capable of effecting a total order of output data. Also includes an input sampler for generating the partition keyset for TotalOrderPartitioner, useful where the map's input keytype and distribution approximates its output.)
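
To make "effects a total order" concrete, here is a minimal sketch of range partitioning by binary search over a sorted set of boundary keys. It is an illustration under assumptions, not the committed TotalOrderPartitioner: the class name is hypothetical, keys are plain Strings, and the boundaries are assumed to be strictly ascending.

```java
import java.util.Arrays;

/** Hypothetical range partitioner over String keys. */
final class RangePartitioner {
    // Sorted ascending; there is one fewer boundary than there are partitions.
    private final String[] splitPoints;

    RangePartitioner(String[] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints.clone();
    }

    /**
     * Returns a partition number in [0, splitPoints.length]. Partition numbers
     * follow key order; a key equal to a boundary goes to the partition on
     * that boundary's right.
     */
    int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // For a missing key, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

With boundaries c and e, key a goes to partition 0, d to partition 1, and z to partition 2, so if every reducer sorts its own keys, concatenating the reducer outputs in partition order gives globally sorted output.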



[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579347#action_12579347 ] 

Enis Soztutar commented on HADOOP-3019:
---------------------------------------

The sampler can be written easily once Filters are in (HADOOP-449). I intend to come up with a patch today.



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Attachment: 3019-1.patch

Updated patch to refer to BinaryComparable instead of MemComparable and moved the change to bin/hadoop to the correct patch.



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Attachment: 3019-5.patch

More updates on Owen's feedback:
* RandomSampler includes the selected element when selecting
* Validate ordering of partition file when configuring TotalOrderPartitioner
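
A check of the kind described in the second bullet could look like the sketch below. The class name, method name, and exception type are hypothetical, and rejecting duplicate boundaries (rather than only descending ones) is an assumption of this sketch.

```java
import java.util.Comparator;

/** Hypothetical validation helper for a loaded partition keyset. */
final class SplitPointValidator {

    /** Throws if the boundary keys are not strictly ascending under the comparator. */
    static <K> void requireAscending(K[] splitPoints, Comparator<? super K> comparator) {
        for (int i = 1; i < splitPoints.length; i++) {
            if (comparator.compare(splitPoints[i - 1], splitPoints[i]) >= 0) {
                throw new IllegalArgumentException(
                    "Split points are out of order at index " + i);
            }
        }
    }
}
```

Failing fast at configuration time is cheaper than discovering a mis-sorted partition file after the job has already routed keys to the wrong reducers.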



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Status: Patch Available  (was: Open)

Submitting last patch to make 0.19.



[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631664#action_12631664 ] 

Runping Qi commented on HADOOP-3019:
------------------------------------


Sorry for jumping in late.

Since the sample points are kept in an array and sorted in memory, the sample size is severely limited.
Why not consider using a MapReduce job to generate the sampling points and the partition file?





[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578899#action_12578899 ] 

Amar Kamat commented on HADOOP-3019:
------------------------------------

Should this be a part of examples like sort? Users can use it the way they use other examples.



[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633309#action_12633309 ] 

Hudson commented on HADOOP-3019:
--------------------------------

Integrated in Hadoop-trunk #611 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/])



[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631175#action_12631175 ] 

Chris Douglas commented on HADOOP-3019:
---------------------------------------

Results of test-patch with HADOOP-4151 applied:
{noformat}
     [exec] +1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 18 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
{noformat}



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Attachment: 3019-0.patch

Adapted TotalOrderPartitioner and input sampler from HADOOP-3402, with the following changes:
* Adds two other kinds of samplers
* Made memcmp-able types (Text, BytesWritable) use the trie; other key types do a binary search over the partition keyset
* Adds a unit test for the partitioner
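
To illustrate why byte-comparable keys can skip most of the search, here is a deliberately simplified, one-level sketch: a 257-entry table indexed by the key's first byte narrows the candidate boundaries, and a binary search finishes the lookup inside that narrow range. This is not the trie from the patch, which is not shown in this thread; the class name and layout are hypothetical, and the boundaries are assumed to be sorted in unsigned byte order.

```java
/** Hypothetical one-level "trie": a first-byte index over sorted boundary keys. */
final class FirstByteTrie {
    private final byte[][] splitPoints;            // sorted in unsigned byte order
    private final int[] firstIndex = new int[257]; // firstIndex[b] = boundaries with first byte < b

    FirstByteTrie(byte[][] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints;
        int next = 0;
        for (int b = 0; b <= 256; b++) {
            while (next < splitPoints.length && firstByte(splitPoints[next]) < b) {
                next++;
            }
            firstIndex[b] = next;
        }
    }

    /** Returns the number of boundaries that are less than or equal to the key. */
    int getPartition(byte[] key) {
        int b = firstByte(key);
        int lo = firstIndex[b];     // boundaries before lo start with a smaller byte
        int hi = firstIndex[b + 1]; // boundaries from hi on start with a larger byte
        while (lo < hi) {           // binary search among boundaries sharing the first byte
            int mid = (lo + hi) >>> 1;
            if (compareUnsigned(splitPoints[mid], key) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // An empty key is bucketed with first byte 0; the full comparison still orders it first.
    private static int firstByte(byte[] bytes) {
        return bytes.length == 0 ? 0 : (bytes[0] & 0xff);
    }

    private static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }
}
```

Key types that can only be compared through a comparator have no byte prefix to index on, which is consistent with them falling back to a plain binary search over the whole keyset.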



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Attachment: 3019-4.patch

Fixed a javadoc warning (ant javadoc target didn't know about tools)



[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631568#action_12631568 ] 

Hadoop QA commented on HADOOP-3019:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12390146/3019-1.patch
  against trunk revision 696002.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3276/console

This message is automatically generated.



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Attachment: 3019-2.patch

Changed random sampling to be less dominated by keys in the latter part of each sampled split.
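
For readers unfamiliar with the bias being described: a fixed-size sample that always replaces an old entry when a new key arrives ends up dominated by late keys. One standard remedy, shown here as a sketch and not necessarily what 3019-2.patch does, is reservoir sampling (Algorithm R), where the n-th key is kept with probability numSamples / n. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical fixed-size uniform sampler (reservoir sampling, Algorithm R). */
final class ReservoirSampler {

    /** Returns up to numSamples keys, each input key being equally likely to be kept. */
    static <K> List<K> sample(Iterator<K> keys, int numSamples, Random rng) {
        List<K> reservoir = new ArrayList<>(numSamples);
        long seen = 0;
        while (keys.hasNext()) {
            K key = keys.next();
            seen++;
            if (reservoir.size() < numSamples) {
                reservoir.add(key); // fill phase: keep the first numSamples keys
            } else {
                // Draw a slot in [0, seen); the key survives only if the slot
                // falls inside the reservoir, i.e. with probability numSamples / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < numSamples) {
                    reservoir.set((int) slot, key);
                }
            }
        }
        return reservoir;
    }
}
```

Because the acceptance probability shrinks as more keys are seen, keys near the end of a split are no more likely to remain in the sample than keys near the start.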



[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632871#action_12632871 ] 

Chris Douglas commented on HADOOP-3019:
---------------------------------------

test-patch results for 3019-4:
{noformat}
     [exec] +1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 5 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
{noformat}



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Status: Open  (was: Patch Available)

bq. Since the sample points are kept in an array and sorted in memory, the sample size is severely limited. Why not consider using a MapReduce job to generate the sampling points and the partition file?

The client-side sampler is limited, no question, but it usually only takes a few seconds to run (unlike a distributed job), generates decent results, and can be easily rolled into the user's driver. The distributed sampler (planned, writing it) can be more accurate, but will take longer. The client-side sampler also needs to use the map class, so the sampling is on the map output keytype and distribution rather than the input.

The latter requires that most of the InputSampler be rewritten to use MapRunnable, so I'm cancelling this for now.



[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Attachment: 3019-3.patch

Updated based on Owen's feedback:
* Changed RandomSampler to sample evenly across all splits, rather than evenly from each split
* Used double instead of float for sampling rate

This patch also modifies the sort example to demonstrate how to use InputSampler in a job. This requires examples to depend on tools.
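
One reading of "evenly across all splits" is that every record gets the same chance of selection regardless of which split holds it, so a large split contributes proportionally more samples than a small one. The sketch below shows that reading with a single acceptance rate held as a double; it is an assumption-laden illustration with hypothetical names, not the RandomSampler code from the patch, and in-memory lists stand in for input splits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sampler applying one acceptance rate to every record of every split. */
final class CrossSplitSampler {

    /** Keeps each key independently with probability rate, where rate is in [0, 1]. */
    static <K> List<K> sample(List<List<K>> splits, double rate, Random rng) {
        List<K> sample = new ArrayList<>();
        for (List<K> split : splits) {
            for (K key : split) {
                // nextDouble() is in [0, 1), so rate 1.0 keeps everything and 0.0 keeps nothing.
                if (rng.nextDouble() < rate) {
                    sample.add(key);
                }
            }
        }
        return sample;
    }
}
```

By contrast, drawing a fixed quota from each split would over-represent keys from small splits relative to their share of the input.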



[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578840#action_12578840 ] 

Doug Cutting commented on HADOOP-3019:
--------------------------------------

Implementation thoughts:
 - the sampler can be implemented as an inputformat.
 - a generic sampling job class can configure the sampling input format and a single identity reducer.
 - samples should come from random positions in input files, since input files are frequently themselves sorted.
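
The last point can be sketched as follows: rather than reading the first few records of a file (which, for sorted input, would all come from the low end of the key range), pick record positions uniformly at random. The class and method names are hypothetical, and positions are abstract record indexes rather than byte offsets into a real file.

```java
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical helper that chooses where in an input to take samples from. */
final class RandomPositions {

    /**
     * Returns count distinct positions in [0, length), in ascending order.
     * Intended for count much smaller than length; it retries on collisions.
     */
    static long[] pick(long length, int count, Random rng) {
        if (count < 0 || count > length) {
            throw new IllegalArgumentException("count must be between 0 and length");
        }
        TreeSet<Long> chosen = new TreeSet<>();
        while (chosen.size() < count) {
            long position = (long) (rng.nextDouble() * length);
            chosen.add(Math.min(length - 1, position)); // guard against rounding up to length
        }
        long[] positions = new long[count];
        int i = 0;
        for (long position : chosen) {
            positions[i++] = position;
        }
        return positions;
    }
}
```

Returning the positions in ascending order lets a reader visit them in a single forward pass over the file.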






[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-3019:
----------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Chris!




[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Status: Open  (was: Patch Available)




[jira] Commented: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578915#action_12578915 ] 

Doug Cutting commented on HADOOP-3019:
--------------------------------------

> Should this be a part of examples like sort?

I guess it could live with the examples, but I was thinking that this would be more like mapred/lib. The sampler should be generic enough that folks won't have to modify it to find it useful: it should work for different key/value types and for both SequenceFile and text data.




[jira] Updated: (HADOOP-3019) want input sampler & sorted partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3019:
----------------------------------

    Release Note: Adds a partitioner capable of effecting a total order of output data. Also includes an input sampler for generating the partition keyset for TotalOrderPartitioner, useful where the map's input key type and distribution approximate those of its output.

Added a release note.
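
As a rough illustration of how a sampled keyset can drive a total-order partition, the sketch below sorts the samples, keeps evenly spaced keys as split points, and binary-searches each key against them. It is a simplified stand-in written for this thread, not the TotalOrderPartitioner code. The class name, constructor, and in-memory sample list are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of range partitioning from sampled keys; not the committed implementation.
final class RangePartitionSketch<K> {
    private final List<K> splitPoints = new ArrayList<>(); // numPartitions - 1 keys, sorted
    private final Comparator<? super K> order;

    RangePartitionSketch(List<K> samples, int numPartitions, Comparator<? super K> order) {
        if (samples.isEmpty() || numPartitions < 1) {
            throw new IllegalArgumentException("need at least one sample and one partition");
        }
        this.order = order;
        List<K> sorted = new ArrayList<>(samples);
        sorted.sort(order);
        // Evenly spaced sample keys become the boundaries between partitions.
        for (int i = 1; i < numPartitions; i++) {
            splitPoints.add(sorted.get((int) ((long) i * sorted.size() / numPartitions)));
        }
    }

    /** Returns an index in [0, numPartitions); larger keys never map to a smaller index. */
    int getPartition(K key) {
        int pos = Collections.binarySearch(splitPoints, key, order);
        // Found at index i: the key equals split point i and goes to partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

Since partition numbers follow key order, concatenating the reducer outputs in partition order yields globally sorted output. The issue description suggests roughly 10x as many samples as reducers for reasonably even ranges.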

