You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Klaas Bosteels (JIRA)" <ji...@apache.org> on 2009/03/18 17:42:50 UTC

[jira] Created: (HADOOP-5528) Binary partitioner

Binary partitioner
------------------

                 Key: HADOOP-5528
                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
            Reporter: Klaas Bosteels
            Assignee: Klaas Bosteels


It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5528:
----------------------------------

    Status: Open  (was: Patch Available)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695109#action_12695109 ] 

Konstantin Shvachko commented on HADOOP-5528:
---------------------------------------------

+1.
The patch takes care of the warnings, the redundant test.
New deprecated class is transitional.
Same for the findbug warning.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694767#action_12694767 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

It's not clear to me how you fixed the warning in the unit test, Nicholas. The unit test in your patch seems to be identical to the corresponding one in the patch that got committed.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Status: Patch Available  (was: Open)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-5528:
-------------------------------------------

    Attachment: 5528_20090401.patch

5528_20090401.patch: fixed all unchecked warning introduced by the previous patch and removed the duplicated test.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Status: Patch Available  (was: Open)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Status: Patch Available  (was: Open)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689165#action_12689165 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

Any further comments on the latest patch?

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688377#action_12688377 ] 

Hadoop QA commented on HADOOP-5528:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12403429/HADOOP-5528.patch
  against trunk revision 756858.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/119/console

This message is automatically generated.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Attachment: HADOOP-5528.patch

Apparently I didn't read Owen's proposal carefully enough, Chris. I thought he suggested using the same indices as python, instead of similar but slightly different ones. I think I finally understood what you guys meant now, though, and the new patch implements this. My only concern with this is that the slight differentiation from Python might confuse some programmers, but the javadoc explains it clearly so I guess that shouldn't be much of an issue.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5528:
----------------------------------

    Status: Open  (was: Patch Available)

This is looking good.

* Instead of the Indexer classes, this can use {{(index + key.getLength()) % key.getLength()}} for both positive and negative values
* The right index should probably default to \-1
* It's fair to assume that a user specifying an offset into binary data is certain that its length can accommodate it, so the max/min games preventing exceptions are probably too permissive.
* The semantics as written for the unit tests confuse me. For example, given
{noformat}
x = { 1, 2, 3, 4, 5 };
y = { 6, 2, 3, 7, 8 };
{noformat}
The test asserts that the left, right offsets {{1, -2}} and {{1, 3}} refer to the same elements, but this seems incorrect, because then {{3, -1}} would not refer to the last two bytes, but to the same byte. There would be no way to specify "the last n" bytes because the last byte is unaddressable.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Status: Open  (was: Patch Available)

Discovered a small mistake ("{{Math.min(offset, length - 1)}}" should really be "{{Math.min(offset, length)}}"). Will submit a corrected patch soon.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688538#action_12688538 ] 

Hadoop QA commented on HADOOP-5528:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12403429/HADOOP-5528.patch
  against trunk revision 757459.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/121/console

This message is automatically generated.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Status: Patch Available  (was: In Progress)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690152#action_12690152 ] 

Hadoop QA commented on HADOOP-5528:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12403798/HADOOP-5528.patch
  against trunk revision 759321.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/150/console

This message is automatically generated.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689842#action_12689842 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

Chris,

* 3:-1 does indeed not refer to the last two bytes, but
** that's how it works in Python as well:
{code}
>>> l = [1,2,3,4,5]
>>> l[1:-2], l[1:3], l[3:-1]
([2, 3], [2, 3], [4])
{code}
** you can specify "the last n" bytes by setting only the left offset (because {{LastIndexer}} is the default right indexer), which is also how you do it in Python:
{code}
>>> l[-2:]
[4, 5]
{code}
** because of the {{min}} in the {{PosOffsetIndexer}}, you can also just set the right offset to a large enough number to get "the last n" bytes.
* I don't think that -1 should be the default right offset, since that would mean that the last byte is ignored by default.
* It might indeed be possible to use {{(index + key.getLength()) % key.getLength()}} for both negative and positive offsets, but we need a separate indexer to implement the default right index anyway, and I think it makes sense to minimize the required computations by using more specialized indexers.

So, personally, I think that:

* we need the indexer classes (and cannot use -1 as default right index),
* the max/min games are useful (and not merely a way of preventing exceptions),
* the semantics are correct,

which leaves me with nothing to change in the latest patch *smile* Can you agree with this, or is there still something you want me to change nevertheless?

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-5528:
-------------------------------------------

    Hadoop Flags:   (was: [Reviewed])
          Status: Patch Available  (was: Reopened)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-5528:
----------------------------------

    Status: Open  (was: Patch Available)

I think this would be much clearer if the attribute names were:

mapred.binary.partitioner.offset
mapred.binary.partitioner.length

especially if you used max(key.length, mapred.binary.partitioner.length) so that you could set it to a large value and use the entire key.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695450#action_12695450 ] 

Hudson commented on HADOOP-5528:
--------------------------------

Integrated in Hadoop-trunk #796 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/796/])
    . Add a configurable hash partitioner operating on ranges of BinaryComparable keys. Contributed by Klaas Bosteels.


> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688268#action_12688268 ] 

Klaas Bosteels edited comment on HADOOP-5528 at 3/27/09 4:01 AM:
-----------------------------------------------------------------

The revised patch allows the subarray to be defined by means of Python-style offsets:

* {{mapred.binary.partitioner.left.offset}}: left Python-style offset in array
* {{mapred.binary.partitioner.right.offset}}: right Python-style offset in array

The best way to remember how these offsets work is by thinking of them as indices pointing between the array elements, with the left edge of the first element numbered 0, e.g.:

{code}
. +---+---+---+---+---+
  | B | B | B | B | B |
  +---+---+---+---+---+
  0   1   2   3   4   5
 -5  -4  -3  -2  -1
{code}

 The first row of numbers gives the position of the offsets 0...5 in  the array; the second row gives the corresponding negative offsets. When _i_ and _j_ are specified as left and right offset, respectively, then all bytes between the edges labeled _i_ and _j_ are taken into account for the partitioning.
 
More generally, the indexing logic can now be customized by specifying the {{BinaryPartitioner.Indexer}} classes to be used via the following properties:

* {{mapred.binary.partitioner.left.indexer.class}}
* {{mapred.binary.partitioner.right.indexer.class}}

By default, {{FirstIndexer}} and {{LastIndexer}} are used (i.e. the whole byte array is taken into account for the hashing), and the offset properties trigger the usage of {{PosOffsetIndexer}} and/or {{NegOffsetIndexer}}, which implement the indexing by means of Python-style offsets.

      was (Author: klbostee):
    The revised patch allows the subarray to be defined by means of Python-style offsets:

* {{mapred.binary.partitioner.left.offset}}: left Python-style offset in array
* {{mapred.binary.partitioner.right.offset}}: right Python-style offset in array

As indicated by Owen, the best way to remember how these offsets work is by thinking of them as indices pointing between the array elements, with the left edge of the first element numbered 0, e.g.:

{code}
. +---+---+---+---+---+
  | B | B | B | B | B |
  +---+---+---+---+---+
  0   1   2   3   4   5
 -5  -4  -3  -2  -1
{code}

 The first row of numbers gives the position of the offsets 0...5 in  the array; the second row gives the corresponding negative offsets. When _i_ and _j_ are specified as left and right offset, respectively, then all bytes between the edges labeled _i_ and _j_ are taken into account for the partitioning.
 
More generally, the indexing logic can now be customized by specifying the {{BinaryPartitioner.Indexer}} classes to be used via the following properties:

* {{mapred.binary.partitioner.left.indexer.class}}
* {{mapred.binary.partitioner.right.indexer.class}}

By default, {{FirstIndexer}} and {{LastIndexer}} are used (i.e. the whole byte array is taken into account for the hashing), and the offset properties trigger the usage of {{PosOffsetIndexer}} and/or {{NegOffsetIndexer}}, which implement the indexing by means of Python-style offsets.
  
> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Attachment: HADOOP-5528.patch

The attached patch adds both {{org.apache.hadoop.mapred.lib.BinaryPartitioner}} and {{org.apache.hadoop.mapreduce.lib.partition.BinaryPartitioner}}, which both take into account the following configuration properties:
* {{mapred.binary.partitioner.left.offset}}: left offset in the keys' byte arrays (0 by default)
* {{mapred.binary.partitioner.right.offset}}: right offset in the keys' byte array (0 by default)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Attachment: HADOOP-5528.patch

The revised patch allows the subarray to be defined by means of Python-style offsets:

* {{mapred.binary.partitioner.left.offset}}: left Python-style offset in array
* {{mapred.binary.partitioner.right.offset}}: right Python-style offset in array

As indicated by Owen, the best way to remember how these offsets work is by thinking of them as indices pointing between the array elements, with the left edge of the first element numbered 0, e.g.:

{code}
. +---+---+---+---+---+
  | B | B | B | B | B |
  +---+---+---+---+---+
  0   1   2   3   4   5
 -5  -4  -3  -2  -1
{code}

 The first row of numbers gives the position of the offsets 0...5 in  the array; the second row gives the corresponding negative offsets. When _i_ and _j_ are specified as left and right offset, respectively, then all bytes between the edges labeled _i_ and _j_ are taken into account for the partitioning.
 
More generally, the indexing logic can now be customized by specifying the {{BinaryPartitioner.Indexer}} classes to be used via the following properties:

* {{mapred.binary.partitioner.left.indexer.class}}
* {{mapred.binary.partitioner.right.indexer.class}}

By default, {{FirstIndexer}} and {{LastIndexer}} are used (i.e. the whole byte array is taken into account for the hashing), and the offset properties trigger the usage of {{PosOffsetIndexer}} and/or {{NegOffsetIndexer}}, which implement the indexing by means of Python-style offsets.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689875#action_12689875 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

Husdon seems to be rather busy, so maybe it's better to just copy my local test-patch output here (for now):

{noformat}
[exec] -1 overall.  
[exec] 
[exec]     +1 @author.  The patch does not contain any @author tags.
[exec] 
[exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
[exec] 
[exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
[exec] 
[exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
[exec] 
[exec]     -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.
[exec] 
[exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec] 
[exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{noformat}

(The findbugs warning is about the fact that the name of the deprecated class in mapred shadows the simple name of the class it extends.)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Work started: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HADOOP-5528 started by Klaas Bosteels.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694924#action_12694924 ] 

Hadoop QA commented on HADOOP-5528:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12404384/5528_20090401.patch
  against trunk revision 761082.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/96/console

This message is automatically generated.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (HADOOP-5528) Binary partitioner

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko reopened HADOOP-5528:
-----------------------------------------


I had to revert the patch.
It needs to deal with deprecated classes, code replication and 32 javac warnings. See discussion in the linked issue.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-5528:
----------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this.
Thank you Klaas, Nicholas and Chris for working on this.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689849#action_12689849 ] 

Chris Douglas commented on HADOOP-5528:
---------------------------------------

Klaas,

Thanks for clarifying the python syntax; I hadn't realized that an absent left or right offset would use the beginning/end of the list. It's what I get for not reading the javadoc or that section carefully enough. I was expecting python-like syntax, per Owen's original [comment|https://issues.apache.org/jira/browse/HADOOP-5528?focusedCommentId=12684013&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12684013].

The one point unaddressed is whether the partitioner should honor the indices when they exceed the key length. On this, I continue to disagree with the current patch. Particularly when dealing with raw bytes, it is better to error out and require user intervention than to silently proceed. Too often, this will lead to unexpected and undetected results.

The class hierarchy still seems unnecessary to me. Since it's clear that what it effects can be mapped to the aforementioned expression, adjusting the user-provided indices to match the python semantics within that mapping takes less code, is more readily understandable, and is more efficient.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693555#action_12693555 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

The failed tests are unrelated to the patch, and I think the current semantics are fine, so feel free to commit, Chris.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694928#action_12694928 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

The findbugs warning is unavoidable (see also above), so, as far as I'm concerned, this new patch can be committed. Thanks for your help, Nicholas.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Status: Patch Available  (was: Open)

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690132#action_12690132 ] 

Chris Douglas commented on HADOOP-5528:
---------------------------------------

FWIW, I like the current semantics. If you're OK with them, I'll commit this patch.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683930#action_12683930 ] 

Owen O'Malley commented on HADOOP-5528:
---------------------------------------

I don't see the issue. How is it different specifying:

offset = 1
length = 5

rather than

left = 1
right = 6

Am I missing something?

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Attachment: HADOOP-5528.patch

Thanks Chris. The new patch addresses all of the issues you raised:

* The deprecated class now extends the one from mapreduce.
* Static convenience methods for setting the offsets were added.
* The indexer classes are private now (and hence also not configurable anymore).

Since the offset properties still might need to be set manually when this partitioner is used by a non-Java Hadoop program (e.g. a Streaming program), they are still documented in the javadoc for the class, but I added a paragraph that encourages Java programmers to use the static convenience methods instead.

Findbugs will probably complain about the the fact that the name of the deprecated class in mapred shadows the simple name of the class it extends, by the way.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683766#action_12683766 ] 

Hadoop QA commented on HADOOP-5528:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12402567/HADOOP-5528.patch
  against trunk revision 756256.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/111/console

This message is automatically generated.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689260#action_12689260 ] 

Chris Douglas commented on HADOOP-5528:
---------------------------------------

* It would make sense for the deprecated partitioner class to extend the partitioner from mapreduce, per the idiom used in HADOOP-1230
* Code using BinaryPartitioner might be more readable if it had a static method for setting the left and right indices in the config. There are other examples in lib
* The indexer class hierarchy confuses me. Instead of making the semantics configurable, defining and describing them in the aforementioned static method would be much easier. I'd also be a fan of using the python semantics, all the time.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5528:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I committed this. Thanks, Klaas

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694770#action_12694770 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-5528:
------------------------------------------------

It needs <?> in the test.
{code}
+    BinaryPartitioner<?> partitioner = 
+      ReflectionUtils.newInstance(BinaryPartitioner.class, conf);
{code}

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Tim Sell (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Sell updated HADOOP-5528:
-----------------------------

    Attachment: HADOOP-5528-0.18.patch

We needed this for 0.18, so I back ported it. Attached patch includes HADOOP-4151 tweaked for 0.18. All tests passed.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528-0.18.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683986#action_12683986 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

The meaning is different yes. Specifying _left = x_ means that the _x_ leftmost bytes will not be taken into account and, simlarily, specifying _right = y_ means that the _y_ rightmost bytes will not be taken into account. So the right offset is not relative to the first byte of the array, but to the last one.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694779#action_12694779 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

Ah right, missed that (I was looking for @SuppressWarnings annotations). I'll close HADOOP-5604 then, since this patch fixes it.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>             Fix For: 0.21.0
>
>         Attachments: 5528_20090401.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683921#action_12683921 ] 

Klaas Bosteels commented on HADOOP-5528:
----------------------------------------

Well, I basically created this issue because I wanted the hashing to ignore the last _n_ bytes from variable length keys, so specifying the offset and length wouldn't really work for me. Maybe we could support both and only take into account the length when both a right offset and a length are specified?

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12684013#action_12684013 ] 

Owen O'Malley commented on HADOOP-5528:
---------------------------------------

I understand now. right is defined as bytes from the right to ignore. Unfortunately that means you *can't* use it to pick up bytes 4-8 if you don't know the length. How about using python style offsets where negative numbers means count from the right. That will allow a lot more flexibility.

0 = start of bytes
1 = after first byte
-2 = before last byte
-1 = end of bytes

So left=4, right=8 would use bytes 4-8 from the right.
Left=-5, right=-1 would use the last 4 bytes.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5528) Binary partitioner

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690000#action_12690000 ] 

Chris Douglas commented on HADOOP-5528:
---------------------------------------

Please don't hesitate to pick the semantics. After all, this is your patch. If you prefer python semantics to "python-like," then by all means: go that route. I mentioned the original comment only to explain where I was coming from, as part of our disagreement stemmed from a mismatch in our understanding. I have no opinion on which are "better", as both are equivalently expressive.

I'm +1 on the current patch, and would not object if the indices were adjusted to fit the former semantics.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5528) Binary partitioner

Posted by "Klaas Bosteels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-5528:
-----------------------------------

    Attachment: HADOOP-5528.patch

Corrected small mistake and extended unit tests accordingly.

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions {{BinaryComparable}} keys by hashing a configurable part of the bytes array corresponding to each key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.