You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Dave Revell (Created) (JIRA)" <ji...@apache.org> on 2011/09/27 00:01:12 UTC

[jira] [Created] (HBASE-4489) Better key splitting in RegionSplitter

Better key splitting in RegionSplitter
--------------------------------------

                 Key: HBASE-4489
                 URL: https://issues.apache.org/jira/browse/HBASE-4489
             Project: HBase
          Issue Type: Improvement
            Reporter: Dave Revell


The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.

A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126806#comment-13126806 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

@Dave: go for it
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116685#comment-13116685 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

I don't object to adding tests. I can have them by next Monday if someone else doesn't write them first.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126270#comment-13126270 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

@Dave: 

I think the main disconnect here how is we envision accomplishing the goal of 'better key splitting'.  I think your patch provides a different way to split.  However, both algorithms provide a splitting algorithm but do not provide an easy way to normalize your row into this keyspace.  We both assume you are instinctively normalizing to begin with.  I was suggesting adding a static normalization function to both algorithms, in addition to the comments you provided.

I still think that keeping the algorithm as ASCII would be better for readability & usage.  
1. Basically, we're telling users that they need to learn the JRuby byte translation API to issue get commands from the shell if they follow the UniformSplit strategy.   I use the shell far more than I create new tables.  
2. It's 8 bits per nibble.  The 2x savings will go away with Delta Encoding & naively using the default UINT128 sha1 would take up more space and be less compressible than an ASCII UINT32 anyways.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116642#comment-13116642 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

bq. There were no tests on the previous code
I think lack of unit test is not a gating item for this JIRA.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell updated HBASE-4489:
-------------------------------

    Affects Version/s: 0.90.4
               Status: Patch Available  (was: Open)
    
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell updated HBASE-4489:
-------------------------------

    Release Note: The split algorithm used by RegionSplitter is now a required parameter. Previously there was one split algorithm called MD5StringSplit, which was the default. MD5StringSplit has been renamed to HexStringSplit, and tweaked so that its maximum key is now "FFFFFFFF" instead of "7FFFFFFF. A new split algorithm UniformSplit has been added which treats keys as arbitrary bytes.
    
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126302#comment-13126302 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

btw: Since you're putting this as an explicit option on the command line, the proper name of that class should probably be HexStringSplit instead of MD5StringSplit.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell updated HBASE-4489:
-------------------------------

    Attachment: HBASE-4489-trunk-v5.patch

Nicolas, agreed. New patch for trunk (v5) attached that incorporates your feedback.

The v3 patch is still the most current patch for 0.90.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Jonathan Hsieh (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116774#comment-13116774 ] 

Jonathan Hsieh commented on HBASE-4489:
---------------------------------------

@Dave

My main suggestion is fixing based on the original author's intent in one patch (fixing the ascii encoded hex 7f problem) and then potentially changing the semantics/default in a different patch.  I believe we agree that the intent of the original author's code looks to be for ascii hex ranges and that the 0x7f max is broken.

In the tables I've encountered, it seems more folks who just use ascii rowkeys than use binary rowkeys.  Using the uniform byte range split keys for ascii character ranges -- would make the new alternate default just a "wrong" for many users.  The shell provides a generic mechanism for generating splits for new tables now (HBASE-4000) so it seems like using that completely generic approach seems more useful given knowledge about your particular row keys.

>From a code skim, it seems that rollingSplits is "smarter" - it take existing row key boundaries and split them at region midpoints.  This is still vulnerable to skewed rows key distributions but at least takes into account the existing rowkey ranges!



                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123577#comment-13123577 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Nicolas, responding to your points one-by-one:

1 and 2. This is a perfectly valid key format (choosing readability over space efficiency). Hashed keys are good. There is no disagreement here, but as a default it is inefficient for people who don't need to print their HBase keys, or are willing to parse their keys before printing. Claims of equal compressibility need evidence (raw bytes vs. ASCII)
3. Why should Java integers or longs have anything to do with an MD5 hash, which is a sequence of 16 bytes? Do we expect clients to truncate their MD5 hashes and make sure the high-order bit is a 0 (as required to be in the range 0-7F)? This is a bizarre default and puts a strange burden on clients, whose MD5 generator is giving them an arbitrary 128-bit array.
4. What unevenness are you referring to? If you're referring to the unevenness that results from using arbitrary keys in a partitioning scheme designed for ASCII, it can be quite bad. The first ASCII character, '0', has the ordinal 48. So the first region would cover the range (empty)..48), which is 48/256 = 18% of the key space, regardless of how many regions there are.

Maybe we should chat on freenode or Monday? I think it would be fast and easy to figure out where we disagree if we were chatting in realtime. Also, thanks for the input.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell updated HBASE-4489:
-------------------------------

    Attachment: HBASE-4489-branch0.90-v3.patch
                HBASE-4489-trunk-v3.patch

v3 patches with suggested fixes.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120382#comment-13120382 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

@Dave:

Some more clarity on MD5StringSplit:

1. In the original RegionSplitter use case, the key is MD5(username) + username.  In the general case, users probably should use an MD5 algorithm to partition their data if there is not access locality between 2 adjacent keys.  This provides proper random dispersion and helps avoid hot row issues.
2. It was kept as ASCII for readability on WebUI and logs.  The original application was IO bound & the data size was ~1000 bytes/entry, so key size saving was not a huge issue.  Additionally, key compression (HBASE-4218) will be available shortly and make byte code optimization pretty negligible compared to readability benefits.
3. The end range is not a bug. Java doesn't natively support uint32, so 7FFF is Max Int for an MD5 hash unless you use int64 in your calculations.  I'm guessing that you're using the Thrift API, but you probably should code for the native interface.
4. The uneven key space can (and probably should) be fixed, but it is not a significant issue with a large number of regions.  For N regions across a range of K, the common region size is floor(K/N) & the skewed region is K - (N-1) * floor(K/N) == floor(K/N) + K % N < 2N.  So the worse case is that one region has double load.  For 10 regions/server, 10% variance.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Jonathan Hsieh (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116951#comment-13116951 ] 

Jonathan Hsieh commented on HBASE-4489:
---------------------------------------

@Ted, sounds good to me.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115057#comment-13115057 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

Ted: I'm not clear what you're suggesting. Do you want to see length checking of the arrays returned from Bytes.split()?
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115052#comment-13115052 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

The bug in MD5StringSplit I mentioned in my earlier comment occurs in the definition of the variable MAXMD5 in RegionSplitter.java.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114973#comment-13114973 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

I looked at patch for TRUNK:
{code}
+      byte[][] splitKeysPlusEndpoints = Bytes.split(firstRowBytes, lastRowBytes,
+              numRegions-1);
{code}
It seems range checking is missing.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128047#comment-13128047 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

+1 on HBASE-4489-trunk-v5.patch
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell updated HBASE-4489:
-------------------------------

    Attachment: HBASE-4489-trunk-v4.patch

New patch for trunk with last night's feedback. 
 - The SplitAlgorithm to use is now a required parameter
 - MD5StringSplit is now called HexStringSplit
 - The ceiling of 7FFFFFFF is now FFFFFFFF, and tests were changed to accomodate this.

No changes are necessary on the 0.90 branch, so the -v3 patch for 0.90 is still current.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell reassigned HBASE-4489:
----------------------------------

    Assignee: Dave Revell
    
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Jonathan Hsieh (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125118#comment-13125118 ] 

Jonathan Hsieh commented on HBASE-4489:
---------------------------------------

@Dave

Part of me really just would prefer decouple rollingSplit from the  presplit min/max value selection -- maybe change this in to two programs -- a custom presplit table generator program that handles key bounds, and a separate rollingSplit program that just splits based on given key ranges.

I thought that there was agreement that we would keep MD5StringSplit as default for 0.90.  It looks like the default was changed to UniformSplit from MD5StringSplit in both patches.   While I generally agree with your point #3, it is a in 0.90 and would be a compatibility problem for anyone who depends on it.   Would it make sense to change the default in trunk/0.92 (I'm fine with that) but leave 0.90.x as is?    

Nice functional test.  Did you consider just doing a unit test on the split algorithm along with the cluster spinning functional test?  I believe HBaseAdmin.create(HTableDescriptor htd,byte startKeys[][]) is well tested and would make the non @Ignored portions quicker.  I can see how you need this setup for testing rollingSplit.

Interesting div 0 bug.  More testing, less surprises!

Any reason why in testCreatePressplitTable you go to -0x71, 0x81 .. -0x11 instead of just going to 0x8f, 0x9f .. 0xff?  Though more verbose,  I think it is easier to read and follow if you use "positive" hex and cast all of them with (byte), or write out single longs and convert?


                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Todd Lipcon (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116683#comment-13116683 ] 

Todd Lipcon commented on HBASE-4489:
------------------------------------

bq. I think lack of unit test is not a gating item for this JIRA.

Why not? Lack of unit tests is what caused the bug in the first place. This is trivially testable.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142642#comment-13142642 ] 

Hudson commented on HBASE-4489:
-------------------------------

Integrated in HBase-TRUNK #2402 (See [https://builds.apache.org/job/HBase-TRUNK/2402/])
    [jira] [HBASE-4627] Ability to specify a custom start/end to RegionSplitter

Summary:
[HBASE-4627]

added a custom start/end row to RegionSplitter.  Also solved
an off-by-one error because the end row is prefix-inclusive and not
exclusive.

<a href="https://issues.apache.org/jira/browse/HBASE-4489" title="Better key splitting in RegionSplitter"><del>HBASE-4489</del></a> changed the default endKey on HexStringSplit from 7FFF... to FFFF...  While this is correct, existing users of 0.90 RegionSplitter have 7FFF as the end key in their schema and the last region will not split properly under this new code.  We need to let the user specify a custom start/end key range for when situations like this arise.  Optimally, we should also write the start/end key in META so we could figure this out implicitly instead of requiring the user to explicitly specify it.

Test Plan:
 - mvn test -Dtest=TestRegionSplitter

CC: JIRA

Reviewers: DUMMY_REVIEWER

Differential Revision: 39

nspiegelberg : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestRegionSplitter.java

                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>             Fix For: 0.94.0
>
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120277#comment-13120277 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

I haven't forgotten about this. I'll write the tests as soon as I can make time, hopefully by the end of the week.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116648#comment-13116648 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

+1 on Dave's patches.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Spiegelberg updated HBASE-4489:
---------------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.90.5)
                   0.94.0
           Status: Resolved  (was: Patch Available)

putting in 0.94 since it is an improvement and we don't plan on a long release cycle between 0.92 and 0.94
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>             Fix For: 0.94.0
>
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124756#comment-13124756 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

Please add license to TestRegionSplitter.java
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131274#comment-13131274 ] 

Hudson commented on HBASE-4489:
-------------------------------

Integrated in HBase-TRUNK #2343 (See [https://builds.apache.org/job/HBase-TRUNK/2343/])
    HBASE-4489 Better key splitting in RegionSplitter

nspiegelberg : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/util/TestRegionSplitter.java

                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>             Fix For: 0.94.0
>
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127167#comment-13127167 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

Review comments:

1. With the 'newSplitAlgoInstance' factory, should we have HexStringSplit & UniformSplit be special words to instantiate those classes?  Or do they not need any prefix before the classname because they are in the same namespace?  I just think, in general, you want to run: 
{code}bin/hbase RegionSplitter UniformSplit table{code}
2. We should probably put the 2 well-known class options in the help text.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126292#comment-13126292 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

Re: normalization, I think nspiegelberg agreed in #hbase that normalization can happen in a different ticket. Right nspiegelberg?
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Jonathan Gray (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126282#comment-13126282 ] 

Jonathan Gray commented on HBASE-4489:
--------------------------------------

Historically ASCII has proven a bad choice in key design.  If it's always fixed length, it's less of a big deal and really does come down to space savings vs. readability.  In many applications, row keys are composite keys made up of many different things.  Often times, the key may be preceded by some fixed-length random hash of some sort.

I almost always want to be building these composite keys from fixed-length binary ints/longs and such, rather than fixed-length ascii characters.

If we are talking a straightforward key-val situation with a string-like key, then the usability of ASCII would make sense.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell updated HBASE-4489:
-------------------------------

    Attachment: HBASE-4489-trunk-v1.patch
                HBASE-4489-branch0.90-v1.patch
    
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116639#comment-13116639 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Jonathan Hsieh, thanks for your thoughts.

When you say you agree with jgray: he actually wants to do two things. (1) stop using ASCII and (2) remove the 0x7F range bug. It sounds like you only agree with removing the 0x7F range bug but not with avoiding ASCII, for the default split algorithm? 

I agree in principle with your comment about preserving behavior between minor releases. If there were a valid use case for the existing code, I would agree that we should leave it. But given its current brokenness, we should fix it all the way instead of creating an intermediate slightly-broken state that falls short of a real fix. We're already breaking any existing use cases by virtue of fixing the range bug. We should not create another generation of broken use cases before making the real fix, IMO.

I agree that tests would be a good idea. I'll hopefully find some time for that soon.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126767#comment-13126767 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

nspiegelberg: going to change the upper bound for MD5StringSplit from 7FFF... to FFFF...  OK with you?
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125495#comment-13125495 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Nicolas, did you know HBase keys are lexicographically ordered? http://wiki.apache.org/hadoop/Hbase/DataModel

Otherwise I can't see how variables (integer, long) are relevant to the problem.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-4489:
--------------------------

    Fix Version/s: 0.90.5
    
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126301#comment-13126301 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

@Dave : correct on normalization function.  I'll open up another ticket
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115068#comment-13115068 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

@Dave:
I meant that numRegions could be negative. This is minor.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Mingjie Lai (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114982#comment-13114982 ] 

Mingjie Lai commented on HBASE-4489:
------------------------------------

Dave. 

Changing a default configuration may affect existing users:

{code}
     Class<? extends SplitAlgorithm> splitClass = conf.getClass(
-        "split.algorithm", MD5StringSplit.class, SplitAlgorithm.class);
+        "split.algorithm", UniformSplit.class, SplitAlgorithm.class);
     SplitAlgorithm splitAlgo;
{code}


                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126290#comment-13126290 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Nicolas and I just chatted in #hbase. Now we think that forcing the user to specify the split algorithm makes the most sense; we don't want to implicitly assume anything about the user's key distribution since badness could happen in either case. Jonathan Gray agreed with the idea.

Everyone OK with this?
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Jonathan Gray (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115211#comment-13115211 ] 

Jonathan Gray commented on HBASE-4489:
--------------------------------------

+1 that keyspace split should be 0x00..00 to 0xff..ff and not ascii or 0x7f.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115870#comment-13115870 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Ted: Bytes.split() will check the number of splits and throw IllegalArgumentException if <=0. It seems like a shame to add more code to duplicate a check that's already being done.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115054#comment-13115054 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

bq. since an MD5 hash is just a 128-bit number and can start with any digit.
The above was confirmed by the creator of MD5 hash.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131183#comment-13131183 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

Code-wise, I think everything looks fine.  I'm going to make a minor adjustment to the unit tests prior to commit. Namely: we should not add unit tests and then put down @Ignore.  Instead, remove these unit tests and add them back in with HBASE-4567.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch, HBASE-4489-trunk-v4.patch, HBASE-4489-trunk-v5.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115051#comment-13115051 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

Mingjie, I would agree with you if the existing behavior was sane, but it has some problems:

1. Using ASCII strings as keys is a poor choice, and to have it be a default in a builtin tool would send the wrong message. Since HFiles repeat the key for every cell in the table, small key size is very important.

2. The MD5StringSplit class contains a bug that makes the current behavior even less sane. It assumes that an ASCII hex representation of an MD5 hash begins with 0, 1, 2, 3, 4, 5, 6, or 7. This is incorrect, since an MD5 hash is just a 128-bit number and can start with any digit. The result will be a single oversized region at the high end of the key space.

So as far as I can tell, the existing behavior does the wrong thing, and furthermore does it wrongly. We shouldn't preserve this situation. 

If I've misunderstood the situation I definitely welcome corrections.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125481#comment-13125481 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Nicolas,

In response to your #1:
> "If you want to save space, you should probably switch to a UINT32 for the key range instead of the current INT64. This should scale for up to 2 million regions."

Users can use whatever length of keys they want. RegionSplitter just chooses the region boundaries, which has no effect on space consumption.

In response to your #2:
Every cryptographic hash function distributes its values uniformly across the space of byte strings of length N. So that makes UniformSplit a sensible default when using MD5 hashes or SHA1 hashes or whatever else, as long as they're not converted to ASCII. 

The goal here is sane default behavior that makes sense for typical use cases. Evenly dividing the key space accomplishes that goal. At least that's *my* goal.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116839#comment-13116839 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Ted, fine with me.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Nicolas Spiegelberg (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125474#comment-13125474 ] 

Nicolas Spiegelberg commented on HBASE-4489:
--------------------------------------------

1. If you want to save space, you should probably switch to a UINT32 for the key range instead of the current INT64.  This should scale for up to 2 million regions.

2. We (FB) have this specified for MD5StringSplit elsewhere, but I think an mandatory requirement if your goal is "better key splitting" is to provide the hash function that, given a user-specified key, hash it and return what you need to insert into HBase.  Maybe add a generateKey() function to SplitAlgorithm interface.  Worst case is we make it an identity operation.  

Talked with StumbleUpon & Salesforce guys last week about making this something that users can enable and the HBase Client would transparently do for them.  However, step one is giving users the APIs they need to manually do proper key distribution.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-branch0.90-v3.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch, HBASE-4489-trunk-v3.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116830#comment-13116830 ] 

Ted Yu commented on HBASE-4489:
-------------------------------

@Dave, @Jonathan:
Shall we do the following to move this forward ?
* fix the range bug in MD5StringSplit
* provide unit tests
* keep MD5StringSplit as the default

Thanks
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Jonathan Hsieh (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116021#comment-13116021 ] 

Jonathan Hsieh commented on HBASE-4489:
---------------------------------------

A few thoughts:

I agree with jgray -- I think one fix should correct the MD5 string split so that it splits from 0x00.. 0xff.  I think there could be another separate patch that adds the UniformSplit.  

I'd be wary of changing the default, especially if this is means to go into a 0.90.x branch.  It looks like as a user you can add and use the UniformSplit by changing the conf option. 

Ideally patches with new functionality or changing semantics would also introduce corresponding tests.  There were no tests on the previous code, and no tests in on the newly introduced code.  Adding tests especially around edge cases could accommodate Ted's concerns, and it doesn't really hurt to be extra defensive when coding on non-performance sensitive code.


                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-trunk-v1.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125196#comment-13125196 ] 

Dave Revell commented on HBASE-4489:
------------------------------------

@Ted: will add license in patch -v3.

@Jonathan: 
 * We did agree to leave MD5StringSplit as the default in 0.90, I'll fix that
 * I have no objection to turning rollingsplit into a different utility. That seems out of scope here though; my intent in this ticket was to make RegionSplitter do something sane.
 * Unit tests are a good idea, I'll add some
 * Using positive hex with (byte) casting is a good idea, I'll change that



                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4489) Better key splitting in RegionSplitter

Posted by "Dave Revell (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Revell updated HBASE-4489:
-------------------------------

    Attachment: HBASE-4489-branch0.90-v2.patch
                HBASE-4489-trunk-v2.patch

New patches ending in -v2. These have new tests for RegionSplitter.

Some weirdness: RegionSplitter.rollingSplit() seems to be broken, so it doesn't have any test cases in my code. I opened HBASE-4567 to focus on this. I also included a test case in TestRegionSplitter.java called reproduceDivByZeroFailure() that reproduces the problem. I think fixing this bug is outside the scope of this ticket.
                
> Better key splitting in RegionSplitter
> --------------------------------------
>
>                 Key: HBASE-4489
>                 URL: https://issues.apache.org/jira/browse/HBASE-4489
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.4
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: HBASE-4489-branch0.90-v1.patch, HBASE-4489-branch0.90-v2.patch, HBASE-4489-trunk-v1.patch, HBASE-4489-trunk-v2.patch
>
>
> The RegionSplitter utility allows users to create a pre-split table from the command line or do a rolling split on an existing table. It supports pluggable split algorithms that implement the SplitAlgorithm interface. The only/default SplitAlgorithm is one that assumes keys fall in the range from ASCII string "00000000" to ASCII string "7FFFFFFF". This is not a sane default, and seems useless to most users. Users are likely to be surprised by the fact that all the region splits occur in in the byte range of ASCII characters.
> A better default split algorithm would be one that evenly divides the space of all bytes, which is what this patch does. Making a table with five regions would split at \x33\x33..., \x66\x66...., \x99\x99..., \xCC\xCC..., and \xFF\xFF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira