Posted to issues@hbase.apache.org by "Oliver Meyn (Created) (JIRA)" <ji...@apache.org> on 2012/02/15 15:36:59 UTC

[jira] [Created] (HBASE-5402) PerformanceEvaluation creates the wrong number of rows in randomWrite

PerformanceEvaluation creates the wrong number of rows in randomWrite
---------------------------------------------------------------------

                 Key: HBASE-5402
                 URL: https://issues.apache.org/jira/browse/HBASE-5402
             Project: HBase
          Issue Type: Bug
          Components: test
            Reporter: Oliver Meyn


The command line 'hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 10' should result in a table with 10 * (1024 * 1024) rows (so 10485760). Instead, the randomWrite job reports writing exactly that many rows, but running rowcounter against the table reveals only e.g. 6549899 rows. A second attempt to build the table produced a slightly different count (e.g. 6627689). I see a similar discrepancy when using 50 clients instead of 10 (the table is ~35% smaller than expected).

Further experimentation reveals that the problem is key collisions - by removing the % totalRows in getRandomRow I saw a reduction in collisions (the table was ~8M rows instead of 6.6M). Replacing the random Integer row key with a UUID solved the problem completely and produced exactly 10485760 rows. But that makes the key 16 bytes instead of the current 10, so I'm not sure that's an acceptable solution.
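As a sanity check on the collision theory: drawing n keys uniformly at random from n possible values is the classic birthday problem, and the expected number of distinct keys is n * (1 - (1 - 1/n)^n), which approaches n * (1 - 1/e), about 63%. A minimal standalone Java snippet (illustrative only, not part of PerformanceEvaluation) shows this matches the observed counts:

  // Illustrative, standalone check: expected distinct rows when n writes
  // each pick a key uniformly at random from n possible values.
  public class CollisionCheck {
    public static void main(String[] args) {
      long n = 10L * 1024 * 1024;                       // 10485760 writes/keys
      double pNeverPicked = Math.pow(1.0 - 1.0 / n, n); // ~1/e per key
      double expectedDistinct = n * (1.0 - pNeverPicked);
      System.out.printf("expected distinct rows: %.0f%n", expectedDistinct);
      // ~6.63M, in line with the observed 6549899 and 6627689
    }
  }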

Here's the UUID code I used:

  // requires: import java.util.UUID;
  /** Serializes a UUID into a big-endian 16-byte array. */
  public static byte[] format(final UUID uuid) {
    long msb = uuid.getMostSignificantBits();
    long lsb = uuid.getLeastSignificantBits();
    byte[] buffer = new byte[16];

    // Most-significant long fills bytes 0-7, high byte first.
    for (int i = 0; i < 8; i++) {
      buffer[i] = (byte) (msb >>> 8 * (7 - i));
    }
    // Least-significant long fills bytes 8-15; (15 - i) keeps the shift
    // distance in [0, 56] instead of relying on Java's masking of
    // negative shift distances.
    for (int i = 8; i < 16; i++) {
      buffer[i] = (byte) (lsb >>> 8 * (15 - i));
    }

    return buffer;
  }

which is invoked within getRandomRow with 

return format(UUID.randomUUID());

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5402) PerformanceEvaluation creates the wrong number of rows in randomWrite

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208618#comment-13208618 ] 

Jean-Daniel Cryans commented on HBASE-5402:
-------------------------------------------

I don't see this as an issue: it does create the right number of rows if you count versions.
                

[jira] [Commented] (HBASE-5402) PerformanceEvaluation creates the wrong number of rows in randomWrite

Posted by "Jean-Daniel Cryans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209531#comment-13209531 ] 

Jean-Daniel Cryans commented on HBASE-5402:
-------------------------------------------

bq. The problem is that the resulting table is what any subsequent PE scan test uses, and those tests then read some rows twice rather than reading every row once

Following this logic, how would a random read test work with keys that are UUIDs? You'd have to be lucky to get even a couple of hits :)

bq. This is counter-intuitive, and it also introduces the possibility of cache hits, which I think is not what users running a scan test expect.

Considering that blocks are 64KB and rows are ~1.5KB (keys + value), each block holds roughly 40 rows, so cache hits are going to happen no matter what.
                

[jira] [Commented] (HBASE-5402) PerformanceEvaluation creates the wrong number of rows in randomWrite

Posted by "Oliver Meyn (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210131#comment-13210131 ] 

Oliver Meyn commented on HBASE-5402:
------------------------------------

JD has succinctly dismantled my UUID argument, so UUIDs are out :) And I hadn't thought through the caching issue, so I obviously need to learn some more. I like Todd's idea of each mapper randomizing some fixed range of keys - that buys a predictable number of rows and also a well-defined key space for doing random reads.
                

[jira] [Commented] (HBASE-5402) PerformanceEvaluation creates the wrong number of rows in randomWrite

Posted by "Oliver Meyn (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209227#comment-13209227 ] 

Oliver Meyn commented on HBASE-5402:
------------------------------------

I agree that the test itself is accurate in as much as it writes the number of rows it says it will (counting multiple versions). The problem is that the resulting table is what any subsequent PE scan test uses, and those tests then read some rows twice rather than reading every row once. This is counter-intuitive, and it also introduces the possibility of cache hits, which I think is not what users running a scan test expect.
                

[jira] [Commented] (HBASE-5402) PerformanceEvaluation creates the wrong number of rows in randomWrite

Posted by "Todd Lipcon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209550#comment-13209550 ] 

Todd Lipcon commented on HBASE-5402:
------------------------------------

Why don't we change it so that, instead of picking keys at random, each client just counts from 1 to 1M but puts the bits through some kind of blender? The easiest blender is to reverse the bit order, but we could also do more creative swapping. Each mapper could also take its assigned range of keys, break it into 100 subranges, and randomize each of the subranges (see the sketch below).
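
A minimal sketch of both variants (illustrative only; the class and method names are made up, and this is not PE's actual implementation):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  // Sketch of the "blender" idea: deterministic counters turned into
  // scattered-looking keys without any possibility of collision.
  public class BlenderKeys {

    // Variant 1: bit-reverse the counter. Integer.reverse is a bijection,
    // so every i in [0, totalRows) maps to a unique, well-scattered key.
    static int reversedKey(int i) {
      return Integer.reverse(i);
    }

    // Variant 2: split a mapper's assigned range into subranges and
    // shuffle each one; every key is still written exactly once.
    static List<Integer> shuffledSubrange(int start, int length) {
      List<Integer> keys = new ArrayList<>(length);
      for (int k = start; k < start + length; k++) {
        keys.add(k);
      }
      Collections.shuffle(keys);
      return keys;
    }
  }

Either way the job still writes exactly totalRows distinct keys, and a later random-read test can draw from the same known key space.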
                