Posted to dev@hbase.apache.org by "ryan rawson (JIRA)" <ji...@apache.org> on 2010/02/23 09:55:27 UTC

[jira] Created: (HBASE-2251) PE defaults to 1k rows - uncommon use case, and easy to hit benchmarks

PE defaults to 1k rows - uncommon use case, and easy to hit benchmarks
----------------------------------------------------------------------

                 Key: HBASE-2251
                 URL: https://issues.apache.org/jira/browse/HBASE-2251
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: ryan rawson
             Fix For: 0.20.4, 0.21.0


The PerformanceEvaluation uses 1k rows, which I would argue is an uncommon case, and it also provides an easy-to-hit performance goal.  Most of the harder performance issues happen at the low and high ends of cell size.  In our own application, our key sizes range from 4 bytes to maybe 100 bytes, very rarely 1000 bytes.  If we have large values, they are VERY large -- multiple KB in size.

Recently a change went into HBase that ran well under PE because the in-memory overhead of 1k rows is very low, but with small rows the performance hit would have been much larger.  This is because the per-value overhead (e.g. the node objects of the skip list/memstore) is amortized better with 1k values.

We should make this a tunable setting with a low default.  I would argue for a 10-30 byte default.
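
To make the amortization argument concrete, here is a rough back-of-the-envelope sketch; the 100 bytes of fixed per-entry overhead is an illustrative assumption, not a measured figure for the memstore:

{code:java}
// Rough illustration of why 1k values hide per-entry overhead. The 100-byte
// fixed cost (skip-list node, object headers, KeyValue bookkeeping) is an
// assumed number for illustration, not a measurement.
public class OverheadAmortization {
  public static void main(String[] args) {
    final int perEntryOverhead = 100;
    for (int valueSize : new int[] { 10, 30, 100, 1000 }) {
      double overheadFraction =
          (double) perEntryOverhead / (perEntryOverhead + valueSize);
      System.out.printf("value=%4d bytes -> overhead is %.0f%% of memory%n",
          valueSize, overheadFraction * 100);
    }
    // With 1000-byte values the fixed overhead is ~9% of the memory used;
    // with 10-byte values it dominates at ~91%, so per-entry costs show up
    // much more clearly in a small-value benchmark.
  }
}
{code}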



[jira] Commented: (HBASE-2251) PE defaults to 1k rows - uncommon use case, and easy to hit benchmarks

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837317#action_12837317 ] 

Todd Lipcon commented on HBASE-2251:
------------------------------------

bq. Then I can write a Hudson plugin that fails a build if performance is out of line beyond some threshold. What do you think?

Even in the absence of automatically failing builds, Hudson has a facility to easily generate a graph with build # on the x axis and arbitrary data on the y axis -- you just have to generate the data in .properties format for each build. At a web company I worked for in the past, we had graphs of # db queries, # cache queries, page load time, etc., for each of the important pages on the site. It was very easy to spot bad commits, and also easy to see if we were inching up slowly over time (even more insidious than a bad commit, IMO).
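
For reference, a minimal sketch of how PE-style numbers could be dumped in that format; the file naming and the YVALUE key are assumptions about how the Hudson plot job would be configured, not something PE does today:

{code:java}
// Write one metric per benchmark into a .properties file that a Hudson
// plot job can pick up and graph across builds.
import java.io.FileWriter;
import java.io.IOException;
import java.util.Properties;

public class PerfMetricsDump {
  public static void writeMetric(String name, double value) throws IOException {
    Properties props = new Properties();
    // YVALUE is assumed to be the key the plot plugin reads; adjust as needed.
    props.setProperty("YVALUE", String.valueOf(value));
    FileWriter out = new FileWriter(name + ".properties");
    try {
      props.store(out, "PE result for " + name);
    } finally {
      out.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. one file per benchmark, overwritten by every build
    writeMetric("randomRead-rows-per-second", 12345.6);
    writeMetric("sequentialWrite-rows-per-second", 67890.1);
  }
}
{code}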



[jira] Commented: (HBASE-2251) PE defaults to 1k rows - uncommon use case, and easy to hit benchmarks

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837327#action_12837327 ] 

stack commented on HBASE-2251:
------------------------------

What's being described sounds like the Yahoo tool that's supposed to be open-sourced any time soon.

While I think these additions to PE would be sweet, even before that we need to run perf tests before each release so we find these slowdowns ahead of time -- even if it's only PE (though as Dan Washuen pointed out, PE currently clears the memstore, so it's not factored into PE evals -- that needs fixing).



[jira] Commented: (HBASE-2251) PE defaults to 1k rows - uncommon use case, and easy to hit benchmarks

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837464#action_12837464 ] 

ryan rawson commented on HBASE-2251:
------------------------------------

Zipf is OK, but it may not accurately represent common data use patterns.

What I am trying to say here is that big cells represent one scaling challenge, and small cells a different one.  Users often have one or the other, but not a whole lot in between.  Our systems use either small cells or huge ones (> 2k).  Small cells place a higher per-value load; one specific example is the node objects in the memstore kvset, which is what was causing the clone issues.

Hence we need to accurately simulate values in the 1-50ish byte range and in the 1000-12000 (or larger) byte range.  Using a Zipf distribution within each of those ranges would be reasonable, I think.
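
A minimal sketch of what sampling from those two bands could look like; the exponent, the 50/50 band split, and the class itself are illustrative assumptions, not anything PE currently implements:

{code:java}
// Sample value sizes from a bounded Zipf-like distribution: one sampler for
// the small band (1-50 bytes) and one for the large band (1000-12000 bytes).
import java.util.Random;

public class ZipfSizeSampler {
  private final int min;
  private final double[] cdf;
  private final Random rand = new Random();

  public ZipfSizeSampler(int min, int max, double exponent) {
    this.min = min;
    int n = max - min + 1;
    double[] weights = new double[n];
    double total = 0;
    for (int i = 0; i < n; i++) {
      weights[i] = 1.0 / Math.pow(i + 1, exponent); // smaller sizes more likely
      total += weights[i];
    }
    cdf = new double[n];
    double running = 0;
    for (int i = 0; i < n; i++) {
      running += weights[i] / total;
      cdf[i] = running;
    }
  }

  public int nextSize() {
    double u = rand.nextDouble();
    int lo = 0, hi = cdf.length - 1;
    while (lo < hi) {                 // lower-bound binary search on the CDF
      int mid = (lo + hi) >>> 1;
      if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    return min + lo;
  }

  public static void main(String[] args) {
    ZipfSizeSampler small = new ZipfSizeSampler(1, 50, 1.0);
    ZipfSizeSampler large = new ZipfSizeSampler(1000, 12000, 1.0);
    Random coin = new Random();
    for (int i = 0; i < 10; i++) {
      // pick a band at random, then a size within it
      System.out.println((coin.nextBoolean() ? small : large).nextSize());
    }
  }
}
{code}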



[jira] Commented: (HBASE-2251) PE defaults to 1k rows - uncommon use case, and easy to hit benchmarks

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837298#action_12837298 ] 

Todd Lipcon commented on HBASE-2251:
------------------------------------

I would argue a Zipf-law distribution probably reflects reality. Constant-sized rows are probably friendlier on GC / fragmentation / etc., no? There should at least be *some* rows that are bigger than the HFile block size, too.



[jira] Commented: (HBASE-2251) PE defaults to 1k rows - uncommon use case, and easy to hit benchmarks

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837309#action_12837309 ] 

Andrew Purtell commented on HBASE-2251:
---------------------------------------

A Zipf-law distribution is good for simulating web-sourced content. We run internal performance benchmarks based on that, so +1 on that notion.

We should also include runs with all data items as serialized longs, another use case that I would expect to be common. I think this is what Ryan was getting at.

Also, while we're here, I have a wish: that PE had a mode where, given no arguments other than the number of clients, it performs the full suite of performance tests and dumps the result as plain text, and also as XML if a command-line flag toggles it. Then I can write a Hudson plugin that fails a build if performance is out of line beyond some threshold. What do you think?
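
A rough sketch of the kind of gate such a plugin (or even a plain build step) could apply; the one-number-per-file format and the 10% threshold are assumptions for illustration:

{code:java}
// Compare the current PE result against a stored baseline and exit non-zero
// if the regression exceeds a threshold, so the build can be failed.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PerfGate {
  public static void main(String[] args) throws IOException {
    double baseline = readValue(args[0]);   // e.g. baseline.txt
    double current = readValue(args[1]);    // e.g. current.txt
    double allowedDrop = 0.10;              // fail on a >10% regression
    if (current < baseline * (1.0 - allowedDrop)) {
      System.err.printf("FAIL: %.1f is more than %.0f%% below baseline %.1f%n",
          current, allowedDrop * 100, baseline);
      System.exit(1);
    }
    System.out.printf("OK: %.1f vs baseline %.1f%n", current, baseline);
  }

  private static double readValue(String file) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(file));
    try {
      return Double.parseDouble(in.readLine().trim());
    } finally {
      in.close();
    }
  }
}
{code}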
