
Posted to dev@hbase.apache.org by "Dave Latham (JIRA)" <ji...@apache.org> on 2010/02/23 00:36:27 UTC

[jira] Created: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

New MemStoreScanner copies memstore for each scan, makes short scans slow
-------------------------------------------------------------------------

                 Key: HBASE-2248
                 URL: https://issues.apache.org/jira/browse/HBASE-2248
             Project: Hadoop HBase
          Issue Type: Bug
    Affects Versions: 0.20.3
            Reporter: Dave Latham
             Fix For: 0.20.4
         Attachments: threads.txt

HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.

After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.   The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted, which traverses the entire map of key values to copy it.
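
To illustrate the cost described above, here is a minimal sketch, not the HBase code itself. It assumes a ConcurrentSkipListMap<String, String> standing in for the memstore; the class and method names (MemstoreCopyDemo, scanViaClone, scanViaView) are hypothetical. It contrasts a scan that clones the whole map first with one that reads a live tailMap view.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical stand-in for a memstore; not the HBase MemStore class.
class MemstoreCopyDemo {
    /** Builds a sorted map with n entries keyed row-00000000, row-00000001, ... */
    static ConcurrentSkipListMap<String, String> fill(int n) {
        ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();
        for (int i = 0; i < n; i++) {
            map.put(String.format("row-%08d", i), "v");
        }
        return map;
    }

    /** Copy-per-scan: clone() rebuilds every node before the first row is read. */
    static int scanViaClone(ConcurrentSkipListMap<String, String> memstore,
                            String startRow, int limit) {
        ConcurrentSkipListMap<String, String> copy = memstore.clone(); // cost grows with map size
        return countFrom(copy.tailMap(startRow, true), limit);
    }

    /** View-per-scan: tailMap() returns a live view, so no nodes are copied up front. */
    static int scanViaView(ConcurrentSkipListMap<String, String> memstore,
                           String startRow, int limit) {
        return countFrom(memstore.tailMap(startRow, true), limit);
    }

    /** Counts at most limit keys from the given view. */
    private static int countFrom(ConcurrentNavigableMap<String, String> view, int limit) {
        int rows = 0;
        for (String key : view.keySet()) {
            if (rows == limit) {
                break;
            }
            rows++;
        }
        return rows;
    }
}
```

Both methods return the same rows on a quiescent map; only the setup cost differs. The live view on its own gives up the point-in-time isolation that the clone was presumably introduced to provide, so it is a cost comparison rather than a proposed fix.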

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837562#action_12837562 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

@Dave: 

Correct you are.  I've added comments on HBASE-2249 as a result of your comments here...

It's worth noting that in the case of ScanTest the cost of setting up the ResultScanner is almost non-existent compared to the cost of scanning over the majority of the table.  The ScanTest takes 23 seconds in total according to the log output (including opening the scanner etc).

Dave, the numbers I posted above (9ms) were from the RandomScanWithRangeTest.  As you mention, these tests include the cost of opening the scanner.  I was under the impression that this was closer to your use case (e.g. specify both a scan.startRow and scan.stopRow which returns a small number of rows)...?


[jira] Updated: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson updated HBASE-2248:
-------------------------------

    Summary: Provide new non-copy mechanism to assure atomic reads in get and scan  (was: New MemStoreScanner copies memstore for each scan, makes short scans slow)


[jira] Commented: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841488#action_12841488 ] 

Jonathan Gray commented on HBASE-2248:
--------------------------------------

Might be time to turn gets into scans so we don't have a second read code path.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837326#action_12837326 ] 

Dave Latham commented on HBASE-2248:
------------------------------------

Thanks, Dan, and others for looking into this issue.  The table where we were seeing these slow scans was definitely a tall, narrow table.  Each row has one cell, the column family and qualifier are each one byte.  The row varies, but is typically 8-20 bytes, and the value is usually 4 bytes or less.  Most common is probably row - 12 bytes, col fam - 1 byte, qualifier 1 byte, value - 3 bytes, giving 17 bytes plus overhead.

As I was trying to understand the discrepancy between the PE results you mentioned and what I've observed, I looked into PerformanceEvaluation.  It looks like the timer only starts after the scanner is constructed, which means that the MemStore clone isn't being timed as part of the test, so that would probably explain why the test seems fast.  Just reasoning, it seems hard to believe that ConcurrentSkipListMap.buildFromSorted could complete a million iterations that fast.
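
The measurement gap described here can be sketched generically. This is an illustration under stated assumptions, not the PerformanceEvaluation code: ScanTimingDemo, openScanner and clock are hypothetical names, and the scanner is modelled as a plain Iterator<String>. The only difference between the two methods is whether the clock starts before or after the scanner is opened.

```java
import java.util.Iterator;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical timing harness; the clock is injected so the behaviour is easy to check.
class ScanTimingDemo {
    /** Starts the clock after the scanner exists, so setup cost is NOT measured. */
    static long timeIterationOnly(Supplier<Iterator<String>> openScanner, LongSupplier clock) {
        Iterator<String> scanner = openScanner.get();
        long start = clock.getAsLong();
        drain(scanner);
        return clock.getAsLong() - start;
    }

    /** Starts the clock before opening the scanner, so setup cost IS measured. */
    static long timeOpenAndIteration(Supplier<Iterator<String>> openScanner, LongSupplier clock) {
        long start = clock.getAsLong();
        Iterator<String> scanner = openScanner.get();
        drain(scanner);
        return clock.getAsLong() - start;
    }

    /** Reads every row from the scanner and returns how many were seen. */
    private static int drain(Iterator<String> scanner) {
        int rows = 0;
        while (scanner.hasNext()) {
            scanner.next();
            rows++;
        }
        return rows;
    }
}
```

With a real clock, pass System::nanoTime as the clock argument. If opening the scanner is where the memstore copy happens, only the second method would show it.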


[jira] Updated: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson updated HBASE-2248:
-------------------------------

    Attachment: HBASE-2248-ryan.patch

OK, here is my proposal to fix this, hopefully once and for all.

The only thing that isn't covered is deletes:
- removing key values won't ever be atomic
- we could stop deleting key values, but the get code would have to be checked
-- the flush would also need to prune out deleted key values to keep the delete invariant of 'get' going.



[jira] Commented: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "Yoram Kulbak (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841579#action_12841579 ] 

Yoram Kulbak commented on HBASE-2248:
-------------------------------------

Turning gets into scans will cause some minor functional changes. See for example the differences between gets and scans exposed in TestClient#testDeletes. IMHO eliminating the functional differences between gets and scans will be a change for the better, but perhaps there are existing users who rely on those subtle differences.


[jira] Updated: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2248:
-------------------------

    Attachment: Screen shot 2010-02-23 at 10.33.38 AM.png

There is a bunch of YG GC'ing going on... Might slow things some but not by much.  I've attached a screen shot.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837497#action_12837497 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

@Dave: Could you have a look at HBASE-2249 and confirm that the call to HTable.getScanner(...) is now being timed?


[jira] Commented: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841687#action_12841687 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

I think your suggestion is a good one: the race condition is really small, and holding up a client for just a few more microseconds should be reasonable.  Once we restructure to not put log appends between memstore puts, we are literally talking about the speed of adding a few dozen entries in an array.  There is no data copy involved, since the KeyValue was already read in during RPC time, and we are talking about inserting small objects into a data structure.

I originally thought of being speedy about returning, but read-your-own-writes does make this an issue.  I'll add in your suggestions and put this test in as well.
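
One possible reading of the trade-off above, sketched under assumptions: the writer is held until its own write has become visible to new readers, which is what makes read-your-own-writes hold. The suggestion being replied to is not quoted in this message, so ReadPointTracker and its methods are hypothetical names and this is not the patch itself.

```java
import java.util.TreeSet;

// Hypothetical tracker: each write gets a sequence number, and the read point
// only advances past writes that have fully completed.
class ReadPointTracker {
    private long nextWriteNumber = 1;
    private long readPoint = 0;
    private final TreeSet<Long> inFlight = new TreeSet<>();

    /** Registers a new write and returns its sequence number. */
    synchronized long beginWrite() {
        long writeNumber = nextWriteNumber++;
        inFlight.add(writeNumber);
        return writeNumber;
    }

    /** Marks a write finished and moves the read point up to just below the oldest in-flight write. */
    synchronized void completeWrite(long writeNumber) {
        inFlight.remove(writeNumber);
        readPoint = inFlight.isEmpty() ? nextWriteNumber - 1 : inFlight.first() - 1;
        notifyAll();
    }

    /** Holds the caller until its own write is visible to new readers. */
    synchronized void awaitVisible(long writeNumber) {
        boolean interrupted = false;
        while (readPoint < writeNumber) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true; // keep waiting, restore the flag afterwards
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }

    /** Highest write number that new readers are allowed to see. */
    synchronized long readPoint() {
        return readPoint;
    }
}
```

In this sketch a writer would call beginWrite(), insert its entries, call completeWrite(), and then awaitVisible() before replying to the client. The wait is only as long as it takes any older in-flight writes to finish.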

Thanks for the great test!


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838750#action_12838750 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

I have a prototype implementation of how to fix the atomic read without using locking or copying.  I'll put up a patch within a few days.  It's a little subtle, but put simply it uses sequential "Timestamps" to internally version the memstore so people know to ignore half-written rows.
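
The idea of internal sequential versions can be sketched as follows. This is a simplified illustration, not the prototype mentioned above: VersionedMemstoreSketch, Cell, writeNumber and readPoint are hypothetical names, each key is assumed to be written only once, and writes are assumed to complete in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: readers skip cells from writes that have not completed yet.
class VersionedMemstoreSketch {
    /** A value tagged with the sequence number of the write that produced it. */
    static final class Cell {
        final String value;
        final long writeNumber;

        Cell(String value, long writeNumber) {
            this.value = value;
            this.writeNumber = writeNumber;
        }
    }

    private final ConcurrentSkipListMap<String, Cell> cells = new ConcurrentSkipListMap<>();
    private final AtomicLong nextWriteNumber = new AtomicLong(1);
    private volatile long readPoint = 0; // highest completed write number

    /** Hands out the sequence number for a new multi-cell write. */
    long beginWrite() {
        return nextWriteNumber.getAndIncrement();
    }

    /** Inserts one cell of a write; it stays invisible until the write completes. */
    void put(long writeNumber, String key, String value) {
        cells.put(key, new Cell(value, writeNumber));
    }

    /** Publishes the write. Assumes writes complete in the order they began. */
    void completeWrite(long writeNumber) {
        readPoint = writeNumber;
    }

    /** Scans without locking or copying, ignoring cells newer than the read point. */
    List<String> scan() {
        long snapshot = readPoint; // fixed for the duration of this scan
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, Cell> entry : cells.entrySet()) {
            if (entry.getValue().writeNumber <= snapshot) {
                visible.add(entry.getKey() + "=" + entry.getValue().value);
            }
        }
        return visible;
    }
}
```

A row whose cells are inserted one by one is either fully visible to a scan or not visible at all, because all of its cells share one write number and the read point moves only once the whole write is in.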


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837067#action_12837067 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

K, with PerformanceEvaluation updates running "hbase org.apache.hadoop.hbase.PerformanceEvaluation --rows=1000 scanRange100 10" each scan takes on average 9ms to return a max of 100 rows (random data means they don't usually return 100 rows, average seemed to be around 70 rows).

The setup for that test is as follows:
1 master
4 region servers (12GB heap)

1 million rows set up using:
hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 1

There were four regions all on one host.  Each region had roughly 40MB in the MemStore...


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837146#action_12837146 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

Could you please tell me where your 4k of memory quote is coming from?

The clone() is a deep/shallow clone.  The KeyValues aren't being cloned, but in every other way the clone is a deep clone - it copies all the nodes!  That could be literally a million nodes!  The number of nodes is dependent on your data size... a 64MB memstore can accommodate 1.3m values if your KeyValue size is ~50 bytes.  Or even more if you start kicking in the memstore multiplier during a pending snapshot: you could have 4m+ nodes in a snapshot and an oversized kvset.  Clone is not really viable, it needs to be rolled back.  Furthermore, it doesn't provide atomic protection anyway.
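
The 1.3m figure follows from simple division. A rough check, assuming MB means 1024 * 1024 bytes and ignoring per-entry overhead (MemstoreNodeEstimate and estimateNodes are hypothetical names):

```java
// Back-of-the-envelope estimate of how many skip-list nodes a clone has to copy.
class MemstoreNodeEstimate {
    /** Number of KeyValues that fit in the memstore, ignoring per-entry overhead. */
    static long estimateNodes(long memstoreBytes, long bytesPerKeyValue) {
        return memstoreBytes / bytesPerKeyValue;
    }
}
```

estimateNodes(64L * 1024 * 1024, 50) returns 1342177, so roughly 1.3 million nodes would be copied for every scan that opens against a full memstore of that shape.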


[jira] Updated: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2248:
-------------------------

    Attachment: HBASE-2248-demonstrate-previous-impl-bugs.patch

Patch that restores memstore to how it was.  With this in place, run the memstore unit tests to see how the old implementation was broken.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837141#action_12837141 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

@Todd: I didn't author the change but it relates to the [tests|http://svn.apache.org/viewvc/hadoop/hbase/branches/0.20/src/test/org/apache/hadoop/hbase/regionserver/TestHRegion.java?p2=/hadoop/hbase/branches/0.20/src/test/org/apache/hadoop/hbase/regionserver/TestHRegion.java&p1=/hadoop/hbase/branches/0.20/src/test/org/apache/hadoop/hbase/regionserver/TestHRegion.java&r1=896138&r2=896137&view=diff&pathrev=896138] added with the change.

@Ryan: The tests added to PE as a result of HBASE-2249 seem to indicate that even with a fully loaded MemStore it takes 9ms to complete a scan for ~100 rows with 10 concurrent client VMs hitting a single region server.  That seems to contradict the 1-2 seconds seen by Dave.   The thread dump does seem to indicate the clone, but maybe something else is coming into play as well?  Maybe the additional 4KB memory allocation is bringing GC into it?

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Updated: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Latham updated HBASE-2248:
-------------------------------

    Attachment: hbase-2248.gc

I've got gc logging enabled.  Here's a snapshot of the regionserver for a few minutes during which I ran this test 5 or 6 times and generated 360 short scans.  Let me know if there's any other GC info that would be useful.



> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: hbase-2248.gc, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841169#action_12841169 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

my patch passes all the new tests added by HBASE-2037 which focus on parallelism while doing scans.  

> Provide new non-copy mechanism to assure atomic reads in get and scan
> ---------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: HBASE-2248-demonstrate-previous-impl-bugs.patch, HBASE-2248-ryan.patch, hbase-2248.gc, HBASE-2248.patch, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837343#action_12837343 ] 

stack commented on HBASE-2248:
------------------------------

bq. Can anyone shed light on why HBASE-2037 introduced this clone in the first place? Seems like a totally braindead thing for performance.

Mea Culpa. I should have caught this in review, the non-scalable, expensive full-copy.  Dumb.

I also should have run PE to catch degradation in performance before release, though in this case, according to Dan, as PE is now, we'd not have caught the slowed-down memstore since we flush after each PE run and since the short-scan test is new with no history (a long time ago I wrote up a how-to-release: http://wiki.apache.org/hadoop/Hbase/HowToRelease.  It says PE is required but I think I've not followed this recipe in a good while now).

bq. The 0.20.2 Memstore was using the ConcurrentSkipListMap#tailMap for every row. tailMap incurs an O(log n) overhead when called on a ConcurrentSkipListMap, so the total overhead of scanning the whole memstore may, in some cases, be very close to the overhead of a complete sort of the KVs in memstore.

In the old implementation, we also used to make a copy of a row every time we called next, to protect against the case where the snapshot was removed out from under us.
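
To make the cost difference discussed here concrete, the following is a minimal sketch, not HBase code. It assumes integer keys standing in for KeyValues, and the class and method names (CloneVersusTailMap, scanViaClone, scanViaView) are hypothetical. It contrasts a short scan that first clones a ConcurrentSkipListMap, which copies every entry, with one that uses tailMap, which returns a live view and pays one seek plus the rows actually read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class CloneVersusTailMap {
    /** Builds a map of n integer keys standing in for memstore KeyValues (an assumption). */
    static ConcurrentSkipListMap<Integer, String> build(int n) {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        for (int i = 0; i < n; i++) {
            map.put(i, "v" + i);
        }
        return map;
    }

    /** Short scan over a full copy: the copy touches every entry in the map. */
    static int scanViaClone(ConcurrentSkipListMap<Integer, String> map, int start, int limit) {
        ConcurrentSkipListMap<Integer, String> copy = map.clone();
        return count(copy.tailMap(start, true), limit);
    }

    /** Short scan over a live view: one seek to 'start', then at most 'limit' steps. */
    static int scanViaView(ConcurrentSkipListMap<Integer, String> map, int start, int limit) {
        return count(map.tailMap(start, true), limit);
    }

    /** Counts at most 'limit' entries from the given view. */
    private static int count(ConcurrentNavigableMap<Integer, String> view, int limit) {
        int seen = 0;
        for (Map.Entry<Integer, String> e : view.entrySet()) {
            if (seen == limit) {
                break;
            }
            seen++;
        }
        return seen;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, String> map = build(200_000);
        long t0 = System.nanoTime();
        int a = scanViaClone(map, 1_000, 100);
        long t1 = System.nanoTime();
        int b = scanViaView(map, 1_000, 100);
        long t2 = System.nanoTime();
        System.out.println("clone scan rows=" + a + " nanos=" + (t1 - t0));
        System.out.println("view  scan rows=" + b + " nanos=" + (t2 - t1));
    }
}
```

The view is cheap but not isolated: writers can change what it sees mid-scan, which is the atomicity problem the rest of this thread tries to solve without copying.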

bq. The scanner scans incorrectly when a snapshot exists

Why was this again?

bq. ... increased GC overhead on multiple concurrent scans

Dave, can you enable GC logging?  Even if this is the case, it needs to be addressed.

bq. Is it possible to avoid both 'partial puts' and cloning by 'timestamping' memstore records? e.g. each new KV in memstore gets a 'memstore timestamp' and when a scanner is created it grabs the current timestamp so that it knows to ignore KVs which entered the store after its creation? Should probably use a counter and not currentTimeMillis to ensure a clear-cut.

How would we snapshot such a thing?

We could add another ts/counter to KV.  We could do an AND on the type setting a bit if extra ts is present.  We then write out the KV as old style, dropping extra ts when we flush to hfile, or we just dump it all out.  System would need to be able to work with old-style KVs.  Comparator would be adjusted to accommodate new KV.  We'd do a tailset each time we made a scanner?  This would be a big change.  We should probably bump rpc version and require a restart of hbase cluster on upgrade.

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837074#action_12837074 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

@JD: that would explain it...

With --nomapred (10 client threads in a single VM) each scan took 120-140ms...  

Also, in the randomSeekScan test each scan seems VERY slow.  Each scan takes about 15 seconds...?  The scanRange100 test uses a startRow and stopRow to get 100 rows back (well, 70 rows).  The randomSeekScan test uses "scan.setFilter(new WhileMatchFilter(new PageFilter(120)));".  What's up with that?

Oh, also, those tests are on the latest 0.20 branch (not on the 0.20.3 release)... 

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Updated: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-2248:
-------------------------------

    Attachment: hbase-2248.txt

Here's a patch on top of Ryan's which implements the spin-wait. The concurrency test for read-own-writes now passes.

> Provide new non-copy mechanism to assure atomic reads in get and scan
> ---------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: HBASE-2248-demonstrate-previous-impl-bugs.patch, HBASE-2248-ryan.patch, hbase-2248.gc, HBASE-2248.patch, hbase-2248.txt, readownwrites-lost.2.patch, readownwrites-lost.patch, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839964#action_12839964 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

I'm working on this, there is a general approach hammered out and code to be written.

The approach is like so:

- on read from memstore, for each row, we grab the 'read number' and ignore any keyvalues in the structure newer (ie: > value)
- on put to hregion/memstore, we start a write 'tx' and get a write-number, and put keyvalues with said write-number.  when we are finished, that write-number is 'committed' which causes the read number to be advanced most of the time.  under concurrent writes we have a little queue and slower puts may slightly hold up puts that come before it.

this will need to be extensively tested to see how the performance profile changes. it will allow us to remove the newScannerLock.
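
The two bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch itself: the class ReadWriteNumbers, its WriteEntry type, and the methods beginWrite, commit, readPoint and visible are all hypothetical names, and a single coarse lock stands in for whatever synchronization the real code uses. The point it shows is that the read number only advances over a prefix of committed writes, so an uncommitted earlier put holds back visibility of later ones.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ReadWriteNumbers {
    /** One in-flight write 'tx' (hypothetical type for this sketch). */
    static final class WriteEntry {
        final long writeNumber;
        boolean committed;

        WriteEntry(long writeNumber) {
            this.writeNumber = writeNumber;
        }
    }

    // Writes that have started, oldest first.
    private final Deque<WriteEntry> pending = new ArrayDeque<>();
    private long nextWriteNumber = 1;
    private long readPoint = 0;

    /** Starts a write and hands out the write-number its keyvalues are stamped with. */
    synchronized WriteEntry beginWrite() {
        WriteEntry entry = new WriteEntry(nextWriteNumber++);
        pending.addLast(entry);
        return entry;
    }

    /** Marks a write finished and advances the read number over the committed prefix. */
    synchronized void commit(WriteEntry entry) {
        entry.committed = true;
        while (!pending.isEmpty() && pending.peekFirst().committed) {
            readPoint = pending.removeFirst().writeNumber;
        }
    }

    /** The number a reader grabs before looking at a row. */
    synchronized long readPoint() {
        return readPoint;
    }

    /** A reader ignores keyvalues whose write-number is newer than its read number. */
    static boolean visible(long kvWriteNumber, long readerReadPoint) {
        return kvWriteNumber <= readerReadPoint;
    }

    public static void main(String[] args) {
        ReadWriteNumbers rw = new ReadWriteNumbers();
        WriteEntry slow = rw.beginWrite();
        WriteEntry fast = rw.beginWrite();
        rw.commit(fast);
        System.out.println("read point while the earlier put is open: " + rw.readPoint());
        rw.commit(slow);
        System.out.println("read point after both puts commit: " + rw.readPoint());
    }
}
```

Because readers only compare numbers, nothing in the memstore has to be copied when a scanner is opened.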

> Provide new non-copy mechanism to assure atomic reads in get and scan
> ---------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: HBASE-2248-demonstrate-previous-impl-bugs.patch, hbase-2248.gc, HBASE-2248.patch, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839050#action_12839050 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

deletes use tombstones, but the current GET code might need... adjustment to make it work. I'm working on a base fix which I will post soon and I'll also check the get implementation. 

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: HBASE-2248-demonstrate-previous-impl-bugs.patch, hbase-2248.gc, HBASE-2248.patch, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837202#action_12837202 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

Very good point...  

Even if the clone took the scan start and stop rows into account, there is still the possibility that only one or neither of them has been provided...

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Updated: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2248:
-------------------------

    Attachment: HBASE-2248.patch

Here is an attempt.  Tests pass.  Posting for review.  Need to do load tests yet.

"- Added a (transient) int updateId to KeyValue
- Memstore populates it on Adds and Deletes 
- When a MemstoreScanner is created it grabs the current id (actually increments it to make sure no KV has that same id) and ignores records from kvset having an id greater than the one grabbed. Snapshots are scanned in full since they're not updated during the scanner's lifetime, hence there's no risk of partial updates being visible.  There may be an issue with deletes becoming partly visible in this scheme, I'll check that later."
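
As a rough illustration of the updateId scheme quoted above (not the attached patch), here is a minimal sketch. The class UpdateIdMemstore, the Kv type, and the methods add, openScanner and scan are hypothetical, and a plain string key replaces the real KeyValue, so multiple versions of a key are not modelled. Snapshots and deletes are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

class UpdateIdMemstore {
    /** Simplified stand-in for KeyValue carrying the in-memory-only update id. */
    static final class Kv {
        final String key;
        final String value;
        final long updateId;

        Kv(String key, String value, long updateId) {
            this.key = key;
            this.value = value;
            this.updateId = updateId;
        }
    }

    private final ConcurrentSkipListMap<String, Kv> kvset = new ConcurrentSkipListMap<>();
    private final AtomicLong idSource = new AtomicLong();

    /** The memstore stamps each added record with a fresh id. */
    void add(String key, String value) {
        kvset.put(key, new Kv(key, value, idSource.incrementAndGet()));
    }

    /** A scanner increments the counter too, so no record ever shares its id. */
    long openScanner() {
        return idSource.incrementAndGet();
    }

    /** Walks the live set in key order, skipping records added after the scanner was opened. */
    List<String> scan(long scannerId) {
        List<String> rows = new ArrayList<>();
        for (Kv kv : kvset.values()) {
            if (kv.updateId > scannerId) {
                continue;
            }
            rows.add(kv.key);
        }
        return rows;
    }

    public static void main(String[] args) {
        UpdateIdMemstore memstore = new UpdateIdMemstore();
        memstore.add("row1", "a");
        memstore.add("row2", "b");
        long scannerId = memstore.openScanner();
        memstore.add("row3", "c"); // arrives after the scanner was created
        System.out.println("old scanner sees " + memstore.scan(scannerId));
        System.out.println("new scanner sees " + memstore.scan(memstore.openScanner()));
    }
}
```

The scan iterates the live structure directly, so opening a scanner costs one counter increment instead of a copy of the memstore.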


> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: HBASE-2248-demonstrate-previous-impl-bugs.patch, hbase-2248.gc, HBASE-2248.patch, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837645#action_12837645 ] 

stack commented on HBASE-2248:
------------------------------

@Yoram OK.  Maybe post the patch here if that's possible so others can see the old implementation was broken.  Good stuff.

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: hbase-2248.gc, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836994#action_12836994 ] 

Dave Latham commented on HBASE-2248:
------------------------------------

After doing a flush on the table, the scans are about 100x faster.

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Yoram Kulbak (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837591#action_12837591 ] 

Yoram Kulbak commented on HBASE-2248:
-------------------------------------

I did the following sanity check: I rolled back memstore to just before HBASE-2037 was applied [last commit on 21 Oct 2009]. 
[ To get things going I had to put back the MemStore#numKeyValues method and change the  MemStore#clearSnapshot   argument to SortedSet ]

I then ran TestHRegion and two tests failed:
- testFlushCacheWhileScanning - demonstrates the incorrect scans while a snapshot exists issue
- testWritesWhileScanning - demonstrates 'partial puts' being visible to the scanner
I also tried running TestMemStore but all the tests there have passed. I didn't try running the whole suite.

It took me a while to figure out what exactly goes wrong when a snapshot exists.  The short (and vague) explanation is that the scanner may return keys in a 'non ordered' manner, meaning a KV with a higher row may be returned before a KV with a lower row, because the result list which aggregates results from both snapshot and kvset doesn't guarantee the KVs are added in sorted order.  I think there's a way to add a simple test to TestMemStore which will demonstrate that.
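
For contrast with the ordering problem described above, a result list built from two sorted sources stays sorted only if each step takes the smaller of the two current heads. The sketch below shows that standard two-way merge under simplifying assumptions: plain strings in natural order stand in for KVs, and the class and method names (SortedTwoWayMerge, merge) are hypothetical. It is not the code from either the old or the new scanner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class SortedTwoWayMerge {
    /** Merges two sorted sets into one list, always emitting the smaller current head. */
    static List<String> merge(SortedSet<String> kvset, SortedSet<String> snapshot) {
        List<String> out = new ArrayList<>();
        Iterator<String> a = kvset.iterator();
        Iterator<String> b = snapshot.iterator();
        String x = a.hasNext() ? a.next() : null;
        String y = b.hasNext() ? b.next() : null;
        while (x != null || y != null) {
            // Take from kvset when snapshot is exhausted or kvset's head sorts first.
            if (y == null || (x != null && x.compareTo(y) <= 0)) {
                out.add(x);
                x = a.hasNext() ? a.next() : null;
            } else {
                out.add(y);
                y = b.hasNext() ? b.next() : null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedSet<String> kvset = new TreeSet<>(Arrays.asList("row1", "row4"));
        SortedSet<String> snapshot = new TreeSet<>(Arrays.asList("row2", "row3"));
        System.out.println(merge(kvset, snapshot));
    }
}
```

Appending everything from one source and then everything from the other would place row4 ahead of row2 in this example, which is the kind of out-of-order result the comment describes.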



> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: hbase-2248.gc, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837069#action_12837069 ] 

Jean-Daniel Cryans commented on HBASE-2248:
-------------------------------------------

bq. 1 million rows set up using:

With randomWrite you don't write 1M rows (more like ~700,000 IIRC) so that explains why your scans aren't always of 100 rows.

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837127#action_12837127 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

I had a look at the implementation of clone, and it is really not appropriate for what we are doing.

I would like to open up discussions to revert the original patch.  I would argue there have been too many lurking issues, and the additional functionality, while useful, doesn't justify crippling performance.


[jira] Updated: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Latham updated HBASE-2248:
-------------------------------

    Attachment: threads.txt

Here's some example threads from a dump.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dave Latham (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837519#action_12837519 ] 

Dave Latham commented on HBASE-2248:
------------------------------------

@Dan: Took a read over the patch, though it seemed to be based in a different dir and didn't want to apply nicely.  From what I can see the ScanTest still does getScanner in testSetup before the timer is begun.  This may be fine, if the point of this test is to measure scan performance per-row and not setup/teardown time.  It just explains why the ScanTest doesn't exhibit this issue.  It does look like other tests, such as the RandomSeekScanTest and the new RandomScanWithRangeTest do test setup/teardown time as part of each "testRow" and should exhibit this issue, if run.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837126#action_12837126 ] 

Todd Lipcon commented on HBASE-2248:
------------------------------------

Can anyone shed light on why HBASE-2037 introduced this clone in the first place? Seems like a totally braindead thing for performance.


[jira] Updated: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-2248:
-------------------------------

    Attachment: readownwrites-lost.patch

Here's a test case patch (on top of yours) which should illustrate the issue. It fails every time for me on a dual core box:

Didnt read own writes expected:<395> but was:<394>
junit.framework.AssertionFailedError: Didnt read own writes expected:<395> but was:<394>
        at org.apache.hadoop.hbase.regionserver.TestMemStore$ReadOwnWritesTester.internalRun(TestMemStore.java:293)
        at org.apache.hadoop.hbase.regionserver.TestMemStore$ReadOwnWritesTester.run(TestMemStore.java:268)



[jira] Updated: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2248:
-------------------------

    Comment: was deleted

(was: @Yoram OK.  Maybe post the patch here if that's possible so others can see the old implementation was broken.  Good stuff.)


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837105#action_12837105 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

Cloning the MemStore based on the scan.startRow and scan.stopRow drops the scan times from ~9ms per scan to ~3ms per scan on the above hardware...
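A minimal sketch of the range-limited clone described above, assuming String keys and values as stand-ins for the KeyValues held in the real memstore. The class and method names (RangeCloneSketch, cloneRange) are hypothetical and are not taken from any patch on this issue; only the standard ConcurrentSkipListMap API is used.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class RangeCloneSketch {
    // Copy only the entries in [startRow, stopRow) instead of cloning the whole map.
    static ConcurrentSkipListMap<String, String> cloneRange(
            ConcurrentSkipListMap<String, String> memstore, String startRow, String stopRow) {
        // subMap returns a live view; no entries are copied at this point.
        ConcurrentNavigableMap<String, String> view =
                memstore.subMap(startRow, true, stopRow, false);
        // The sorted-map constructor copies just the entries visible through the view.
        return new ConcurrentSkipListMap<>(view);
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 1000; i++) {
            memstore.put(String.format("row%04d", i), "v" + i);
        }
        ConcurrentSkipListMap<String, String> copy = cloneRange(memstore, "row0100", "row0200");
        System.out.println(copy.size()); // prints 100
    }
}
```

The copy cost then scales with the number of rows in the scan range rather than with the size of the memstore, which is consistent with the improvement reported above, but a scan with no stop row would still copy everything from the start row onwards.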


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837523#action_12837523 ] 

ryan rawson commented on HBASE-2248:
------------------------------------

Done properly, a timestamp-oriented fix to version the memstore should not require any RPC version bump; it's all internal.


[jira] Updated: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-2248:
-------------------------------

    Attachment: readownwrites-lost.2.patch

Here's a slightly better test patch, much more sure to fail.

(this test could easily be written without multiple threads, but as an illustration of the client's view of the consistency, the threads are useful)


[jira] Commented: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841585#action_12841585 ] 

Todd Lipcon commented on HBASE-2248:
------------------------------------

bq. IMHO eliminating the functional differences between gets and scans will be a change for the better, but perhaps there are existing users who rely on those subtle differences

+1 for eliminating the differences. If people are relying on broken behavior, they should fix their applications ;-) HBase is not 1.0; let's pick sanity over compatibility.


[jira] Commented: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841669#action_12841669 ] 

Todd Lipcon commented on HBASE-2248:
------------------------------------

Hey Ryan

I looked over this patch a bit this afternoon. It's clever but I think it can result in loss of read-your-own-writes consistency for a single client. Consider this scenario:

|| Action || Read # || Write # || memstoreRead || memstoreWrite ||
| Client A begins a put on row R | - | 1 | 0 | 1 |
| Client B begins a put on row S | - | 2 | 0 | 2 |
| Client B finishes a put on row S | - | - | 0 | 2 |
| Client B initiates a get on row S | 0 | - | 0 | 2 |

So, since client A's put #1 is still ongoing on a separate row, client B is unable to read version #2 of its row.

I think dropping consistency below read-your-own-writes is bad, even though it's rare that the above situation would occur. Under high throughput I think it's possible to occur, and it could be very very bad if people are relying on this level of consistency to implement transactions, etc.

One possible solution is that completeMemstoreInsert can spin until memstoreRead >= e.getWriteNumber(). Given that it only has to wait for other concurrent writers to finish, a spin on memstoreRead.get() should only go a few cycles and actually be reasonably efficient.

I'll think a bit about whether there are other possible solutions.
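The scenario in the table, and the proposed fix, can be modelled with a small sketch. This is a simplified model under stated assumptions, not the code in the patch: memstoreRead, memstoreWrite and completeMemstoreInsert are the names used in the comment above, while ReadPointSketch, beginMemstoreInsert, markCompleted and readPoint are hypothetical. It also blocks with wait/notify rather than the busy spin on memstoreRead.get() suggested above, purely to keep the example short.

```java
import java.util.TreeSet;

class ReadPointSketch {
    private final TreeSet<Long> inFlight = new TreeSet<>(); // write numbers not yet completed
    private long memstoreWrite = 0; // last write number handed out
    private long memstoreRead = 0;  // highest write number readers may see

    // A writer takes the next write number before touching the memstore.
    synchronized long beginMemstoreInsert() {
        long w = ++memstoreWrite;
        inFlight.add(w);
        return w;
    }

    // Marks a write finished and advances the read point up to, but not past,
    // the oldest write that is still in flight. Does not wait.
    synchronized void markCompleted(long writeNumber) {
        inFlight.remove(writeNumber);
        memstoreRead = inFlight.isEmpty() ? memstoreWrite : inFlight.first() - 1;
        notifyAll();
    }

    // Variant discussed above: do not return to the client until the read
    // point has caught up with this write, so the client can read its own write.
    synchronized void completeMemstoreInsert(long writeNumber) {
        markCompleted(writeNumber);
        boolean interrupted = false;
        while (memstoreRead < writeNumber) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }

    synchronized long readPoint() {
        return memstoreRead;
    }

    public static void main(String[] args) {
        ReadPointSketch rp = new ReadPointSketch();
        long a = rp.beginMemstoreInsert(); // client A, row R: write number 1
        long b = rp.beginMemstoreInsert(); // client B, row S: write number 2
        rp.markCompleted(b);               // B finishes without waiting
        System.out.println(rp.readPoint()); // prints 0: B cannot yet see write 2
        rp.markCompleted(a);               // A finishes
        System.out.println(rp.readPoint()); // prints 2
    }
}
```

In this model a writer only ever waits for writes that started before it, which matches the argument that the wait should be short.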


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839248#action_12839248 ] 

stack commented on HBASE-2248:
------------------------------

Yeah, if a client adds a new edit with the exact same timestamp and the comparator used by the memstore does not take the sequence id into consideration, we'll have the issues Todd identifies.  Perhaps change the Comparator used by MemStore to consider the sequence id?   Also missing from the patch is enforcement of the fact that, on flush, the flush file has deletes that apply to older files only, not to the current flush file's content.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837002#action_12837002 ] 

Dan Washusen commented on HBASE-2248:
-------------------------------------

I notice the performance evaluation flushes the table after each test completes; as a result, none of the read tests take the memstore into account.  Maybe the PerformanceEvaluation class could be changed to make the flush optional?


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Yoram Kulbak (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837281#action_12837281 ] 

Yoram Kulbak commented on HBASE-2248:
-------------------------------------

Ryan:
The 4K quote is my mistake, based on a non-typical HBase usage (small memstore, large KVs).
Cloning is definitely bad. Its only benefit is that it allows the scan to be isolated from ongoing writes; HRegion#newScannerLock ensures that writes do not come in while the scanner is created, so 0.20.3, unlike 0.20.2, does provide protection from 'partial puts', if that is what you meant by 'atomic protection'. There is also a test added to TestHRegion which verifies that.

I'm not sure that rollback is a viable option:  
The 0.20.2 Memstore was using the ConcurrentSkipListMap#tailMap for every row. tailMap incurs an O(log(n)) overhead when called on a ConcurrentSkipListMap so the total overhead of scanning the whole memstore in some cases, may be very close to the overhead of a complete sort of the KVs in memstore.
The 0.20.2 MemStore and MemStoreScanner are also functionally incorrect, since:
- The scanner may observe a 'partial put' (not atomically protected)
- The scanner scans incorrectly when a snapshot exists
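To illustrate the per-row tailMap cost mentioned above, here is a rough sketch that contrasts re-seeking through a fresh tailMap view for every row with seeking once and then iterating. It assumes Integer keys as stand-ins for KeyValues, and the class and method names are hypothetical; it is not the 0.20.2 scanner code, only a model of the access pattern described.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class TailMapScanSketch {
    // Re-seek for every row: each firstKey() on a fresh tailMap view has to
    // locate its starting node again, which costs O(log n) per row.
    static int scanWithTailMapPerRow(ConcurrentSkipListMap<Integer, String> m, int start, int limit) {
        int count = 0;
        Integer key = m.ceilingKey(start);
        while (key != null && count < limit) {
            count++;
            ConcurrentNavigableMap<Integer, String> tail = m.tailMap(key, false);
            key = tail.isEmpty() ? null : tail.firstKey();
        }
        return count;
    }

    // Seek once, then follow the iterator from entry to entry.
    static int scanWithOneIterator(ConcurrentSkipListMap<Integer, String> m, int start, int limit) {
        int count = 0;
        for (Integer ignored : m.tailMap(start, true).keySet()) {
            if (count == limit) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, String> m = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 100_000; i++) {
            m.put(i, "v" + i);
        }
        System.out.println(scanWithTailMapPerRow(m, 500, 100)); // prints 100
        System.out.println(scanWithOneIterator(m, 500, 100));   // prints 100
    }
}
```

Both methods visit the same rows; the difference is that the first pays a logarithmic seek for each row, which is what makes a full pass over n entries approach the cost of a sort.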

Since we observed a considerable 'single scan' performance improvement using the new MemStore implementation, could the performance hit stem from increased GC overhead on multiple concurrent scans?
Note that with 0.20.2 we observed that MemStoreScanner runs slower than StoreFileScanner.

Is it possible to avoid both 'partial puts' and cloning by 'timestamping' memstore records? E.g. each new KV in the memstore gets a 'memstore timestamp', and when a scanner is created it grabs the current timestamp so that it knows to ignore KVs which entered the store after its creation.  This should probably use a counter and not currentTimeMillis to ensure a clear cutoff.
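A minimal sketch of the 'memstore timestamp' idea, assuming a counter as suggested and String keys standing in for KeyValues. All names here (StampedMemstoreSketch, Stamped, openScanner, scan) are hypothetical. The sketch only shows the filtering step; it does not by itself make a multi-KV put atomic, and it assumes distinct keys, whereas real KeyValues carry their own timestamp in the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

class StampedMemstoreSketch {
    // A value together with the counter value assigned when it entered the store.
    static final class Stamped {
        final String value;
        final long memstoreTs;

        Stamped(String value, long memstoreTs) {
            this.value = value;
            this.memstoreTs = memstoreTs;
        }
    }

    private final AtomicLong counter = new AtomicLong();
    private final ConcurrentSkipListMap<String, Stamped> kvs = new ConcurrentSkipListMap<>();

    // Every insert is stamped with the next counter value.
    void put(String key, String value) {
        kvs.put(key, new Stamped(value, counter.incrementAndGet()));
    }

    // A scanner remembers the counter value at the moment it is created.
    long openScanner() {
        return counter.get();
    }

    // The scan walks the live map (no copy) and skips entries stamped after
    // the scanner was created.
    List<String> scan(long scannerTs) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Stamped> e : kvs.entrySet()) {
            if (e.getValue().memstoreTs <= scannerTs) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        StampedMemstoreSketch memstore = new StampedMemstoreSketch();
        memstore.put("a", "1");
        memstore.put("b", "2");
        long scannerTs = memstore.openScanner();
        memstore.put("c", "3"); // arrives after the scanner was opened
        System.out.println(memstore.scan(scannerTs)); // prints [a, b]
    }
}
```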

------------
About those ~50 byte KVs, according to my calcs:
KeyLength: 4 bytes
ValueLength: 4 bytes
rowLength: 2 bytes
FamilyLength: 1 byte
TimeStamp: 8 bytes
Type: 1 byte

There are 20 bytes of overhead to start with.
Adding an average of 10 bytes each for the column family and qualifier brings it to 40 bytes.
This leaves 10 bytes (out of 50) for the row + value, meaning 80% of the storage is overhead.
My point is that if ~50-byte KVs are the common use case, some optimization needs to be made to the way things are stored.
Perhaps you meant 50 bytes for row+value?
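The arithmetic above can be checked with a few lines of Java. The fixed field sizes are the ones listed in the comment; reading the 40-byte figure as 10 bytes each for the family and the qualifier is an assumption, and the class and method names are hypothetical.

```java
class KeyValueOverheadSketch {
    // Fixed per-KV fields as listed in the comment above, in bytes.
    static int fixedOverhead() {
        int keyLength = 4;
        int valueLength = 4;
        int rowLength = 2;
        int familyLength = 1;
        int timestamp = 8;
        int type = 1;
        return keyLength + valueLength + rowLength + familyLength + timestamp + type;
    }

    // Fraction of a KV of totalBytes that is not row or value data.
    static double overheadFraction(int totalBytes, int familyBytes, int qualifierBytes) {
        int overhead = fixedOverhead() + familyBytes + qualifierBytes;
        return (double) overhead / totalBytes;
    }

    public static void main(String[] args) {
        System.out.println(fixedOverhead());              // prints 20
        System.out.println(overheadFraction(50, 10, 10)); // prints 0.8
    }
}
```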




[jira] Resolved: (HBASE-2248) Provide new non-copy mechanism to assure atomic reads in get and scan

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2248.
--------------------------

    Hadoop Flags: [Incompatible change, Reviewed]
    Release Note: This patch changes the Get code path to instead be a Scan of one row.  This means that inserting cells out of timestamp order should work now (tests to verify this will follow as part of HBASE-2294), but also that a delete at an explicit timestamp now overshadows a cell EVEN if the affected cell is put after the delete (the old Get code path did an early-out, so subsequent puts would not see the delete).
      Resolution: Fixed

Thanks to all who contributed to this issue: Todd, Dan, Yoram, and in particular Ryan.
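
The semantics in the release note can be sketched with a toy model of a single column's versions. This is not HBase code: the class VersionedCellSketch and its methods are hypothetical, and the sketch assumes that a delete at an explicit timestamp masks only the version written at exactly that timestamp. Reads pick the newest visible timestamp regardless of arrival order, and a delete marker hides a matching put even when the put arrives later.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of one column's versions, keyed by timestamp. Not HBase code.
class VersionedCellSketch {
    private final TreeMap<Long, String> puts = new TreeMap<>();
    private final Set<Long> deletedTimestamps = new HashSet<>();

    // Arrival order does not matter; versions are ordered by timestamp.
    void put(long ts, String value) {
        puts.put(ts, value);
    }

    // Assumption: a delete marker masks the version at exactly this timestamp.
    void deleteAt(long ts) {
        deletedTimestamps.add(ts);
    }

    // Returns the newest value not masked by a delete marker, or null.
    String get() {
        for (Map.Entry<Long, String> e : puts.descendingMap().entrySet()) {
            if (!deletedTimestamps.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

For example, after put(20, "b") followed by put(10, "a"), get() returns "b". After deleteAt(30) followed by put(30, "c"), get() still returns "b", because the earlier delete marker masks the later put.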

> Provide new non-copy mechanism to assure atomic reads in get and scan
> ---------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>            Assignee: ryan rawson
>            Priority: Blocker
>             Fix For: 0.20.4
>
>         Attachments: HBASE-2248-demonstrate-previous-impl-bugs.patch, HBASE-2248-GetsAsScans3.patch, HBASE-2248-rr-alpha3.txt, HBASE-2248-rr-pre-durability2.txt, HBASE-2248-rr-pre-durability3.txt, HBASE-2248-rr-pre-durability4.txt, hbase-2248.gc, HBASE-2248.patch, hbase-2248.txt, profile.png, put_call_graph.png, readownwrites-lost.2.patch, readownwrites-lost.patch, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted, which traverses the entire map of key values to copy it.


[jira] Commented: (HBASE-2248) New MemStoreScanner copies memstore for each scan, makes short scans slow

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838919#action_12838919 ] 

Todd Lipcon commented on HBASE-2248:
------------------------------------

bq. There may be an issue with deletes becoming partly visible in this scheme

I would think so - deletes in the memstore don't use tombstones, do they? Similarly for updates - if you update a row, its internal ts will update and the scanner will no longer see the old version either.

> New MemStoreScanner copies memstore for each scan, makes short scans slow
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2248
>                 URL: https://issues.apache.org/jira/browse/HBASE-2248
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.3
>            Reporter: Dave Latham
>             Fix For: 0.20.4
>
>         Attachments: HBASE-2248-demonstrate-previous-impl-bugs.patch, hbase-2248.gc, HBASE-2248.patch, Screen shot 2010-02-23 at 10.33.38 AM.png, threads.txt
>
>
> HBASE-2037 introduced a new MemStoreScanner which triggers a ConcurrentSkipListMap.buildFromSorted clone of the memstore and snapshot when starting a scan.
> After upgrading to 0.20.3, we noticed a big slowdown in our use of short scans.  Some of our data represent a time series.  The data is stored in time series order, MR jobs often insert/update new data at the end of the series, and queries usually have to pick up some or all of the series.  These are often scans of 0-100 rows at a time.  To load one page, we'll observe about 20 such scans being triggered concurrently, and they take 2 seconds to complete.  Doing a thread dump of a region server shows many threads in ConcurrentSkipListMap.buildFromSorted, which traverses the entire map of key values to copy it.
