You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by Varun Sharma <va...@pinterest.com> on 2014/01/26 22:14:10 UTC

Sporadic memstore slowness for Read Heavy workloads

Hi,

We are seeing some unfortunately low performance in the memstore - we have
researched some of the previous JIRA(s) and seen some inefficiencies in the
ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
weird points in time - the bug is hard to reproduce and there isn't like a
huge # of extra reads going to that region server or any substantial
hotspot happening. The region server recovers the moment, we flush the
memstores or restart the region server. Our queries retrieve wide rows
which are upto 10-20 columns. A stack trace shows two things:

1) Time spent inside MemstoreScanner.reseek() and inside the
ConcurrentSkipListMap
2) The reseek() is being called at the "SEEK_NEXT" column inside
StoreScanner - this is understandable since the rows contain many columns
and StoreScanner iterates one KeyValue at a time.

So, I was looking at the code and it seems that every single time there is
a reseek call on the same memstore scanner, we make a fresh call to build
an iterator() on the skip list set - this means we an additional skip list
lookup for every column retrieved. SkipList lookups are O(n) and not O(1).

Related JIRA HBASE 3855 made the reseek() scan some KVs and if that number
if exceeded, do a lookup. However, it seems this behaviour was reverted by
HBASE 4195 and every next row/next column is now a reseek() and a skip list
lookup rather than being an iterator.

Are there any strong reasons against having the previous behaviour of
scanning a small # of keys before degenerating to a skip list lookup ?
Seems like it would really help for sequential memstore scans and for
memstore gets with wide tables (even 10-20 columns).

Thanks
Varun

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Ted Yu <yu...@gmail.com>.
On a second look, this is the bold line:

org.apache.hadoop.hbase.regionserver.StoreScanner.
next(StoreScanner.java:411)*


On Sun, Jan 26, 2014 at 1:43 PM, Ted Yu <yu...@gmail.com> wrote:

> I don't see bold line.
> Is the following line what you referred to ?
> org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.reseek(
> MemStore.java:805)
>
> BTW I assume you're using 0.94 HBase in production.
>
> Cheers
>
>
> On Sun, Jan 26, 2014 at 1:17 PM, Varun Sharma <va...@pinterest.com> wrote:
>
>> Here is the hotthread result. The line in bold is "SEEK_NEXT_COL" and it
>> degenerates to a reseek+skiplist lookup when it does not have to because
>> we
>> already seeked to the row and getting all the columns is just a matter of
>> iterator.next() calls.
>>
>>
>>
>> org.apache.hadoop.hbase.KeyValue$KeyComparator.compareWithoutRow(KeyValue.java:2196)
>>
>> org.apache.hadoop.hbase.KeyValue$KeyComparator.compare(KeyValue.java:2086)
>>
>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1535)
>>
>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1523)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap.findNear(ConcurrentSkipListMap.java:1345)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap$SubMap.loNode(ConcurrentSkipListMap.java:2583)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap$SubMap.access$300(ConcurrentSkipListMap.java:2486)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapIter.<init>(ConcurrentSkipListMap.java:3022)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapValueIterator.<init>(ConcurrentSkipListMap.java:3092)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap$SubMap.valueIterator(ConcurrentSkipListMap.java:3002)
>>
>>
>> java.util.concurrent.ConcurrentSkipListMap$Values.iterator(ConcurrentSkipListMap.java:2402)
>>
>>
>> org.apache.hadoop.hbase.regionserver.KeyValueSkipListSet.iterator(KeyValueSkipListSet.java:87)
>>
>>
>> org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.reseek(MemStore.java:805)
>>
>>
>> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)
>>
>>
>> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:320)
>>
>>
>> org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:265)
>>
>>
>> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:545)
>>
>> *
>>
>> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:411)*
>>
>> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:143)
>>
>>
>> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3884)
>>
>>
>> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3956)
>>
>>
>> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3827)
>>
>>
>> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3808)
>>
>>
>> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3851)
>>     org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4777)
>>     org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4750)
>>
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2152)
>>
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3700)
>>
>>
>>
>> On Sun, Jan 26, 2014 at 1:14 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> > Hi,
>> >
>> > We are seeing some unfortunately low performance in the memstore - we
>> have
>> > researched some of the previous JIRA(s) and seen some inefficiencies in
>> the
>> > ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu
>> at
>> > weird points in time - the bug is hard to reproduce and there isn't
>> like a
>> > huge # of extra reads going to that region server or any substantial
>> > hotspot happening. The region server recovers the moment, we flush the
>> > memstores or restart the region server. Our queries retrieve wide rows
>> > which are upto 10-20 columns. A stack trace shows two things:
>> >
>> > 1) Time spent inside MemstoreScanner.reseek() and inside the
>> > ConcurrentSkipListMap
>> > 2) The reseek() is being called at the "SEEK_NEXT" column inside
>> > StoreScanner - this is understandable since the rows contain many
>> columns
>> > and StoreScanner iterates one KeyValue at a time.
>> >
>> > So, I was looking at the code and it seems that every single time there
>> is
>> > a reseek call on the same memstore scanner, we make a fresh call to
>> build
>> > an iterator() on the skip list set - this means we an additional skip
>> list
>> > lookup for every column retrieved. SkipList lookups are O(n) and not
>> O(1).
>> >
>> > Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
>> number
>> > if exceeded, do a lookup. However, it seems this behaviour was reverted
>> by
>> > HBASE 4195 and every next row/next column is now a reseek() and a skip
>> list
>> > lookup rather than being an iterator.
>> >
>> > Are there any strong reasons against having the previous behaviour of
>> > scanning a small # of keys before degenerating to a skip list lookup ?
>> > Seems like it would really help for sequential memstore scans and for
>> > memstore gets with wide tables (even 10-20 columns).
>> >
>> > Thanks
>> > Varun
>> >
>>
>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Ted Yu <yu...@gmail.com>.
I don't see bold line.
Is the following line what you referred to ?
org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.reseek(
MemStore.java:805)

BTW I assume you're using 0.94 HBase in production.

Cheers


On Sun, Jan 26, 2014 at 1:17 PM, Varun Sharma <va...@pinterest.com> wrote:

> Here is the hotthread result. The line in bold is "SEEK_NEXT_COL" and it
> degenerates to a reseek+skiplist lookup when it does not have to because we
> already seeked to the row and getting all the columns is just a matter of
> iterator.next() calls.
>
>
>
> org.apache.hadoop.hbase.KeyValue$KeyComparator.compareWithoutRow(KeyValue.java:2196)
>
> org.apache.hadoop.hbase.KeyValue$KeyComparator.compare(KeyValue.java:2086)
>
> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1535)
>
> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1523)
>
>
> java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
>
>
> java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
>
>
> java.util.concurrent.ConcurrentSkipListMap.findNear(ConcurrentSkipListMap.java:1345)
>
>
> java.util.concurrent.ConcurrentSkipListMap$SubMap.loNode(ConcurrentSkipListMap.java:2583)
>
>
> java.util.concurrent.ConcurrentSkipListMap$SubMap.access$300(ConcurrentSkipListMap.java:2486)
>
>
> java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapIter.<init>(ConcurrentSkipListMap.java:3022)
>
>
> java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapValueIterator.<init>(ConcurrentSkipListMap.java:3092)
>
>
> java.util.concurrent.ConcurrentSkipListMap$SubMap.valueIterator(ConcurrentSkipListMap.java:3002)
>
>
> java.util.concurrent.ConcurrentSkipListMap$Values.iterator(ConcurrentSkipListMap.java:2402)
>
>
> org.apache.hadoop.hbase.regionserver.KeyValueSkipListSet.iterator(KeyValueSkipListSet.java:87)
>
>
> org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.reseek(MemStore.java:805)
>
>
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)
>
>
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:320)
>
>
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:265)
>
>
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:545)
>
> *
>
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:411)*
>
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:143)
>
>
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3884)
>
>
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3956)
>
>
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3827)
>
>
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3808)
>
>
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3851)
>     org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4777)
>     org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4750)
>
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2152)
>
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3700)
>
>
> On Sun, Jan 26, 2014 at 1:14 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi,
> >
> > We are seeing some unfortunately low performance in the memstore - we
> have
> > researched some of the previous JIRA(s) and seen some inefficiencies in
> the
> > ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
> > weird points in time - the bug is hard to reproduce and there isn't like
> a
> > huge # of extra reads going to that region server or any substantial
> > hotspot happening. The region server recovers the moment, we flush the
> > memstores or restart the region server. Our queries retrieve wide rows
> > which are upto 10-20 columns. A stack trace shows two things:
> >
> > 1) Time spent inside MemstoreScanner.reseek() and inside the
> > ConcurrentSkipListMap
> > 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > StoreScanner - this is understandable since the rows contain many columns
> > and StoreScanner iterates one KeyValue at a time.
> >
> > So, I was looking at the code and it seems that every single time there
> is
> > a reseek call on the same memstore scanner, we make a fresh call to build
> > an iterator() on the skip list set - this means we an additional skip
> list
> > lookup for every column retrieved. SkipList lookups are O(n) and not
> O(1).
> >
> > Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> number
> > if exceeded, do a lookup. However, it seems this behaviour was reverted
> by
> > HBASE 4195 and every next row/next column is now a reseek() and a skip
> list
> > lookup rather than being an iterator.
> >
> > Are there any strong reasons against having the previous behaviour of
> > scanning a small # of keys before degenerating to a skip list lookup ?
> > Seems like it would really help for sequential memstore scans and for
> > memstore gets with wide tables (even 10-20 columns).
> >
> > Thanks
> > Varun
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Here is the hotthread result. The line in bold is "SEEK_NEXT_COL" and it
degenerates to a reseek+skiplist lookup when it does not have to because we
already seeked to the row and getting all the columns is just a matter of
iterator.next() calls.


org.apache.hadoop.hbase.KeyValue$KeyComparator.compareWithoutRow(KeyValue.java:2196)

org.apache.hadoop.hbase.KeyValue$KeyComparator.compare(KeyValue.java:2086)

org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1535)

org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1523)

java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)

java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)

java.util.concurrent.ConcurrentSkipListMap.findNear(ConcurrentSkipListMap.java:1345)

java.util.concurrent.ConcurrentSkipListMap$SubMap.loNode(ConcurrentSkipListMap.java:2583)

java.util.concurrent.ConcurrentSkipListMap$SubMap.access$300(ConcurrentSkipListMap.java:2486)

java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapIter.<init>(ConcurrentSkipListMap.java:3022)

java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapValueIterator.<init>(ConcurrentSkipListMap.java:3092)

java.util.concurrent.ConcurrentSkipListMap$SubMap.valueIterator(ConcurrentSkipListMap.java:3002)

java.util.concurrent.ConcurrentSkipListMap$Values.iterator(ConcurrentSkipListMap.java:2402)

org.apache.hadoop.hbase.regionserver.KeyValueSkipListSet.iterator(KeyValueSkipListSet.java:87)

org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.reseek(MemStore.java:805)

org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)

org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:320)

org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:265)

org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:545)

*
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:411)*
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:143)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3884)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3956)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3827)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3808)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3851)
    org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4777)
    org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4750)

org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2152)

org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3700)


On Sun, Jan 26, 2014 at 1:14 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi,
>
> We are seeing some unfortunately low performance in the memstore - we have
> researched some of the previous JIRA(s) and seen some inefficiencies in the
> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
> weird points in time - the bug is hard to reproduce and there isn't like a
> huge # of extra reads going to that region server or any substantial
> hotspot happening. The region server recovers the moment, we flush the
> memstores or restart the region server. Our queries retrieve wide rows
> which are upto 10-20 columns. A stack trace shows two things:
>
> 1) Time spent inside MemstoreScanner.reseek() and inside the
> ConcurrentSkipListMap
> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> StoreScanner - this is understandable since the rows contain many columns
> and StoreScanner iterates one KeyValue at a time.
>
> So, I was looking at the code and it seems that every single time there is
> a reseek call on the same memstore scanner, we make a fresh call to build
> an iterator() on the skip list set - this means we an additional skip list
> lookup for every column retrieved. SkipList lookups are O(n) and not O(1).
>
> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that number
> if exceeded, do a lookup. However, it seems this behaviour was reverted by
> HBASE 4195 and every next row/next column is now a reseek() and a skip list
> lookup rather than being an iterator.
>
> Are there any strong reasons against having the previous behaviour of
> scanning a small # of keys before degenerating to a skip list lookup ?
> Seems like it would really help for sequential memstore scans and for
> memstore gets with wide tables (even 10-20 columns).
>
> Thanks
> Varun
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Here is the hotthread result. The line in bold is "SEEK_NEXT_COL" and it
degenerates to a reseek+skiplist lookup when it does not have to because we
already seeked to the row and getting all the columns is just a matter of
iterator.next() calls.


org.apache.hadoop.hbase.KeyValue$KeyComparator.compareWithoutRow(KeyValue.java:2196)

org.apache.hadoop.hbase.KeyValue$KeyComparator.compare(KeyValue.java:2086)

org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1535)

org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1523)

java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)

java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)

java.util.concurrent.ConcurrentSkipListMap.findNear(ConcurrentSkipListMap.java:1345)

java.util.concurrent.ConcurrentSkipListMap$SubMap.loNode(ConcurrentSkipListMap.java:2583)

java.util.concurrent.ConcurrentSkipListMap$SubMap.access$300(ConcurrentSkipListMap.java:2486)

java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapIter.<init>(ConcurrentSkipListMap.java:3022)

java.util.concurrent.ConcurrentSkipListMap$SubMap$SubMapValueIterator.<init>(ConcurrentSkipListMap.java:3092)

java.util.concurrent.ConcurrentSkipListMap$SubMap.valueIterator(ConcurrentSkipListMap.java:3002)

java.util.concurrent.ConcurrentSkipListMap$Values.iterator(ConcurrentSkipListMap.java:2402)

org.apache.hadoop.hbase.regionserver.KeyValueSkipListSet.iterator(KeyValueSkipListSet.java:87)

org.apache.hadoop.hbase.regionserver.MemStore$MemStoreScanner.reseek(MemStore.java:805)

org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:54)

org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:320)

org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:265)

org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:545)

*
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:411)*
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:143)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3884)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3956)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3827)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3808)

org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3851)
    org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4777)
    org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4750)

org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2152)

org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3700)


On Sun, Jan 26, 2014 at 1:14 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi,
>
> We are seeing some unfortunately low performance in the memstore - we have
> researched some of the previous JIRA(s) and seen some inefficiencies in the
> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
> weird points in time - the bug is hard to reproduce and there isn't like a
> huge # of extra reads going to that region server or any substantial
> hotspot happening. The region server recovers the moment, we flush the
> memstores or restart the region server. Our queries retrieve wide rows
> which are upto 10-20 columns. A stack trace shows two things:
>
> 1) Time spent inside MemstoreScanner.reseek() and inside the
> ConcurrentSkipListMap
> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> StoreScanner - this is understandable since the rows contain many columns
> and StoreScanner iterates one KeyValue at a time.
>
> So, I was looking at the code and it seems that every single time there is
> a reseek call on the same memstore scanner, we make a fresh call to build
> an iterator() on the skip list set - this means we an additional skip list
> lookup for every column retrieved. SkipList lookups are O(n) and not O(1).
>
> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that number
> if exceeded, do a lookup. However, it seems this behaviour was reverted by
> HBASE 4195 and every next row/next column is now a reseek() and a skip list
> lookup rather than being an iterator.
>
> Are there any strong reasons against having the previous behaviour of
> scanning a small # of keys before degenerating to a skip list lookup ?
> Seems like it would really help for sequential memstore scans and for
> memstore gets with wide tables (even 10-20 columns).
>
> Thanks
> Varun
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by lars hofhansl <la...@apache.org>.
The confusion comes from the delete marker placement.

When you mark all columns for deletion a "family" delete marker is placed and will always sorted at the *beginning* the row, before all other KV for the row in that column family. (Note that a row is marked for delete by placing a family delete marker in each column family)


The logic is that scanner can determine what KVs have been marked for deletion while only scanning forward through the hfiles/memstores.
A family delete marker marks all versions of all columns for deletion so it has to come first.
That also means that any get/scan request has to first seek to the beginning of the row to see if there is a delete marker.


You example would look like this in the end:
(ROW, <DELETE_FAMILY>, T=2)
(ROW, COL1, T=3)
(ROW, COL2, T=3)
(ROW, COL3, T=3)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)


Note that there are two other delete marker types: column and version. These are sorted according to their versions right before the columns/versions they affect.


Does this blog post help: htthp://hadoop-hbase.blogspot.com/2011/12/deletion-in-hbase.html ?

-- Lars



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Cc: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Tuesday, January 28, 2014 9:04 AM
Subject: Re: Sporadic memstore slowness for Read Heavy workloads
 

Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
because COL2 > COL1 lexicographically. However in the above example, it
comes before the delete marker and hence before (ROW, COL1, T=1) which is
wrong, no ?



On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. Now, clearly there will be columns above the delete marker which are
> smaller than the ones below it.
>
> This is where closer look is needed. Part of the confusion arises from
> usage of > and < in your example.
> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>
> Here, in terms of sort order, 'above' means before. 'below it' would mean
> after. So 'smaller' would mean before.
>
> Cheers
>
>
> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi Ted,
> >
> > Not satisfied with your answer, the document you sent does not talk about
> > Delete ColumnFamily marker sort order. For the delete family marker to
> > work, it has to mask *all* columns of a family. Hence it has to be above
> > all the older columns. All the new columns must come above this column
> > family delete marker. Now, clearly there will be columns above the delete
> > marker which are smaller than the ones below it.
> >
> > The document talks nothing about delete marker order, could you answer
> the
> > question by looking through the example ?
> >
> > Varun
> >
> >
> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Varun:
> > > Take a look at http://hbase.apache.org/book.html#dm.sort
> > >
> > > There's no contradiction.
> > >
> > > Cheers
> > >
> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> > >
> > > > Actually, I now have another question because of the way our work
> load
> > is
> > > > structured. We use a wide schema and each time we write, we delete
> the
> > > > entire row and write a fresh set of columns - we want to make sure no
> > old
> > > > columns survive. So, I just want to see if my picture of the memstore
> > at
> > > > this point is correct or not. My understanding is that Memstore is
> > > > basically a skip list of keyvalues and compares the values using
> > KeyValue
> > > > comparator
> > > >
> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> > > >
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > > > memstore has - my understanding is that we do not delete in the
> > memstore
> > > > but only add markers:
> > > >
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 3) Now we write our new fresh row for *T=3* - it should get inserted
> > > above
> > > > the delete.
> > > >
> > > > (ROW, COL1, T=3)
> > > > (ROW, COL2, T=3)
> > > > (ROW, COL3, T=3)
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > This is the ideal scenario for the data to be correctly reflected.
> > > >
> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> > hence,
> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> > > >
> > > > But, we also know that KeyValues compare first by ROW, then by Column
> > and
> > > > then by timestamp in reverse order
> > > >
> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> > > >
> > > > This seems to be contradicting and my main worry is that in a skip
> > list,
> > > it
> > > > is quite possible for skipping to happen as you go through the high
> > level
> > > > express lanes and it could be possible for one of these entries to
> > never
> > > > actually even see the delete marker. For example consider the case
> > above
> > > > where entry #1 and entry #5 form the higher level of the skip list
> and
> > > the
> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
> > and
> > > it
> > > > could end up in the wrong location.
> > > >
> > > > Obviously, if we cleanse all the row contents when a get a ROW level
> > > delete
> > > > marker, we are fine but I want to know if that is the case. If we are
> > not
> > > > really cleansing all the row contents when we get a ROW level delete
> > > > marker, then I want to know why the above scenario can not lead to
> bugs
> > > :)
> > > >
> > > > Varun
> > > >
> > > >
> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > > > <vl...@gmail.com>wrote:
> > > >
> > > >> Varun,
> > > >>
> > > >> There is no need to open new JIRA - there are two already:
> > > >> https://issues.apache.org/jira/browse/HBASE-9769
> > > >> https://issues.apache.org/jira/browse/HBASE-9778
> > > >>
> > > >> Both with patches, you can grab and test them.
> > > >>
> > > >> -Vladimir
> > > >>
> > > >>
> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> > > wrote:
> > > >>
> > > >>> Hi lars,
> > > >>>
> > > >>> Thanks for the background. It seems that for our case, we will have
> > to
> > > >>> consider some solution like the Facebook one, since the next column
> > is
> > > >>> always the next one - this can be a simple flag. I am going to
> raise
> > a
> > > >> JIRA
> > > >>> and we can discuss there.
> > > >>>
> > > >>> Thanks
> > > >>> Varun
> > > >>>
> > > >>>
> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> > > wrote:
> > > >>>
> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> chime
> > in
> > > >>>> soon. :)
> > > >>>>
> > > >>>> Reseek is expensive compared to next if next would get us the KV
> > we're
> > > >>>> looking for. However, HBase does not know that ahead of time.
> There
> > > >> might
> > > >>>> be a 1000 versions of the previous KV to be skipped first.
> > > >>>> HBase seeks in three situation:
> > > >>>> 1. Seek to the next column (there might be a lot of versions to
> > skip)
> > > >>>> 2. Seek to the next row (there might be a lot of versions and
> other
> > > >>>> columns to skip)
> > > >>>> 3. Seek to a row via a hint
> > > >>>>
> > > >>>> #3 is definitely useful, with that one can implement very
> efficient
> > > >> "skip
> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > > >>>> #2 is helpful if there are many columns and one only "selects" a
> few
> > > >> (and
> > > >>>> of course also if there are many versions of columns)
> > > >>>> #1 is only helpful when we expect there to be many versions. Or of
> > the
> > > >>>> size of a typical KV aproaches the block size, since then we'd
> need
> > to
> > > >>> seek
> > > >>>> to the find the next block anyway.
> > > >>>>
> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns or
> is
> > > >> that
> > > >>>> just the number of column you return?
> > > >>>>
> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> > instruct
> > > >>> the
> > > >>>> scanner to not seek to the next column or the next row, but just
> > issue
> > > >>>> next()'s until the KV is found. Another suggested approach (I
> think
> > by
> > > >>> the
> > > >>>> Facebook guys) was to issue next() opportunistically a few times,
> > and
> > > >>> only
> > > >>>> when that did not get us to ther requested KV issue a reseek.
> > > >>>> I have also thought of a near/far designation of seeks. For near
> > seeks
> > > >>>> we'd do a configurable number of next()'s first, then seek.
> > > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> > > >>>>
> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > > >>>>
> > > >>>> I'll look at the trace a bit closers.
> > > >>>> So far my scan profiling has been focused on data in the
> blockcache
> > > >> since
> > > >>>> in the normal case the vast majority of all data is found there
> and
> > > >> only
> > > >>>> recent changes are in the memstore.
> > > >>>>
> > > >>>> -- Lars
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> ________________________________
> > > >>>> From: Varun Sharma <va...@pinterest.com>
> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > > >>> dev@hbase.apache.org"
> > > >>>> <de...@hbase.apache.org>
> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> > > >>>>
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> We are seeing some unfortunately low performance in the memstore -
> > we
> > > >>> have
> > > >>>> researched some of the previous JIRA(s) and seen some
> inefficiencies
> > > in
> > > >>> the
> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
> > cpu
> > > >> at
> > > >>>> weird points in time - the bug is hard to reproduce and there
> isn't
> > > >> like
> > > >>> a
> > > >>>> huge # of extra reads going to that region server or any
> substantial
> > > >>>> hotspot happening. The region server recovers the moment, we flush
> > the
> > > >>>> memstores or restart the region server. Our queries retrieve wide
> > rows
> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
> > > >>>>
> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> > > >>>> ConcurrentSkipListMap
> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > > >>>> StoreScanner - this is understandable since the rows contain many
> > > >> columns
> > > >>>> and StoreScanner iterates one KeyValue at a time.
> > > >>>>
> > > >>>> So, I was looking at the code and it seems that every single time
> > > there
> > > >>> is
> > > >>>> a reseek call on the same memstore scanner, we make a fresh call
> to
> > > >> build
> > > >>>> an iterator() on the skip list set - this means we an additional
> > skip
> > > >>> list
> > > >>>> lookup for every column retrieved. SkipList lookups are O(n) and
> not
> > > >>> O(1).
> > > >>>>
> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
> that
> > > >>> number
> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> > > reverted
> > > >>> by
> > > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
> > skip
> > > >>> list
> > > >>>> lookup rather than being an iterator.
> > > >>>>
> > > >>>> Are there any strong reasons against having the previous behaviour
> > of
> > > >>>> scanning a small # of keys before degenerating to a skip list
> > lookup ?
> > > >>>> Seems like it would really help for sequential memstore scans and
> > for
> > > >>>> memstore gets with wide tables (even 10-20 columns).
> > > >>>>
> > > >>>> Thanks
> > > >>>> Varun
> > > >>
> > >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by lars hofhansl <la...@apache.org>.
I see you figured it out. I should read all email before I sent my last reply.



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Cc: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Tuesday, January 28, 2014 9:43 AM
Subject: Re: Sporadic memstore slowness for Read Heavy workloads
 


Ohk I think I understand this better now. So the order will actually be, something like this, at step #3

(ROW, <DELETE>, T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1)  - filtered

(ROW, COL2, T=3)
(ROW, COL2, T=1)  - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1)  - filtered

The ScanDeleteTracker class would simply filter out columns which have a timestamp < 2.

Varun



On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <va...@pinterest.com> wrote:

Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1) because COL2 > COL1 lexicographically. However in the above example, it comes before the delete marker and hence before (ROW, COL1, T=1) which is wrong, no ?
>
>
>
>On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:
>
>bq. Now, clearly there will be columns above the delete marker which are
>>
>>smaller than the ones below it.
>>
>>This is where closer look is needed. Part of the confusion arises from
>>usage of > and < in your example.
>>(ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>>Here, in terms of sort order, 'above' means before. 'below it' would mean
>>after. So 'smaller' would mean before.
>>
>>Cheers
>>
>>
>>
>>On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>> Hi Ted,
>>>
>>> Not satisfied with your answer, the document you sent does not talk about
>>> Delete ColumnFamily marker sort order. For the delete family marker to
>>> work, it has to mask *all* columns of a family. Hence it has to be above
>>> all the older columns. All the new columns must come above this column
>>> family delete marker. Now, clearly there will be columns above the delete
>>> marker which are smaller than the ones below it.
>>>
>>> The document talks nothing about delete marker order, could you answer the
>>> question by looking through the example ?
>>>
>>> Varun
>>>
>>>
>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>> > Varun:
>>> > Take a look at http://hbase.apache.org/book.html#dm.sort
>>> >
>>> > There's no contradiction.
>>> >
>>> > Cheers
>>> >
>>> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>>> >
>>> > > Actually, I now have another question because of the way our work load
>>> is
>>> > > structured. We use a wide schema and each time we write, we delete the
>>> > > entire row and write a fresh set of columns - we want to make sure no
>>> old
>>> > > columns survive. So, I just want to see if my picture of the memstore
>>> at
>>> > > this point is correct or not. My understanding is that Memstore is
>>> > > basically a skip list of keyvalues and compares the values using
>>> KeyValue
>>> > > comparator
>>> > >
>>> > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
>>> > >
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
>>> > > memstore has - my understanding is that we do not delete in the
>>> memstore
>>> > > but only add markers:
>>> > >
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 3) Now we write our new fresh row for *T=3* - it should get inserted
>>> > above
>>> > > the delete.
>>> > >
>>> > > (ROW, COL1, T=3)
>>> > > (ROW, COL2, T=3)
>>> > > (ROW, COL3, T=3)
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > This is the ideal scenario for the data to be correctly reflected.
>>> > >
>>> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
>>> hence,
>>> > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>>> > >
>>> > > But, we also know that KeyValues compare first by ROW, then by Column
>>> and
>>> > > then by timestamp in reverse order
>>> > >
>>> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
>>> > >
>>> > > This seems to be contradicting and my main worry is that in a skip
>>> list,
>>> > it
>>> > > is quite possible for skipping to happen as you go through the high
>>> level
>>> > > express lanes and it could be possible for one of these entries to
>>> never
>>> > > actually even see the delete marker. For example consider the case
>>> above
>>> > > where entry #1 and entry #5 form the higher level of the skip list and
>>> > the
>>> > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
>>> and
>>> > it
>>> > > could end up in the wrong location.
>>> > >
>>> > > Obviously, if we cleanse all the row contents when a get a ROW level
>>> > delete
>>> > > marker, we are fine but I want to know if that is the case. If we are
>>> not
>>> > > really cleansing all the row contents when we get a ROW level delete
>>> > > marker, then I want to know why the above scenario can not lead to bugs
>>> > :)
>>> > >
>>> > > Varun
>>> > >
>>> > >
>>> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
>>> > > <vl...@gmail.com>wrote:
>>> > >
>>> > >> Varun,
>>> > >>
>>> > >> There is no need to open new JIRA - there are two already:
>>> > >> https://issues.apache.org/jira/browse/HBASE-9769
>>> > >> https://issues.apache.org/jira/browse/HBASE-9778
>>> > >>
>>> > >> Both with patches, you can grab and test them.
>>> > >>
>>> > >> -Vladimir
>>> > >>
>>> > >>
>>> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
>>> > wrote:
>>> > >>
>>> > >>> Hi lars,
>>> > >>>
>>> > >>> Thanks for the background. It seems that for our case, we will have
>>> to
>>> > >>> consider some solution like the Facebook one, since the next column
>>> is
>>> > >>> always the next one - this can be a simple flag. I am going to raise
>>> a
>>> > >> JIRA
>>> > >>> and we can discuss there.
>>> > >>>
>>> > >>> Thanks
>>> > >>> Varun
>>> > >>>
>>> > >>>
>>> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
>>> > wrote:
>>> > >>>
>>> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
>>> in
>>> > >>>> soon. :)
>>> > >>>>
>>> > >>>> Reseek is expensive compared to next if next would get us the KV
>>> we're
>>> > >>>> looking for. However, HBase does not know that ahead of time. There
>>> > >> might
>>> > >>>> be a 1000 versions of the previous KV to be skipped first.
>>> > >>>> HBase seeks in three situation:
>>> > >>>> 1. Seek to the next column (there might be a lot of versions to
>>> skip)
>>> > >>>> 2. Seek to the next row (there might be a lot of versions and other
>>> > >>>> columns to skip)
>>> > >>>> 3. Seek to a row via a hint
>>> > >>>>
>>> > >>>> #3 is definitely useful, with that one can implement very efficient
>>> > >> "skip
>>> > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>> > >>>> #2 is helpful if there are many columns and one only "selects" a few
>>> > >> (and
>>> > >>>> of course also if there are many versions of columns)
>>> > >>>> #1 is only helpful when we expect there to be many versions. Or of
>>> the
>>> > >>>> size of a typical KV aproaches the block size, since then we'd need
>>> to
>>> > >>> seek
>>> > >>>> to the find the next block anyway.
>>> > >>>>
>>> > >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
>>> > >> that
>>> > >>>> just the number of column you return?
>>> > >>>>
>>> > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
>>> instruct
>>> > >>> the
>>> > >>>> scanner to not seek to the next column or the next row, but just
>>> issue
>>> > >>>> next()'s until the KV is found. Another suggested approach (I think
>>> by
>>> > >>> the
>>> > >>>> Facebook guys) was to issue next() opportunistically a few times,
>>> and
>>> > >>> only
>>> > >>>> when that did not get us to ther requested KV issue a reseek.
>>> > >>>> I have also thought of a near/far designation of seeks. For near
>>> seeks
>>> > >>>> we'd do a configurable number of next()'s first, then seek.
>>> > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
>>> > >>>>
>>> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>>> > >>>>
>>> > >>>> I'll look at the trace a bit closers.
>>> > >>>> So far my scan profiling has been focused on data in the blockcache
>>> > >> since
>>> > >>>> in the normal case the vast majority of all data is found there and
>>> > >> only
>>> > >>>> recent changes are in the memstore.
>>> > >>>>
>>> > >>>> -- Lars
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> ________________________________
>>> > >>>> From: Varun Sharma <va...@pinterest.com>
>>> > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
>>> > >>> dev@hbase.apache.org"
>>> > >>>> <de...@hbase.apache.org>
>>> > >>>> Sent: Sunday, January 26, 2014 1:14 PM
>>> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>> > >>>>
>>> > >>>>
>>> > >>>> Hi,
>>> > >>>>
>>> > >>>> We are seeing some unfortunately low performance in the memstore -
>>> we
>>> > >>> have
>>> > >>>> researched some of the previous JIRA(s) and seen some inefficiencies
>>> > in
>>> > >>> the
>>> > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
>>> cpu
>>> > >> at
>>> > >>>> weird points in time - the bug is hard to reproduce and there isn't
>>> > >> like
>>> > >>> a
>>> > >>>> huge # of extra reads going to that region server or any substantial
>>> > >>>> hotspot happening. The region server recovers the moment, we flush
>>> the
>>> > >>>> memstores or restart the region server. Our queries retrieve wide
>>> rows
>>> > >>>> which are upto 10-20 columns. A stack trace shows two things:
>>> > >>>>
>>> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>> > >>>> ConcurrentSkipListMap
>>> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>> > >>>> StoreScanner - this is understandable since the rows contain many
>>> > >> columns
>>> > >>>> and StoreScanner iterates one KeyValue at a time.
>>> > >>>>
>>> > >>>> So, I was looking at the code and it seems that every single time
>>> > there
>>> > >>> is
>>> > >>>> a reseek call on the same memstore scanner, we make a fresh call to
>>> > >> build
>>> > >>>> an iterator() on the skip list set - this means we an additional
>>> skip
>>> > >>> list
>>> > >>>> lookup for every column retrieved. SkipList lookups are O(n) and not
>>> > >>> O(1).
>>> > >>>>
>>> > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
>>> > >>> number
>>> > >>>> if exceeded, do a lookup. However, it seems this behaviour was
>>> > reverted
>>> > >>> by
>>> > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
>>> skip
>>> > >>> list
>>> > >>>> lookup rather than being an iterator.
>>> > >>>>
>>> > >>>> Are there any strong reasons against having the previous behaviour
>>> of
>>> > >>>> scanning a small # of keys before degenerating to a skip list
>>> lookup ?
>>> > >>>> Seems like it would really help for sequential memstore scans and
>>> for
>>> > >>>> memstore gets with wide tables (even 10-20 columns).
>>> > >>>>
>>> > >>>> Thanks
>>> > >>>> Varun
>>> > >>
>>> >
>>>
>>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by lars hofhansl <la...@apache.org>.
I see you figured it out. I should read all email before I sent my last reply.



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Cc: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Tuesday, January 28, 2014 9:43 AM
Subject: Re: Sporadic memstore slowness for Read Heavy workloads
 


Ohk I think I understand this better now. So the order will actually be, something like this, at step #3

(ROW, <DELETE>, T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1)  - filtered

(ROW, COL2, T=3)
(ROW, COL2, T=1)  - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1)  - filtered

The ScanDeleteTracker class would simply filter out columns which have a timestamp < 2.

Varun



On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <va...@pinterest.com> wrote:

Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1) because COL2 > COL1 lexicographically. However in the above example, it comes before the delete marker and hence before (ROW, COL1, T=1) which is wrong, no ?
>
>
>
>On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:
>
>bq. Now, clearly there will be columns above the delete marker which are
>>
>>smaller than the ones below it.
>>
>>This is where closer look is needed. Part of the confusion arises from
>>usage of > and < in your example.
>>(ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>>Here, in terms of sort order, 'above' means before. 'below it' would mean
>>after. So 'smaller' would mean before.
>>
>>Cheers
>>
>>
>>
>>On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>> Hi Ted,
>>>
>>> Not satisfied with your answer, the document you sent does not talk about
>>> Delete ColumnFamily marker sort order. For the delete family marker to
>>> work, it has to mask *all* columns of a family. Hence it has to be above
>>> all the older columns. All the new columns must come above this column
>>> family delete marker. Now, clearly there will be columns above the delete
>>> marker which are smaller than the ones below it.
>>>
>>> The document talks nothing about delete marker order, could you answer the
>>> question by looking through the example ?
>>>
>>> Varun
>>>
>>>
>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>> > Varun:
>>> > Take a look at http://hbase.apache.org/book.html#dm.sort
>>> >
>>> > There's no contradiction.
>>> >
>>> > Cheers
>>> >
>>> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>>> >
>>> > > Actually, I now have another question because of the way our work load
>>> is
>>> > > structured. We use a wide schema and each time we write, we delete the
>>> > > entire row and write a fresh set of columns - we want to make sure no
>>> old
>>> > > columns survive. So, I just want to see if my picture of the memstore
>>> at
>>> > > this point is correct or not. My understanding is that Memstore is
>>> > > basically a skip list of keyvalues and compares the values using
>>> KeyValue
>>> > > comparator
>>> > >
>>> > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
>>> > >
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
>>> > > memstore has - my understanding is that we do not delete in the
>>> memstore
>>> > > but only add markers:
>>> > >
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 3) Now we write our new fresh row for *T=3* - it should get inserted
>>> > above
>>> > > the delete.
>>> > >
>>> > > (ROW, COL1, T=3)
>>> > > (ROW, COL2, T=3)
>>> > > (ROW, COL3, T=3)
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > This is the ideal scenario for the data to be correctly reflected.
>>> > >
>>> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
>>> hence,
>>> > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>>> > >
>>> > > But, we also know that KeyValues compare first by ROW, then by Column
>>> and
>>> > > then by timestamp in reverse order
>>> > >
>>> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
>>> > >
>>> > > This seems to be contradicting and my main worry is that in a skip
>>> list,
>>> > it
>>> > > is quite possible for skipping to happen as you go through the high
>>> level
>>> > > express lanes and it could be possible for one of these entries to
>>> never
>>> > > actually even see the delete marker. For example consider the case
>>> above
>>> > > where entry #1 and entry #5 form the higher level of the skip list and
>>> > the
>>> > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
>>> and
>>> > it
>>> > > could end up in the wrong location.
>>> > >
>>> > > Obviously, if we cleanse all the row contents when a get a ROW level
>>> > delete
>>> > > marker, we are fine but I want to know if that is the case. If we are
>>> not
>>> > > really cleansing all the row contents when we get a ROW level delete
>>> > > marker, then I want to know why the above scenario can not lead to bugs
>>> > :)
>>> > >
>>> > > Varun
>>> > >
>>> > >
>>> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
>>> > > <vl...@gmail.com>wrote:
>>> > >
>>> > >> Varun,
>>> > >>
>>> > >> There is no need to open new JIRA - there are two already:
>>> > >> https://issues.apache.org/jira/browse/HBASE-9769
>>> > >> https://issues.apache.org/jira/browse/HBASE-9778
>>> > >>
>>> > >> Both with patches, you can grab and test them.
>>> > >>
>>> > >> -Vladimir
>>> > >>
>>> > >>
>>> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
>>> > wrote:
>>> > >>
>>> > >>> Hi lars,
>>> > >>>
>>> > >>> Thanks for the background. It seems that for our case, we will have
>>> to
>>> > >>> consider some solution like the Facebook one, since the next column
>>> is
>>> > >>> always the next one - this can be a simple flag. I am going to raise
>>> a
>>> > >> JIRA
>>> > >>> and we can discuss there.
>>> > >>>
>>> > >>> Thanks
>>> > >>> Varun
>>> > >>>
>>> > >>>
>>> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
>>> > wrote:
>>> > >>>
>>> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
>>> in
>>> > >>>> soon. :)
>>> > >>>>
>>> > >>>> Reseek is expensive compared to next if next would get us the KV
>>> we're
>>> > >>>> looking for. However, HBase does not know that ahead of time. There
>>> > >> might
>>> > >>>> be a 1000 versions of the previous KV to be skipped first.
>>> > >>>> HBase seeks in three situation:
>>> > >>>> 1. Seek to the next column (there might be a lot of versions to
>>> skip)
>>> > >>>> 2. Seek to the next row (there might be a lot of versions and other
>>> > >>>> columns to skip)
>>> > >>>> 3. Seek to a row via a hint
>>> > >>>>
>>> > >>>> #3 is definitely useful, with that one can implement very efficient
>>> > >> "skip
>>> > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>> > >>>> #2 is helpful if there are many columns and one only "selects" a few
>>> > >> (and
>>> > >>>> of course also if there are many versions of columns)
>>> > >>>> #1 is only helpful when we expect there to be many versions. Or of
>>> the
>>> > >>>> size of a typical KV aproaches the block size, since then we'd need
>>> to
>>> > >>> seek
>>> > >>>> to the find the next block anyway.
>>> > >>>>
>>> > >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
>>> > >> that
>>> > >>>> just the number of column you return?
>>> > >>>>
>>> > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
>>> instruct
>>> > >>> the
>>> > >>>> scanner to not seek to the next column or the next row, but just
>>> issue
>>> > >>>> next()'s until the KV is found. Another suggested approach (I think
>>> by
>>> > >>> the
>>> > >>>> Facebook guys) was to issue next() opportunistically a few times,
>>> and
>>> > >>> only
>>> > >>>> when that did not get us to ther requested KV issue a reseek.
>>> > >>>> I have also thought of a near/far designation of seeks. For near
>>> seeks
>>> > >>>> we'd do a configurable number of next()'s first, then seek.
>>> > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
>>> > >>>>
>>> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>>> > >>>>
>>> > >>>> I'll look at the trace a bit closers.
>>> > >>>> So far my scan profiling has been focused on data in the blockcache
>>> > >> since
>>> > >>>> in the normal case the vast majority of all data is found there and
>>> > >> only
>>> > >>>> recent changes are in the memstore.
>>> > >>>>
>>> > >>>> -- Lars
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> ________________________________
>>> > >>>> From: Varun Sharma <va...@pinterest.com>
>>> > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
>>> > >>> dev@hbase.apache.org"
>>> > >>>> <de...@hbase.apache.org>
>>> > >>>> Sent: Sunday, January 26, 2014 1:14 PM
>>> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>> > >>>>
>>> > >>>>
>>> > >>>> Hi,
>>> > >>>>
>>> > >>>> We are seeing some unfortunately low performance in the memstore -
>>> we
>>> > >>> have
>>> > >>>> researched some of the previous JIRA(s) and seen some inefficiencies
>>> > in
>>> > >>> the
>>> > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
>>> cpu
>>> > >> at
>>> > >>>> weird points in time - the bug is hard to reproduce and there isn't
>>> > >> like
>>> > >>> a
>>> > >>>> huge # of extra reads going to that region server or any substantial
>>> > >>>> hotspot happening. The region server recovers the moment, we flush
>>> the
>>> > >>>> memstores or restart the region server. Our queries retrieve wide
>>> rows
>>> > >>>> which are upto 10-20 columns. A stack trace shows two things:
>>> > >>>>
>>> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>> > >>>> ConcurrentSkipListMap
>>> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>> > >>>> StoreScanner - this is understandable since the rows contain many
>>> > >> columns
>>> > >>>> and StoreScanner iterates one KeyValue at a time.
>>> > >>>>
>>> > >>>> So, I was looking at the code and it seems that every single time
>>> > there
>>> > >>> is
>>> > >>>> a reseek call on the same memstore scanner, we make a fresh call to
>>> > >> build
>>> > >>>> an iterator() on the skip list set - this means we an additional
>>> skip
>>> > >>> list
>>> > >>>> lookup for every column retrieved. SkipList lookups are O(n) and not
>>> > >>> O(1).
>>> > >>>>
>>> > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
>>> > >>> number
>>> > >>>> if exceeded, do a lookup. However, it seems this behaviour was
>>> > reverted
>>> > >>> by
>>> > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
>>> skip
>>> > >>> list
>>> > >>>> lookup rather than being an iterator.
>>> > >>>>
>>> > >>>> Are there any strong reasons against having the previous behaviour
>>> of
>>> > >>>> scanning a small # of keys before degenerating to a skip list
>>> lookup ?
>>> > >>>> Seems like it would really help for sequential memstore scans and
>>> for
>>> > >>>> memstore gets with wide tables (even 10-20 columns).
>>> > >>>>
>>> > >>>> Thanks
>>> > >>>> Varun
>>> > >>
>>> >
>>>
>>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Vladimir Rodionov <vl...@gmail.com>.
Its ScanQueryMatcher-ScanDeleteTracker responsibility to process  deletes
during scanning.


On Tue, Jan 28, 2014 at 9:43 AM, Varun Sharma <va...@pinterest.com> wrote:

> Ohk I think I understand this better now. So the order will actually be,
> something like this, at step #3
>
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=3)
> (ROW, COL1, T=1)  - filtered
> (ROW, COL2, T=3)
> (ROW, COL2, T=1)  - filtered
> (ROW, COL3, T=3)
> (ROW, COL3, T=1)  - filtered
>
> The ScanDeleteTracker class would simply filter out columns which have a
> timestamp < 2.
>
> Varun
>
>
> On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> > because COL2 > COL1 lexicographically. However in the above example, it
> > comes before the delete marker and hence before (ROW, COL1, T=1) which is
> > wrong, no ?
> >
> >
> > On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> bq. Now, clearly there will be columns above the delete marker which are
> >> smaller than the ones below it.
> >>
> >> This is where closer look is needed. Part of the confusion arises from
> >> usage of > and < in your example.
> >> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
> >>
> >> Here, in terms of sort order, 'above' means before. 'below it' would
> mean
> >> after. So 'smaller' would mean before.
> >>
> >> Cheers
> >>
> >>
> >> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > Not satisfied with your answer, the document you sent does not talk
> >> about
> >> > Delete ColumnFamily marker sort order. For the delete family marker to
> >> > work, it has to mask *all* columns of a family. Hence it has to be
> above
> >> > all the older columns. All the new columns must come above this column
> >> > family delete marker. Now, clearly there will be columns above the
> >> delete
> >> > marker which are smaller than the ones below it.
> >> >
> >> > The document talks nothing about delete marker order, could you answer
> >> the
> >> > question by looking through the example ?
> >> >
> >> > Varun
> >> >
> >> >
> >> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
> >> >
> >> > > Varun:
> >> > > Take a look at http://hbase.apache.org/book.html#dm.sort
> >> > >
> >> > > There's no contradiction.
> >> > >
> >> > > Cheers
> >> > >
> >> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >> > >
> >> > > > Actually, I now have another question because of the way our work
> >> load
> >> > is
> >> > > > structured. We use a wide schema and each time we write, we delete
> >> the
> >> > > > entire row and write a fresh set of columns - we want to make sure
> >> no
> >> > old
> >> > > > columns survive. So, I just want to see if my picture of the
> >> memstore
> >> > at
> >> > > > this point is correct or not. My understanding is that Memstore is
> >> > > > basically a skip list of keyvalues and compares the values using
> >> > KeyValue
> >> > > > comparator
> >> > > >
> >> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> >> > > >
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2.
> So
> >> > > > memstore has - my understanding is that we do not delete in the
> >> > memstore
> >> > > > but only add markers:
> >> > > >
> >> > > > (ROW, <DELETE>, T=2)
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > 3) Now we write our new fresh row for *T=3* - it should get
> inserted
> >> > > above
> >> > > > the delete.
> >> > > >
> >> > > > (ROW, COL1, T=3)
> >> > > > (ROW, COL2, T=3)
> >> > > > (ROW, COL3, T=3)
> >> > > > (ROW, <DELETE>, T=2)
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > This is the ideal scenario for the data to be correctly reflected.
> >> > > >
> >> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> >> > hence,
> >> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> >> > > >
> >> > > > But, we also know that KeyValues compare first by ROW, then by
> >> Column
> >> > and
> >> > > > then by timestamp in reverse order
> >> > > >
> >> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> >> > > >
> >> > > > This seems to be contradicting and my main worry is that in a skip
> >> > list,
> >> > > it
> >> > > > is quite possible for skipping to happen as you go through the
> high
> >> > level
> >> > > > express lanes and it could be possible for one of these entries to
> >> > never
> >> > > > actually even see the delete marker. For example consider the case
> >> > above
> >> > > > where entry #1 and entry #5 form the higher level of the skip list
> >> and
> >> > > the
> >> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4,
> T=3)
> >> > and
> >> > > it
> >> > > > could end up in the wrong location.
> >> > > >
> >> > > > Obviously, if we cleanse all the row contents when a get a ROW
> level
> >> > > delete
> >> > > > marker, we are fine but I want to know if that is the case. If we
> >> are
> >> > not
> >> > > > really cleansing all the row contents when we get a ROW level
> delete
> >> > > > marker, then I want to know why the above scenario can not lead to
> >> bugs
> >> > > :)
> >> > > >
> >> > > > Varun
> >> > > >
> >> > > >
> >> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> >> > > > <vl...@gmail.com>wrote:
> >> > > >
> >> > > >> Varun,
> >> > > >>
> >> > > >> There is no need to open new JIRA - there are two already:
> >> > > >> https://issues.apache.org/jira/browse/HBASE-9769
> >> > > >> https://issues.apache.org/jira/browse/HBASE-9778
> >> > > >>
> >> > > >> Both with patches, you can grab and test them.
> >> > > >>
> >> > > >> -Vladimir
> >> > > >>
> >> > > >>
> >> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <
> varun@pinterest.com
> >> >
> >> > > wrote:
> >> > > >>
> >> > > >>> Hi lars,
> >> > > >>>
> >> > > >>> Thanks for the background. It seems that for our case, we will
> >> have
> >> > to
> >> > > >>> consider some solution like the Facebook one, since the next
> >> column
> >> > is
> >> > > >>> always the next one - this can be a simple flag. I am going to
> >> raise
> >> > a
> >> > > >> JIRA
> >> > > >>> and we can discuss there.
> >> > > >>>
> >> > > >>> Thanks
> >> > > >>> Varun
> >> > > >>>
> >> > > >>>
> >> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <
> larsh@apache.org>
> >> > > wrote:
> >> > > >>>
> >> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> >> chime
> >> > in
> >> > > >>>> soon. :)
> >> > > >>>>
> >> > > >>>> Reseek is expensive compared to next if next would get us the
> KV
> >> > we're
> >> > > >>>> looking for. However, HBase does not know that ahead of time.
> >> There
> >> > > >> might
> >> > > >>>> be a 1000 versions of the previous KV to be skipped first.
> >> > > >>>> HBase seeks in three situation:
> >> > > >>>> 1. Seek to the next column (there might be a lot of versions to
> >> > skip)
> >> > > >>>> 2. Seek to the next row (there might be a lot of versions and
> >> other
> >> > > >>>> columns to skip)
> >> > > >>>> 3. Seek to a row via a hint
> >> > > >>>>
> >> > > >>>> #3 is definitely useful, with that one can implement very
> >> efficient
> >> > > >> "skip
> >> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> >> > > >>>> #2 is helpful if there are many columns and one only "selects"
> a
> >> few
> >> > > >> (and
> >> > > >>>> of course also if there are many versions of columns)
> >> > > >>>> #1 is only helpful when we expect there to be many versions. Or
> >> of
> >> > the
> >> > > >>>> size of a typical KV aproaches the block size, since then we'd
> >> need
> >> > to
> >> > > >>> seek
> >> > > >>>> to the find the next block anyway.
> >> > > >>>>
> >> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns
> or
> >> is
> >> > > >> that
> >> > > >>>> just the number of column you return?
> >> > > >>>>
> >> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> >> > instruct
> >> > > >>> the
> >> > > >>>> scanner to not seek to the next column or the next row, but
> just
> >> > issue
> >> > > >>>> next()'s until the KV is found. Another suggested approach (I
> >> think
> >> > by
> >> > > >>> the
> >> > > >>>> Facebook guys) was to issue next() opportunistically a few
> times,
> >> > and
> >> > > >>> only
> >> > > >>>> when that did not get us to ther requested KV issue a reseek.
> >> > > >>>> I have also thought of a near/far designation of seeks. For
> near
> >> > seeks
> >> > > >>>> we'd do a configurable number of next()'s first, then seek.
> >> > > >>>> "near" seeks would be those of category #1 (and maybe #2)
> above.
> >> > > >>>>
> >> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915,
> maybe)
> >> > > >>>>
> >> > > >>>> I'll look at the trace a bit closers.
> >> > > >>>> So far my scan profiling has been focused on data in the
> >> blockcache
> >> > > >> since
> >> > > >>>> in the normal case the vast majority of all data is found there
> >> and
> >> > > >> only
> >> > > >>>> recent changes are in the memstore.
> >> > > >>>>
> >> > > >>>> -- Lars
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> ________________________________
> >> > > >>>> From: Varun Sharma <va...@pinterest.com>
> >> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> >> > > >>> dev@hbase.apache.org"
> >> > > >>>> <de...@hbase.apache.org>
> >> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> >> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> Hi,
> >> > > >>>>
> >> > > >>>> We are seeing some unfortunately low performance in the
> memstore
> >> -
> >> > we
> >> > > >>> have
> >> > > >>>> researched some of the previous JIRA(s) and seen some
> >> inefficiencies
> >> > > in
> >> > > >>> the
> >> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting
> 100
> >> %
> >> > cpu
> >> > > >> at
> >> > > >>>> weird points in time - the bug is hard to reproduce and there
> >> isn't
> >> > > >> like
> >> > > >>> a
> >> > > >>>> huge # of extra reads going to that region server or any
> >> substantial
> >> > > >>>> hotspot happening. The region server recovers the moment, we
> >> flush
> >> > the
> >> > > >>>> memstores or restart the region server. Our queries retrieve
> wide
> >> > rows
> >> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
> >> > > >>>>
> >> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> >> > > >>>> ConcurrentSkipListMap
> >> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column
> inside
> >> > > >>>> StoreScanner - this is understandable since the rows contain
> many
> >> > > >> columns
> >> > > >>>> and StoreScanner iterates one KeyValue at a time.
> >> > > >>>>
> >> > > >>>> So, I was looking at the code and it seems that every single
> time
> >> > > there
> >> > > >>> is
> >> > > >>>> a reseek call on the same memstore scanner, we make a fresh
> call
> >> to
> >> > > >> build
> >> > > >>>> an iterator() on the skip list set - this means we an
> additional
> >> > skip
> >> > > >>> list
> >> > > >>>> lookup for every column retrieved. SkipList lookups are O(n)
> and
> >> not
> >> > > >>> O(1).
> >> > > >>>>
> >> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
> >> that
> >> > > >>> number
> >> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> >> > > reverted
> >> > > >>> by
> >> > > >>>> HBASE 4195 and every next row/next column is now a reseek()
> and a
> >> > skip
> >> > > >>> list
> >> > > >>>> lookup rather than being an iterator.
> >> > > >>>>
> >> > > >>>> Are there any strong reasons against having the previous
> >> behaviour
> >> > of
> >> > > >>>> scanning a small # of keys before degenerating to a skip list
> >> > lookup ?
> >> > > >>>> Seems like it would really help for sequential memstore scans
> and
> >> > for
> >> > > >>>> memstore gets with wide tables (even 10-20 columns).
> >> > > >>>>
> >> > > >>>> Thanks
> >> > > >>>> Varun
> >> > > >>
> >> > >
> >> >
> >>
> >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Vladimir Rodionov <vl...@gmail.com>.
Its ScanQueryMatcher-ScanDeleteTracker responsibility to process  deletes
during scanning.


On Tue, Jan 28, 2014 at 9:43 AM, Varun Sharma <va...@pinterest.com> wrote:

> Ohk I think I understand this better now. So the order will actually be,
> something like this, at step #3
>
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=3)
> (ROW, COL1, T=1)  - filtered
> (ROW, COL2, T=3)
> (ROW, COL2, T=1)  - filtered
> (ROW, COL3, T=3)
> (ROW, COL3, T=1)  - filtered
>
> The ScanDeleteTracker class would simply filter out columns which have a
> timestamp < 2.
>
> Varun
>
>
> On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> > because COL2 > COL1 lexicographically. However in the above example, it
> > comes before the delete marker and hence before (ROW, COL1, T=1) which is
> > wrong, no ?
> >
> >
> > On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> bq. Now, clearly there will be columns above the delete marker which are
> >> smaller than the ones below it.
> >>
> >> This is where closer look is needed. Part of the confusion arises from
> >> usage of > and < in your example.
> >> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
> >>
> >> Here, in terms of sort order, 'above' means before. 'below it' would
> mean
> >> after. So 'smaller' would mean before.
> >>
> >> Cheers
> >>
> >>
> >> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > Not satisfied with your answer, the document you sent does not talk
> >> about
> >> > Delete ColumnFamily marker sort order. For the delete family marker to
> >> > work, it has to mask *all* columns of a family. Hence it has to be
> above
> >> > all the older columns. All the new columns must come above this column
> >> > family delete marker. Now, clearly there will be columns above the
> >> delete
> >> > marker which are smaller than the ones below it.
> >> >
> >> > The document talks nothing about delete marker order, could you answer
> >> the
> >> > question by looking through the example ?
> >> >
> >> > Varun
> >> >
> >> >
> >> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
> >> >
> >> > > Varun:
> >> > > Take a look at http://hbase.apache.org/book.html#dm.sort
> >> > >
> >> > > There's no contradiction.
> >> > >
> >> > > Cheers
> >> > >
> >> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >> > >
> >> > > > Actually, I now have another question because of the way our work
> >> load
> >> > is
> >> > > > structured. We use a wide schema and each time we write, we delete
> >> the
> >> > > > entire row and write a fresh set of columns - we want to make sure
> >> no
> >> > old
> >> > > > columns survive. So, I just want to see if my picture of the
> >> memstore
> >> > at
> >> > > > this point is correct or not. My understanding is that Memstore is
> >> > > > basically a skip list of keyvalues and compares the values using
> >> > KeyValue
> >> > > > comparator
> >> > > >
> >> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> >> > > >
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2.
> So
> >> > > > memstore has - my understanding is that we do not delete in the
> >> > memstore
> >> > > > but only add markers:
> >> > > >
> >> > > > (ROW, <DELETE>, T=2)
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > 3) Now we write our new fresh row for *T=3* - it should get
> inserted
> >> > > above
> >> > > > the delete.
> >> > > >
> >> > > > (ROW, COL1, T=3)
> >> > > > (ROW, COL2, T=3)
> >> > > > (ROW, COL3, T=3)
> >> > > > (ROW, <DELETE>, T=2)
> >> > > > (ROW, COL1, T=1)
> >> > > > (ROW, COL2, T=1)
> >> > > > (ROW, COL3, T=1)
> >> > > >
> >> > > > This is the ideal scenario for the data to be correctly reflected.
> >> > > >
> >> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> >> > hence,
> >> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> >> > > >
> >> > > > But, we also know that KeyValues compare first by ROW, then by
> >> Column
> >> > and
> >> > > > then by timestamp in reverse order
> >> > > >
> >> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> >> > > >
> >> > > > This seems to be contradicting and my main worry is that in a skip
> >> > list,
> >> > > it
> >> > > > is quite possible for skipping to happen as you go through the
> high
> >> > level
> >> > > > express lanes and it could be possible for one of these entries to
> >> > never
> >> > > > actually even see the delete marker. For example consider the case
> >> > above
> >> > > > where entry #1 and entry #5 form the higher level of the skip list
> >> and
> >> > > the
> >> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4,
> T=3)
> >> > and
> >> > > it
> >> > > > could end up in the wrong location.
> >> > > >
> >> > > > Obviously, if we cleanse all the row contents when a get a ROW
> level
> >> > > delete
> >> > > > marker, we are fine but I want to know if that is the case. If we
> >> are
> >> > not
> >> > > > really cleansing all the row contents when we get a ROW level
> delete
> >> > > > marker, then I want to know why the above scenario can not lead to
> >> bugs
> >> > > :)
> >> > > >
> >> > > > Varun
> >> > > >
> >> > > >
> >> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> >> > > > <vl...@gmail.com>wrote:
> >> > > >
> >> > > >> Varun,
> >> > > >>
> >> > > >> There is no need to open new JIRA - there are two already:
> >> > > >> https://issues.apache.org/jira/browse/HBASE-9769
> >> > > >> https://issues.apache.org/jira/browse/HBASE-9778
> >> > > >>
> >> > > >> Both with patches, you can grab and test them.
> >> > > >>
> >> > > >> -Vladimir
> >> > > >>
> >> > > >>
> >> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <
> varun@pinterest.com
> >> >
> >> > > wrote:
> >> > > >>
> >> > > >>> Hi lars,
> >> > > >>>
> >> > > >>> Thanks for the background. It seems that for our case, we will
> >> have
> >> > to
> >> > > >>> consider some solution like the Facebook one, since the next
> >> column
> >> > is
> >> > > >>> always the next one - this can be a simple flag. I am going to
> >> raise
> >> > a
> >> > > >> JIRA
> >> > > >>> and we can discuss there.
> >> > > >>>
> >> > > >>> Thanks
> >> > > >>> Varun
> >> > > >>>
> >> > > >>>
> >> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <
> larsh@apache.org>
> >> > > wrote:
> >> > > >>>
> >> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> >> chime
> >> > in
> >> > > >>>> soon. :)
> >> > > >>>>
> >> > > >>>> Reseek is expensive compared to next if next would get us the
> KV
> >> > we're
> >> > > >>>> looking for. However, HBase does not know that ahead of time.
> >> There
> >> > > >> might
> >> > > >>>> be a 1000 versions of the previous KV to be skipped first.
> >> > > >>>> HBase seeks in three situation:
> >> > > >>>> 1. Seek to the next column (there might be a lot of versions to
> >> > skip)
> >> > > >>>> 2. Seek to the next row (there might be a lot of versions and
> >> other
> >> > > >>>> columns to skip)
> >> > > >>>> 3. Seek to a row via a hint
> >> > > >>>>
> >> > > >>>> #3 is definitely useful, with that one can implement very
> >> efficient
> >> > > >> "skip
> >> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> >> > > >>>> #2 is helpful if there are many columns and one only "selects"
> a
> >> few
> >> > > >> (and
> >> > > >>>> of course also if there are many versions of columns)
> >> > > >>>> #1 is only helpful when we expect there to be many versions. Or
> >> of
> >> > the
> >> > > >>>> size of a typical KV aproaches the block size, since then we'd
> >> need
> >> > to
> >> > > >>> seek
> >> > > >>>> to the find the next block anyway.
> >> > > >>>>
> >> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns
> or
> >> is
> >> > > >> that
> >> > > >>>> just the number of column you return?
> >> > > >>>>
> >> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> >> > instruct
> >> > > >>> the
> >> > > >>>> scanner to not seek to the next column or the next row, but
> just
> >> > issue
> >> > > >>>> next()'s until the KV is found. Another suggested approach (I
> >> think
> >> > by
> >> > > >>> the
> >> > > >>>> Facebook guys) was to issue next() opportunistically a few
> times,
> >> > and
> >> > > >>> only
> >> > > >>>> when that did not get us to ther requested KV issue a reseek.
> >> > > >>>> I have also thought of a near/far designation of seeks. For
> near
> >> > seeks
> >> > > >>>> we'd do a configurable number of next()'s first, then seek.
> >> > > >>>> "near" seeks would be those of category #1 (and maybe #2)
> above.
> >> > > >>>>
> >> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915,
> maybe)
> >> > > >>>>
> >> > > >>>> I'll look at the trace a bit closers.
> >> > > >>>> So far my scan profiling has been focused on data in the
> >> blockcache
> >> > > >> since
> >> > > >>>> in the normal case the vast majority of all data is found there
> >> and
> >> > > >> only
> >> > > >>>> recent changes are in the memstore.
> >> > > >>>>
> >> > > >>>> -- Lars
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> ________________________________
> >> > > >>>> From: Varun Sharma <va...@pinterest.com>
> >> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> >> > > >>> dev@hbase.apache.org"
> >> > > >>>> <de...@hbase.apache.org>
> >> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> >> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> Hi,
> >> > > >>>>
> >> > > >>>> We are seeing some unfortunately low performance in the
> memstore
> >> -
> >> > we
> >> > > >>> have
> >> > > >>>> researched some of the previous JIRA(s) and seen some
> >> inefficiencies
> >> > > in
> >> > > >>> the
> >> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting
> 100
> >> %
> >> > cpu
> >> > > >> at
> >> > > >>>> weird points in time - the bug is hard to reproduce and there
> >> isn't
> >> > > >> like
> >> > > >>> a
> >> > > >>>> huge # of extra reads going to that region server or any
> >> substantial
> >> > > >>>> hotspot happening. The region server recovers the moment, we
> >> flush
> >> > the
> >> > > >>>> memstores or restart the region server. Our queries retrieve
> wide
> >> > rows
> >> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
> >> > > >>>>
> >> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> >> > > >>>> ConcurrentSkipListMap
> >> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column
> inside
> >> > > >>>> StoreScanner - this is understandable since the rows contain
> many
> >> > > >> columns
> >> > > >>>> and StoreScanner iterates one KeyValue at a time.
> >> > > >>>>
> >> > > >>>> So, I was looking at the code and it seems that every single
> time
> >> > > there
> >> > > >>> is
> >> > > >>>> a reseek call on the same memstore scanner, we make a fresh
> call
> >> to
> >> > > >> build
> >> > > >>>> an iterator() on the skip list set - this means we an
> additional
> >> > skip
> >> > > >>> list
> >> > > >>>> lookup for every column retrieved. SkipList lookups are O(n)
> and
> >> not
> >> > > >>> O(1).
> >> > > >>>>
> >> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
> >> that
> >> > > >>> number
> >> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> >> > > reverted
> >> > > >>> by
> >> > > >>>> HBASE 4195 and every next row/next column is now a reseek()
> and a
> >> > skip
> >> > > >>> list
> >> > > >>>> lookup rather than being an iterator.
> >> > > >>>>
> >> > > >>>> Are there any strong reasons against having the previous
> >> behaviour
> >> > of
> >> > > >>>> scanning a small # of keys before degenerating to a skip list
> >> > lookup ?
> >> > > >>>> Seems like it would really help for sequential memstore scans
> and
> >> > for
> >> > > >>>> memstore gets with wide tables (even 10-20 columns).
> >> > > >>>>
> >> > > >>>> Thanks
> >> > > >>>> Varun
> >> > > >>
> >> > >
> >> >
> >>
> >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Ohk I think I understand this better now. So the order will actually be,
something like this, at step #3

(ROW, <DELETE>, T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1)  - filtered
(ROW, COL2, T=3)
(ROW, COL2, T=1)  - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1)  - filtered

The ScanDeleteTracker class would simply filter out columns which have a
timestamp < 2.

Varun


On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <va...@pinterest.com> wrote:

> Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> because COL2 > COL1 lexicographically. However in the above example, it
> comes before the delete marker and hence before (ROW, COL1, T=1) which is
> wrong, no ?
>
>
> On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> bq. Now, clearly there will be columns above the delete marker which are
>> smaller than the ones below it.
>>
>> This is where closer look is needed. Part of the confusion arises from
>> usage of > and < in your example.
>> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>> Here, in terms of sort order, 'above' means before. 'below it' would mean
>> after. So 'smaller' would mean before.
>>
>> Cheers
>>
>>
>> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> > Hi Ted,
>> >
>> > Not satisfied with your answer, the document you sent does not talk
>> about
>> > Delete ColumnFamily marker sort order. For the delete family marker to
>> > work, it has to mask *all* columns of a family. Hence it has to be above
>> > all the older columns. All the new columns must come above this column
>> > family delete marker. Now, clearly there will be columns above the
>> delete
>> > marker which are smaller than the ones below it.
>> >
>> > The document talks nothing about delete marker order, could you answer
>> the
>> > question by looking through the example ?
>> >
>> > Varun
>> >
>> >
>> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
>> >
>> > > Varun:
>> > > Take a look at http://hbase.apache.org/book.html#dm.sort
>> > >
>> > > There's no contradiction.
>> > >
>> > > Cheers
>> > >
>> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>> > >
>> > > > Actually, I now have another question because of the way our work
>> load
>> > is
>> > > > structured. We use a wide schema and each time we write, we delete
>> the
>> > > > entire row and write a fresh set of columns - we want to make sure
>> no
>> > old
>> > > > columns survive. So, I just want to see if my picture of the
>> memstore
>> > at
>> > > > this point is correct or not. My understanding is that Memstore is
>> > > > basically a skip list of keyvalues and compares the values using
>> > KeyValue
>> > > > comparator
>> > > >
>> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
>> > > >
>> > > > (ROW, COL1, T=1)
>> > > > (ROW, COL2, T=1)
>> > > > (ROW, COL3, T=1)
>> > > >
>> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
>> > > > memstore has - my understanding is that we do not delete in the
>> > memstore
>> > > > but only add markers:
>> > > >
>> > > > (ROW, <DELETE>, T=2)
>> > > > (ROW, COL1, T=1)
>> > > > (ROW, COL2, T=1)
>> > > > (ROW, COL3, T=1)
>> > > >
>> > > > 3) Now we write our new fresh row for *T=3* - it should get inserted
>> > > above
>> > > > the delete.
>> > > >
>> > > > (ROW, COL1, T=3)
>> > > > (ROW, COL2, T=3)
>> > > > (ROW, COL3, T=3)
>> > > > (ROW, <DELETE>, T=2)
>> > > > (ROW, COL1, T=1)
>> > > > (ROW, COL2, T=1)
>> > > > (ROW, COL3, T=1)
>> > > >
>> > > > This is the ideal scenario for the data to be correctly reflected.
>> > > >
>> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
>> > hence,
>> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>> > > >
>> > > > But, we also know that KeyValues compare first by ROW, then by
>> Column
>> > and
>> > > > then by timestamp in reverse order
>> > > >
>> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
>> > > >
>> > > > This seems to be contradicting and my main worry is that in a skip
>> > list,
>> > > it
>> > > > is quite possible for skipping to happen as you go through the high
>> > level
>> > > > express lanes and it could be possible for one of these entries to
>> > never
>> > > > actually even see the delete marker. For example consider the case
>> > above
>> > > > where entry #1 and entry #5 form the higher level of the skip list
>> and
>> > > the
>> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
>> > and
>> > > it
>> > > > could end up in the wrong location.
>> > > >
>> > > > Obviously, if we cleanse all the row contents when a get a ROW level
>> > > delete
>> > > > marker, we are fine but I want to know if that is the case. If we
>> are
>> > not
>> > > > really cleansing all the row contents when we get a ROW level delete
>> > > > marker, then I want to know why the above scenario can not lead to
>> bugs
>> > > :)
>> > > >
>> > > > Varun
>> > > >
>> > > >
>> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
>> > > > <vl...@gmail.com>wrote:
>> > > >
>> > > >> Varun,
>> > > >>
>> > > >> There is no need to open new JIRA - there are two already:
>> > > >> https://issues.apache.org/jira/browse/HBASE-9769
>> > > >> https://issues.apache.org/jira/browse/HBASE-9778
>> > > >>
>> > > >> Both with patches, you can grab and test them.
>> > > >>
>> > > >> -Vladimir
>> > > >>
>> > > >>
>> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <varun@pinterest.com
>> >
>> > > wrote:
>> > > >>
>> > > >>> Hi lars,
>> > > >>>
>> > > >>> Thanks for the background. It seems that for our case, we will
>> have
>> > to
>> > > >>> consider some solution like the Facebook one, since the next
>> column
>> > is
>> > > >>> always the next one - this can be a simple flag. I am going to
>> raise
>> > a
>> > > >> JIRA
>> > > >>> and we can discuss there.
>> > > >>>
>> > > >>> Thanks
>> > > >>> Varun
>> > > >>>
>> > > >>>
>> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
>> > > wrote:
>> > > >>>
>> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
>> chime
>> > in
>> > > >>>> soon. :)
>> > > >>>>
>> > > >>>> Reseek is expensive compared to next if next would get us the KV
>> > we're
>> > > >>>> looking for. However, HBase does not know that ahead of time.
>> There
>> > > >> might
>> > > >>>> be a 1000 versions of the previous KV to be skipped first.
>> > > >>>> HBase seeks in three situation:
>> > > >>>> 1. Seek to the next column (there might be a lot of versions to
>> > skip)
>> > > >>>> 2. Seek to the next row (there might be a lot of versions and
>> other
>> > > >>>> columns to skip)
>> > > >>>> 3. Seek to a row via a hint
>> > > >>>>
>> > > >>>> #3 is definitely useful, with that one can implement very
>> efficient
>> > > >> "skip
>> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>> > > >>>> #2 is helpful if there are many columns and one only "selects" a
>> few
>> > > >> (and
>> > > >>>> of course also if there are many versions of columns)
>> > > >>>> #1 is only helpful when we expect there to be many versions. Or
>> of
>> > the
>> > > >>>> size of a typical KV aproaches the block size, since then we'd
>> need
>> > to
>> > > >>> seek
>> > > >>>> to the find the next block anyway.
>> > > >>>>
>> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns or
>> is
>> > > >> that
>> > > >>>> just the number of column you return?
>> > > >>>>
>> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
>> > instruct
>> > > >>> the
>> > > >>>> scanner to not seek to the next column or the next row, but just
>> > issue
>> > > >>>> next()'s until the KV is found. Another suggested approach (I
>> think
>> > by
>> > > >>> the
>> > > >>>> Facebook guys) was to issue next() opportunistically a few times,
>> > and
>> > > >>> only
>> > > >>>> when that did not get us to ther requested KV issue a reseek.
>> > > >>>> I have also thought of a near/far designation of seeks. For near
>> > seeks
>> > > >>>> we'd do a configurable number of next()'s first, then seek.
>> > > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
>> > > >>>>
>> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>> > > >>>>
>> > > >>>> I'll look at the trace a bit closers.
>> > > >>>> So far my scan profiling has been focused on data in the
>> blockcache
>> > > >> since
>> > > >>>> in the normal case the vast majority of all data is found there
>> and
>> > > >> only
>> > > >>>> recent changes are in the memstore.
>> > > >>>>
>> > > >>>> -- Lars
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>> ________________________________
>> > > >>>> From: Varun Sharma <va...@pinterest.com>
>> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
>> > > >>> dev@hbase.apache.org"
>> > > >>>> <de...@hbase.apache.org>
>> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
>> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>> > > >>>>
>> > > >>>>
>> > > >>>> Hi,
>> > > >>>>
>> > > >>>> We are seeing some unfortunately low performance in the memstore
>> -
>> > we
>> > > >>> have
>> > > >>>> researched some of the previous JIRA(s) and seen some
>> inefficiencies
>> > > in
>> > > >>> the
>> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100
>> %
>> > cpu
>> > > >> at
>> > > >>>> weird points in time - the bug is hard to reproduce and there
>> isn't
>> > > >> like
>> > > >>> a
>> > > >>>> huge # of extra reads going to that region server or any
>> substantial
>> > > >>>> hotspot happening. The region server recovers the moment, we
>> flush
>> > the
>> > > >>>> memstores or restart the region server. Our queries retrieve wide
>> > rows
>> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
>> > > >>>>
>> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>> > > >>>> ConcurrentSkipListMap
>> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>> > > >>>> StoreScanner - this is understandable since the rows contain many
>> > > >> columns
>> > > >>>> and StoreScanner iterates one KeyValue at a time.
>> > > >>>>
>> > > >>>> So, I was looking at the code and it seems that every single time
>> > > there
>> > > >>> is
>> > > >>>> a reseek call on the same memstore scanner, we make a fresh call
>> to
>> > > >> build
>> > > >>>> an iterator() on the skip list set - this means we an additional
>> > skip
>> > > >>> list
>> > > >>>> lookup for every column retrieved. SkipList lookups are O(n) and
>> not
>> > > >>> O(1).
>> > > >>>>
>> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
>> that
>> > > >>> number
>> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
>> > > reverted
>> > > >>> by
>> > > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
>> > skip
>> > > >>> list
>> > > >>>> lookup rather than being an iterator.
>> > > >>>>
>> > > >>>> Are there any strong reasons against having the previous
>> behaviour
>> > of
>> > > >>>> scanning a small # of keys before degenerating to a skip list
>> > lookup ?
>> > > >>>> Seems like it would really help for sequential memstore scans and
>> > for
>> > > >>>> memstore gets with wide tables (even 10-20 columns).
>> > > >>>>
>> > > >>>> Thanks
>> > > >>>> Varun
>> > > >>
>> > >
>> >
>>
>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by lars hofhansl <la...@apache.org>.
The confusion comes from the delete marker placement.

When you mark all columns for deletion a "family" delete marker is placed and will always sorted at the *beginning* the row, before all other KV for the row in that column family. (Note that a row is marked for delete by placing a family delete marker in each column family)


The logic is that scanner can determine what KVs have been marked for deletion while only scanning forward through the hfiles/memstores.
A family delete marker marks all versions of all columns for deletion so it has to come first.
That also means that any get/scan request has to first seek to the beginning of the row to see if there is a delete marker.


You example would look like this in the end:
(ROW, <DELETE_FAMILY>, T=2)
(ROW, COL1, T=3)
(ROW, COL2, T=3)
(ROW, COL3, T=3)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)


Note that there are two other delete marker types: column and version. These are sorted according to their versions right before the columns/versions they affect.


Does this blog post help: htthp://hadoop-hbase.blogspot.com/2011/12/deletion-in-hbase.html ?

-- Lars



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org> 
Cc: "user@hbase.apache.org" <us...@hbase.apache.org>; lars hofhansl <la...@apache.org> 
Sent: Tuesday, January 28, 2014 9:04 AM
Subject: Re: Sporadic memstore slowness for Read Heavy workloads
 

Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
because COL2 > COL1 lexicographically. However in the above example, it
comes before the delete marker and hence before (ROW, COL1, T=1) which is
wrong, no ?



On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. Now, clearly there will be columns above the delete marker which are
> smaller than the ones below it.
>
> This is where closer look is needed. Part of the confusion arises from
> usage of > and < in your example.
> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>
> Here, in terms of sort order, 'above' means before. 'below it' would mean
> after. So 'smaller' would mean before.
>
> Cheers
>
>
> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi Ted,
> >
> > Not satisfied with your answer, the document you sent does not talk about
> > Delete ColumnFamily marker sort order. For the delete family marker to
> > work, it has to mask *all* columns of a family. Hence it has to be above
> > all the older columns. All the new columns must come above this column
> > family delete marker. Now, clearly there will be columns above the delete
> > marker which are smaller than the ones below it.
> >
> > The document talks nothing about delete marker order, could you answer
> the
> > question by looking through the example ?
> >
> > Varun
> >
> >
> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Varun:
> > > Take a look at http://hbase.apache.org/book.html#dm.sort
> > >
> > > There's no contradiction.
> > >
> > > Cheers
> > >
> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> > >
> > > > Actually, I now have another question because of the way our work
> load
> > is
> > > > structured. We use a wide schema and each time we write, we delete
> the
> > > > entire row and write a fresh set of columns - we want to make sure no
> > old
> > > > columns survive. So, I just want to see if my picture of the memstore
> > at
> > > > this point is correct or not. My understanding is that Memstore is
> > > > basically a skip list of keyvalues and compares the values using
> > KeyValue
> > > > comparator
> > > >
> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> > > >
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > > > memstore has - my understanding is that we do not delete in the
> > memstore
> > > > but only add markers:
> > > >
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 3) Now we write our new fresh row for *T=3* - it should get inserted
> > > above
> > > > the delete.
> > > >
> > > > (ROW, COL1, T=3)
> > > > (ROW, COL2, T=3)
> > > > (ROW, COL3, T=3)
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > This is the ideal scenario for the data to be correctly reflected.
> > > >
> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> > hence,
> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> > > >
> > > > But, we also know that KeyValues compare first by ROW, then by Column
> > and
> > > > then by timestamp in reverse order
> > > >
> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> > > >
> > > > This seems to be contradicting and my main worry is that in a skip
> > list,
> > > it
> > > > is quite possible for skipping to happen as you go through the high
> > level
> > > > express lanes and it could be possible for one of these entries to
> > never
> > > > actually even see the delete marker. For example consider the case
> > above
> > > > where entry #1 and entry #5 form the higher level of the skip list
> and
> > > the
> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
> > and
> > > it
> > > > could end up in the wrong location.
> > > >
> > > > Obviously, if we cleanse all the row contents when a get a ROW level
> > > delete
> > > > marker, we are fine but I want to know if that is the case. If we are
> > not
> > > > really cleansing all the row contents when we get a ROW level delete
> > > > marker, then I want to know why the above scenario can not lead to
> bugs
> > > :)
> > > >
> > > > Varun
> > > >
> > > >
> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > > > <vl...@gmail.com>wrote:
> > > >
> > > >> Varun,
> > > >>
> > > >> There is no need to open new JIRA - there are two already:
> > > >> https://issues.apache.org/jira/browse/HBASE-9769
> > > >> https://issues.apache.org/jira/browse/HBASE-9778
> > > >>
> > > >> Both with patches, you can grab and test them.
> > > >>
> > > >> -Vladimir
> > > >>
> > > >>
> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> > > wrote:
> > > >>
> > > >>> Hi lars,
> > > >>>
> > > >>> Thanks for the background. It seems that for our case, we will have
> > to
> > > >>> consider some solution like the Facebook one, since the next column
> > is
> > > >>> always the next one - this can be a simple flag. I am going to
> raise
> > a
> > > >> JIRA
> > > >>> and we can discuss there.
> > > >>>
> > > >>> Thanks
> > > >>> Varun
> > > >>>
> > > >>>
> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> > > wrote:
> > > >>>
> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> chime
> > in
> > > >>>> soon. :)
> > > >>>>
> > > >>>> Reseek is expensive compared to next if next would get us the KV
> > we're
> > > >>>> looking for. However, HBase does not know that ahead of time.
> There
> > > >> might
> > > >>>> be a 1000 versions of the previous KV to be skipped first.
> > > >>>> HBase seeks in three situation:
> > > >>>> 1. Seek to the next column (there might be a lot of versions to
> > skip)
> > > >>>> 2. Seek to the next row (there might be a lot of versions and
> other
> > > >>>> columns to skip)
> > > >>>> 3. Seek to a row via a hint
> > > >>>>
> > > >>>> #3 is definitely useful, with that one can implement very
> efficient
> > > >> "skip
> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > > >>>> #2 is helpful if there are many columns and one only "selects" a
> few
> > > >> (and
> > > >>>> of course also if there are many versions of columns)
> > > >>>> #1 is only helpful when we expect there to be many versions. Or of
> > the
> > > >>>> size of a typical KV aproaches the block size, since then we'd
> need
> > to
> > > >>> seek
> > > >>>> to the find the next block anyway.
> > > >>>>
> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns or
> is
> > > >> that
> > > >>>> just the number of column you return?
> > > >>>>
> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> > instruct
> > > >>> the
> > > >>>> scanner to not seek to the next column or the next row, but just
> > issue
> > > >>>> next()'s until the KV is found. Another suggested approach (I
> think
> > by
> > > >>> the
> > > >>>> Facebook guys) was to issue next() opportunistically a few times,
> > and
> > > >>> only
> > > >>>> when that did not get us to ther requested KV issue a reseek.
> > > >>>> I have also thought of a near/far designation of seeks. For near
> > seeks
> > > >>>> we'd do a configurable number of next()'s first, then seek.
> > > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> > > >>>>
> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > > >>>>
> > > >>>> I'll look at the trace a bit closers.
> > > >>>> So far my scan profiling has been focused on data in the
> blockcache
> > > >> since
> > > >>>> in the normal case the vast majority of all data is found there
> and
> > > >> only
> > > >>>> recent changes are in the memstore.
> > > >>>>
> > > >>>> -- Lars
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> ________________________________
> > > >>>> From: Varun Sharma <va...@pinterest.com>
> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > > >>> dev@hbase.apache.org"
> > > >>>> <de...@hbase.apache.org>
> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> > > >>>>
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> We are seeing some unfortunately low performance in the memstore -
> > we
> > > >>> have
> > > >>>> researched some of the previous JIRA(s) and seen some
> inefficiencies
> > > in
> > > >>> the
> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
> > cpu
> > > >> at
> > > >>>> weird points in time - the bug is hard to reproduce and there
> isn't
> > > >> like
> > > >>> a
> > > >>>> huge # of extra reads going to that region server or any
> substantial
> > > >>>> hotspot happening. The region server recovers the moment, we flush
> > the
> > > >>>> memstores or restart the region server. Our queries retrieve wide
> > rows
> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
> > > >>>>
> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> > > >>>> ConcurrentSkipListMap
> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > > >>>> StoreScanner - this is understandable since the rows contain many
> > > >> columns
> > > >>>> and StoreScanner iterates one KeyValue at a time.
> > > >>>>
> > > >>>> So, I was looking at the code and it seems that every single time
> > > there
> > > >>> is
> > > >>>> a reseek call on the same memstore scanner, we make a fresh call
> to
> > > >> build
> > > >>>> an iterator() on the skip list set - this means we an additional
> > skip
> > > >>> list
> > > >>>> lookup for every column retrieved. SkipList lookups are O(n) and
> not
> > > >>> O(1).
> > > >>>>
> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
> that
> > > >>> number
> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> > > reverted
> > > >>> by
> > > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
> > skip
> > > >>> list
> > > >>>> lookup rather than being an iterator.
> > > >>>>
> > > >>>> Are there any strong reasons against having the previous behaviour
> > of
> > > >>>> scanning a small # of keys before degenerating to a skip list
> > lookup ?
> > > >>>> Seems like it would really help for sequential memstore scans and
> > for
> > > >>>> memstore gets with wide tables (even 10-20 columns).
> > > >>>>
> > > >>>> Thanks
> > > >>>> Varun
> > > >>
> > >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Ohk I think I understand this better now. So the order will actually be,
something like this, at step #3

(ROW, <DELETE>, T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1)  - filtered
(ROW, COL2, T=3)
(ROW, COL2, T=1)  - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1)  - filtered

The ScanDeleteTracker class would simply filter out columns which have a
timestamp < 2.

Varun


On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <va...@pinterest.com> wrote:

> Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> because COL2 > COL1 lexicographically. However in the above example, it
> comes before the delete marker and hence before (ROW, COL1, T=1) which is
> wrong, no ?
>
>
> On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> bq. Now, clearly there will be columns above the delete marker which are
>> smaller than the ones below it.
>>
>> This is where closer look is needed. Part of the confusion arises from
>> usage of > and < in your example.
>> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>> Here, in terms of sort order, 'above' means before. 'below it' would mean
>> after. So 'smaller' would mean before.
>>
>> Cheers
>>
>>
>> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> > Hi Ted,
>> >
>> > Not satisfied with your answer, the document you sent does not talk
>> about
>> > Delete ColumnFamily marker sort order. For the delete family marker to
>> > work, it has to mask *all* columns of a family. Hence it has to be above
>> > all the older columns. All the new columns must come above this column
>> > family delete marker. Now, clearly there will be columns above the
>> delete
>> > marker which are smaller than the ones below it.
>> >
>> > The document talks nothing about delete marker order, could you answer
>> the
>> > question by looking through the example ?
>> >
>> > Varun
>> >
>> >
>> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
>> >
>> > > Varun:
>> > > Take a look at http://hbase.apache.org/book.html#dm.sort
>> > >
>> > > There's no contradiction.
>> > >
>> > > Cheers
>> > >
>> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>> > >
>> > > > Actually, I now have another question because of the way our work
>> load
>> > is
>> > > > structured. We use a wide schema and each time we write, we delete
>> the
>> > > > entire row and write a fresh set of columns - we want to make sure
>> no
>> > old
>> > > > columns survive. So, I just want to see if my picture of the
>> memstore
>> > at
>> > > > this point is correct or not. My understanding is that Memstore is
>> > > > basically a skip list of keyvalues and compares the values using
>> > KeyValue
>> > > > comparator
>> > > >
>> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
>> > > >
>> > > > (ROW, COL1, T=1)
>> > > > (ROW, COL2, T=1)
>> > > > (ROW, COL3, T=1)
>> > > >
>> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
>> > > > memstore has - my understanding is that we do not delete in the
>> > memstore
>> > > > but only add markers:
>> > > >
>> > > > (ROW, <DELETE>, T=2)
>> > > > (ROW, COL1, T=1)
>> > > > (ROW, COL2, T=1)
>> > > > (ROW, COL3, T=1)
>> > > >
>> > > > 3) Now we write our new fresh row for *T=3* - it should get inserted
>> > > above
>> > > > the delete.
>> > > >
>> > > > (ROW, COL1, T=3)
>> > > > (ROW, COL2, T=3)
>> > > > (ROW, COL3, T=3)
>> > > > (ROW, <DELETE>, T=2)
>> > > > (ROW, COL1, T=1)
>> > > > (ROW, COL2, T=1)
>> > > > (ROW, COL3, T=1)
>> > > >
>> > > > This is the ideal scenario for the data to be correctly reflected.
>> > > >
>> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
>> > hence,
>> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>> > > >
>> > > > But, we also know that KeyValues compare first by ROW, then by
>> Column
>> > and
>> > > > then by timestamp in reverse order
>> > > >
>> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
>> > > >
>> > > > This seems to be contradicting and my main worry is that in a skip
>> > list,
>> > > it
>> > > > is quite possible for skipping to happen as you go through the high
>> > level
>> > > > express lanes and it could be possible for one of these entries to
>> > never
>> > > > actually even see the delete marker. For example consider the case
>> > above
>> > > > where entry #1 and entry #5 form the higher level of the skip list
>> and
>> > > the
>> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
>> > and
>> > > it
>> > > > could end up in the wrong location.
>> > > >
>> > > > Obviously, if we cleanse all the row contents when a get a ROW level
>> > > delete
>> > > > marker, we are fine but I want to know if that is the case. If we
>> are
>> > not
>> > > > really cleansing all the row contents when we get a ROW level delete
>> > > > marker, then I want to know why the above scenario can not lead to
>> bugs
>> > > :)
>> > > >
>> > > > Varun
>> > > >
>> > > >
>> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
>> > > > <vl...@gmail.com>wrote:
>> > > >
>> > > >> Varun,
>> > > >>
>> > > >> There is no need to open new JIRA - there are two already:
>> > > >> https://issues.apache.org/jira/browse/HBASE-9769
>> > > >> https://issues.apache.org/jira/browse/HBASE-9778
>> > > >>
>> > > >> Both with patches, you can grab and test them.
>> > > >>
>> > > >> -Vladimir
>> > > >>
>> > > >>
>> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <varun@pinterest.com
>> >
>> > > wrote:
>> > > >>
>> > > >>> Hi lars,
>> > > >>>
>> > > >>> Thanks for the background. It seems that for our case, we will
>> have
>> > to
>> > > >>> consider some solution like the Facebook one, since the next
>> column
>> > is
>> > > >>> always the next one - this can be a simple flag. I am going to
>> raise
>> > a
>> > > >> JIRA
>> > > >>> and we can discuss there.
>> > > >>>
>> > > >>> Thanks
>> > > >>> Varun
>> > > >>>
>> > > >>>
>> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
>> > > wrote:
>> > > >>>
>> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
>> chime
>> > in
>> > > >>>> soon. :)
>> > > >>>>
>> > > >>>> Reseek is expensive compared to next if next would get us the KV
>> > we're
>> > > >>>> looking for. However, HBase does not know that ahead of time.
>> There
>> > > >> might
>> > > >>>> be a 1000 versions of the previous KV to be skipped first.
>> > > >>>> HBase seeks in three situation:
>> > > >>>> 1. Seek to the next column (there might be a lot of versions to
>> > skip)
>> > > >>>> 2. Seek to the next row (there might be a lot of versions and
>> other
>> > > >>>> columns to skip)
>> > > >>>> 3. Seek to a row via a hint
>> > > >>>>
>> > > >>>> #3 is definitely useful, with that one can implement very
>> efficient
>> > > >> "skip
>> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>> > > >>>> #2 is helpful if there are many columns and one only "selects" a
>> few
>> > > >> (and
>> > > >>>> of course also if there are many versions of columns)
>> > > >>>> #1 is only helpful when we expect there to be many versions. Or
>> of
>> > the
>> > > >>>> size of a typical KV aproaches the block size, since then we'd
>> need
>> > to
>> > > >>> seek
>> > > >>>> to the find the next block anyway.
>> > > >>>>
>> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns or
>> is
>> > > >> that
>> > > >>>> just the number of column you return?
>> > > >>>>
>> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
>> > instruct
>> > > >>> the
>> > > >>>> scanner to not seek to the next column or the next row, but just
>> > issue
>> > > >>>> next()'s until the KV is found. Another suggested approach (I
>> think
>> > by
>> > > >>> the
>> > > >>>> Facebook guys) was to issue next() opportunistically a few times,
>> > and
>> > > >>> only
>> > > >>>> when that did not get us to ther requested KV issue a reseek.
>> > > >>>> I have also thought of a near/far designation of seeks. For near
>> > seeks
>> > > >>>> we'd do a configurable number of next()'s first, then seek.
>> > > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
>> > > >>>>
>> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>> > > >>>>
>> > > >>>> I'll look at the trace a bit closers.
>> > > >>>> So far my scan profiling has been focused on data in the
>> blockcache
>> > > >> since
>> > > >>>> in the normal case the vast majority of all data is found there
>> and
>> > > >> only
>> > > >>>> recent changes are in the memstore.
>> > > >>>>
>> > > >>>> -- Lars
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>> ________________________________
>> > > >>>> From: Varun Sharma <va...@pinterest.com>
>> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
>> > > >>> dev@hbase.apache.org"
>> > > >>>> <de...@hbase.apache.org>
>> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
>> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>> > > >>>>
>> > > >>>>
>> > > >>>> Hi,
>> > > >>>>
>> > > >>>> We are seeing some unfortunately low performance in the memstore
>> -
>> > we
>> > > >>> have
>> > > >>>> researched some of the previous JIRA(s) and seen some
>> inefficiencies
>> > > in
>> > > >>> the
>> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100
>> %
>> > cpu
>> > > >> at
>> > > >>>> weird points in time - the bug is hard to reproduce and there
>> isn't
>> > > >> like
>> > > >>> a
>> > > >>>> huge # of extra reads going to that region server or any
>> substantial
>> > > >>>> hotspot happening. The region server recovers the moment, we
>> flush
>> > the
>> > > >>>> memstores or restart the region server. Our queries retrieve wide
>> > rows
>> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
>> > > >>>>
>> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>> > > >>>> ConcurrentSkipListMap
>> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>> > > >>>> StoreScanner - this is understandable since the rows contain many
>> > > >> columns
>> > > >>>> and StoreScanner iterates one KeyValue at a time.
>> > > >>>>
>> > > >>>> So, I was looking at the code and it seems that every single time
>> > > there
>> > > >>> is
>> > > >>>> a reseek call on the same memstore scanner, we make a fresh call
>> to
>> > > >> build
>> > > >>>> an iterator() on the skip list set - this means we an additional
>> > skip
>> > > >>> list
>> > > >>>> lookup for every column retrieved. SkipList lookups are O(n) and
>> not
>> > > >>> O(1).
>> > > >>>>
>> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
>> that
>> > > >>> number
>> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
>> > > reverted
>> > > >>> by
>> > > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
>> > skip
>> > > >>> list
>> > > >>>> lookup rather than being an iterator.
>> > > >>>>
>> > > >>>> Are there any strong reasons against having the previous
>> behaviour
>> > of
>> > > >>>> scanning a small # of keys before degenerating to a skip list
>> > lookup ?
>> > > >>>> Seems like it would really help for sequential memstore scans and
>> > for
>> > > >>>> memstore gets with wide tables (even 10-20 columns).
>> > > >>>>
>> > > >>>> Thanks
>> > > >>>> Varun
>> > > >>
>> > >
>> >
>>
>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
because COL2 > COL1 lexicographically. However in the above example, it
comes before the delete marker and hence before (ROW, COL1, T=1) which is
wrong, no ?


On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. Now, clearly there will be columns above the delete marker which are
> smaller than the ones below it.
>
> This is where closer look is needed. Part of the confusion arises from
> usage of > and < in your example.
> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>
> Here, in terms of sort order, 'above' means before. 'below it' would mean
> after. So 'smaller' would mean before.
>
> Cheers
>
>
> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi Ted,
> >
> > Not satisfied with your answer, the document you sent does not talk about
> > Delete ColumnFamily marker sort order. For the delete family marker to
> > work, it has to mask *all* columns of a family. Hence it has to be above
> > all the older columns. All the new columns must come above this column
> > family delete marker. Now, clearly there will be columns above the delete
> > marker which are smaller than the ones below it.
> >
> > The document talks nothing about delete marker order, could you answer
> the
> > question by looking through the example ?
> >
> > Varun
> >
> >
> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Varun:
> > > Take a look at http://hbase.apache.org/book.html#dm.sort
> > >
> > > There's no contradiction.
> > >
> > > Cheers
> > >
> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> > >
> > > > Actually, I now have another question because of the way our work
> load
> > is
> > > > structured. We use a wide schema and each time we write, we delete
> the
> > > > entire row and write a fresh set of columns - we want to make sure no
> > old
> > > > columns survive. So, I just want to see if my picture of the memstore
> > at
> > > > this point is correct or not. My understanding is that Memstore is
> > > > basically a skip list of keyvalues and compares the values using
> > KeyValue
> > > > comparator
> > > >
> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> > > >
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > > > memstore has - my understanding is that we do not delete in the
> > memstore
> > > > but only add markers:
> > > >
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 3) Now we write our new fresh row for *T=3* - it should get inserted
> > > above
> > > > the delete.
> > > >
> > > > (ROW, COL1, T=3)
> > > > (ROW, COL2, T=3)
> > > > (ROW, COL3, T=3)
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > This is the ideal scenario for the data to be correctly reflected.
> > > >
> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> > hence,
> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> > > >
> > > > But, we also know that KeyValues compare first by ROW, then by Column
> > and
> > > > then by timestamp in reverse order
> > > >
> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> > > >
> > > > This seems to be contradicting and my main worry is that in a skip
> > list,
> > > it
> > > > is quite possible for skipping to happen as you go through the high
> > level
> > > > express lanes and it could be possible for one of these entries to
> > never
> > > > actually even see the delete marker. For example consider the case
> > above
> > > > where entry #1 and entry #5 form the higher level of the skip list
> and
> > > the
> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
> > and
> > > it
> > > > could end up in the wrong location.
> > > >
> > > > Obviously, if we cleanse all the row contents when a get a ROW level
> > > delete
> > > > marker, we are fine but I want to know if that is the case. If we are
> > not
> > > > really cleansing all the row contents when we get a ROW level delete
> > > > marker, then I want to know why the above scenario can not lead to
> bugs
> > > :)
> > > >
> > > > Varun
> > > >
> > > >
> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > > > <vl...@gmail.com>wrote:
> > > >
> > > >> Varun,
> > > >>
> > > >> There is no need to open new JIRA - there are two already:
> > > >> https://issues.apache.org/jira/browse/HBASE-9769
> > > >> https://issues.apache.org/jira/browse/HBASE-9778
> > > >>
> > > >> Both with patches, you can grab and test them.
> > > >>
> > > >> -Vladimir
> > > >>
> > > >>
> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> > > wrote:
> > > >>
> > > >>> Hi lars,
> > > >>>
> > > >>> Thanks for the background. It seems that for our case, we will have
> > to
> > > >>> consider some solution like the Facebook one, since the next column
> > is
> > > >>> always the next one - this can be a simple flag. I am going to
> raise
> > a
> > > >> JIRA
> > > >>> and we can discuss there.
> > > >>>
> > > >>> Thanks
> > > >>> Varun
> > > >>>
> > > >>>
> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> > > wrote:
> > > >>>
> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> chime
> > in
> > > >>>> soon. :)
> > > >>>>
> > > >>>> Reseek is expensive compared to next if next would get us the KV
> > we're
> > > >>>> looking for. However, HBase does not know that ahead of time.
> There
> > > >> might
> > > >>>> be a 1000 versions of the previous KV to be skipped first.
> > > >>>> HBase seeks in three situation:
> > > >>>> 1. Seek to the next column (there might be a lot of versions to
> > skip)
> > > >>>> 2. Seek to the next row (there might be a lot of versions and
> other
> > > >>>> columns to skip)
> > > >>>> 3. Seek to a row via a hint
> > > >>>>
> > > >>>> #3 is definitely useful, with that one can implement very
> efficient
> > > >> "skip
> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > > >>>> #2 is helpful if there are many columns and one only "selects" a
> few
> > > >> (and
> > > >>>> of course also if there are many versions of columns)
> > > >>>> #1 is only helpful when we expect there to be many versions. Or of
> > the
> > > >>>> size of a typical KV aproaches the block size, since then we'd
> need
> > to
> > > >>> seek
> > > >>>> to the find the next block anyway.
> > > >>>>
> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns or
> is
> > > >> that
> > > >>>> just the number of column you return?
> > > >>>>
> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> > instruct
> > > >>> the
> > > >>>> scanner to not seek to the next column or the next row, but just
> > issue
> > > >>>> next()'s until the KV is found. Another suggested approach (I
> think
> > by
> > > >>> the
> > > >>>> Facebook guys) was to issue next() opportunistically a few times,
> > and
> > > >>> only
> > > >>>> when that did not get us to ther requested KV issue a reseek.
> > > >>>> I have also thought of a near/far designation of seeks. For near
> > seeks
> > > >>>> we'd do a configurable number of next()'s first, then seek.
> > > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> > > >>>>
> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > > >>>>
> > > >>>> I'll look at the trace a bit closers.
> > > >>>> So far my scan profiling has been focused on data in the
> blockcache
> > > >> since
> > > >>>> in the normal case the vast majority of all data is found there
> and
> > > >> only
> > > >>>> recent changes are in the memstore.
> > > >>>>
> > > >>>> -- Lars
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> ________________________________
> > > >>>> From: Varun Sharma <va...@pinterest.com>
> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > > >>> dev@hbase.apache.org"
> > > >>>> <de...@hbase.apache.org>
> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> > > >>>>
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> We are seeing some unfortunately low performance in the memstore -
> > we
> > > >>> have
> > > >>>> researched some of the previous JIRA(s) and seen some
> inefficiencies
> > > in
> > > >>> the
> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
> > cpu
> > > >> at
> > > >>>> weird points in time - the bug is hard to reproduce and there
> isn't
> > > >> like
> > > >>> a
> > > >>>> huge # of extra reads going to that region server or any
> substantial
> > > >>>> hotspot happening. The region server recovers the moment, we flush
> > the
> > > >>>> memstores or restart the region server. Our queries retrieve wide
> > rows
> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
> > > >>>>
> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> > > >>>> ConcurrentSkipListMap
> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > > >>>> StoreScanner - this is understandable since the rows contain many
> > > >> columns
> > > >>>> and StoreScanner iterates one KeyValue at a time.
> > > >>>>
> > > >>>> So, I was looking at the code and it seems that every single time
> > > there
> > > >>> is
> > > >>>> a reseek call on the same memstore scanner, we make a fresh call
> to
> > > >> build
> > > >>>> an iterator() on the skip list set - this means we an additional
> > skip
> > > >>> list
> > > >>>> lookup for every column retrieved. SkipList lookups are O(n) and
> not
> > > >>> O(1).
> > > >>>>
> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
> that
> > > >>> number
> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> > > reverted
> > > >>> by
> > > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
> > skip
> > > >>> list
> > > >>>> lookup rather than being an iterator.
> > > >>>>
> > > >>>> Are there any strong reasons against having the previous behaviour
> > of
> > > >>>> scanning a small # of keys before degenerating to a skip list
> > lookup ?
> > > >>>> Seems like it would really help for sequential memstore scans and
> > for
> > > >>>> memstore gets with wide tables (even 10-20 columns).
> > > >>>>
> > > >>>> Thanks
> > > >>>> Varun
> > > >>
> > >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
because COL2 > COL1 lexicographically. However in the above example, it
comes before the delete marker and hence before (ROW, COL1, T=1) which is
wrong, no ?


On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. Now, clearly there will be columns above the delete marker which are
> smaller than the ones below it.
>
> This is where closer look is needed. Part of the confusion arises from
> usage of > and < in your example.
> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>
> Here, in terms of sort order, 'above' means before. 'below it' would mean
> after. So 'smaller' would mean before.
>
> Cheers
>
>
> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi Ted,
> >
> > Not satisfied with your answer, the document you sent does not talk about
> > Delete ColumnFamily marker sort order. For the delete family marker to
> > work, it has to mask *all* columns of a family. Hence it has to be above
> > all the older columns. All the new columns must come above this column
> > family delete marker. Now, clearly there will be columns above the delete
> > marker which are smaller than the ones below it.
> >
> > The document talks nothing about delete marker order, could you answer
> the
> > question by looking through the example ?
> >
> > Varun
> >
> >
> > On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Varun:
> > > Take a look at http://hbase.apache.org/book.html#dm.sort
> > >
> > > There's no contradiction.
> > >
> > > Cheers
> > >
> > > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> > >
> > > > Actually, I now have another question because of the way our work
> load
> > is
> > > > structured. We use a wide schema and each time we write, we delete
> the
> > > > entire row and write a fresh set of columns - we want to make sure no
> > old
> > > > columns survive. So, I just want to see if my picture of the memstore
> > at
> > > > this point is correct or not. My understanding is that Memstore is
> > > > basically a skip list of keyvalues and compares the values using
> > KeyValue
> > > > comparator
> > > >
> > > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> > > >
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > > > memstore has - my understanding is that we do not delete in the
> > memstore
> > > > but only add markers:
> > > >
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > 3) Now we write our new fresh row for *T=3* - it should get inserted
> > > above
> > > > the delete.
> > > >
> > > > (ROW, COL1, T=3)
> > > > (ROW, COL2, T=3)
> > > > (ROW, COL3, T=3)
> > > > (ROW, <DELETE>, T=2)
> > > > (ROW, COL1, T=1)
> > > > (ROW, COL2, T=1)
> > > > (ROW, COL3, T=1)
> > > >
> > > > This is the ideal scenario for the data to be correctly reflected.
> > > >
> > > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> > hence,
> > > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> > > >
> > > > But, we also know that KeyValues compare first by ROW, then by Column
> > and
> > > > then by timestamp in reverse order
> > > >
> > > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> > > >
> > > > This seems to be contradicting and my main worry is that in a skip
> > list,
> > > it
> > > > is quite possible for skipping to happen as you go through the high
> > level
> > > > express lanes and it could be possible for one of these entries to
> > never
> > > > actually even see the delete marker. For example consider the case
> > above
> > > > where entry #1 and entry #5 form the higher level of the skip list
> and
> > > the
> > > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
> > and
> > > it
> > > > could end up in the wrong location.
> > > >
> > > > Obviously, if we cleanse all the row contents when a get a ROW level
> > > delete
> > > > marker, we are fine but I want to know if that is the case. If we are
> > not
> > > > really cleansing all the row contents when we get a ROW level delete
> > > > marker, then I want to know why the above scenario can not lead to
> bugs
> > > :)
> > > >
> > > > Varun
> > > >
> > > >
> > > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > > > <vl...@gmail.com>wrote:
> > > >
> > > >> Varun,
> > > >>
> > > >> There is no need to open new JIRA - there are two already:
> > > >> https://issues.apache.org/jira/browse/HBASE-9769
> > > >> https://issues.apache.org/jira/browse/HBASE-9778
> > > >>
> > > >> Both with patches, you can grab and test them.
> > > >>
> > > >> -Vladimir
> > > >>
> > > >>
> > > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> > > wrote:
> > > >>
> > > >>> Hi lars,
> > > >>>
> > > >>> Thanks for the background. It seems that for our case, we will have
> > to
> > > >>> consider some solution like the Facebook one, since the next column
> > is
> > > >>> always the next one - this can be a simple flag. I am going to
> raise
> > a
> > > >> JIRA
> > > >>> and we can discuss there.
> > > >>>
> > > >>> Thanks
> > > >>> Varun
> > > >>>
> > > >>>
> > > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> > > wrote:
> > > >>>
> > > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> chime
> > in
> > > >>>> soon. :)
> > > >>>>
> > > >>>> Reseek is expensive compared to next if next would get us the KV
> > we're
> > > >>>> looking for. However, HBase does not know that ahead of time.
> There
> > > >> might
> > > >>>> be a 1000 versions of the previous KV to be skipped first.
> > > >>>> HBase seeks in three situation:
> > > >>>> 1. Seek to the next column (there might be a lot of versions to
> > skip)
> > > >>>> 2. Seek to the next row (there might be a lot of versions and
> other
> > > >>>> columns to skip)
> > > >>>> 3. Seek to a row via a hint
> > > >>>>
> > > >>>> #3 is definitely useful, with that one can implement very
> efficient
> > > >> "skip
> > > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > > >>>> #2 is helpful if there are many columns and one only "selects" a
> few
> > > >> (and
> > > >>>> of course also if there are many versions of columns)
> > > >>>> #1 is only helpful when we expect there to be many versions. Or of
> > the
> > > >>>> size of a typical KV aproaches the block size, since then we'd
> need
> > to
> > > >>> seek
> > > >>>> to the find the next block anyway.
> > > >>>>
> > > >>>> You might well be a victim of #1. Are your rows 10-20 columns or
> is
> > > >> that
> > > >>>> just the number of column you return?
> > > >>>>
> > > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> > instruct
> > > >>> the
> > > >>>> scanner to not seek to the next column or the next row, but just
> > issue
> > > >>>> next()'s until the KV is found. Another suggested approach (I
> think
> > by
> > > >>> the
> > > >>>> Facebook guys) was to issue next() opportunistically a few times,
> > and
> > > >>> only
> > > >>>> when that did not get us to ther requested KV issue a reseek.
> > > >>>> I have also thought of a near/far designation of seeks. For near
> > seeks
> > > >>>> we'd do a configurable number of next()'s first, then seek.
> > > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> > > >>>>
> > > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > > >>>>
> > > >>>> I'll look at the trace a bit closers.
> > > >>>> So far my scan profiling has been focused on data in the
> blockcache
> > > >> since
> > > >>>> in the normal case the vast majority of all data is found there
> and
> > > >> only
> > > >>>> recent changes are in the memstore.
> > > >>>>
> > > >>>> -- Lars
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> ________________________________
> > > >>>> From: Varun Sharma <va...@pinterest.com>
> > > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > > >>> dev@hbase.apache.org"
> > > >>>> <de...@hbase.apache.org>
> > > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> > > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> > > >>>>
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> We are seeing some unfortunately low performance in the memstore -
> > we
> > > >>> have
> > > >>>> researched some of the previous JIRA(s) and seen some
> inefficiencies
> > > in
> > > >>> the
> > > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
> > cpu
> > > >> at
> > > >>>> weird points in time - the bug is hard to reproduce and there
> isn't
> > > >> like
> > > >>> a
> > > >>>> huge # of extra reads going to that region server or any
> substantial
> > > >>>> hotspot happening. The region server recovers the moment, we flush
> > the
> > > >>>> memstores or restart the region server. Our queries retrieve wide
> > rows
> > > >>>> which are upto 10-20 columns. A stack trace shows two things:
> > > >>>>
> > > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> > > >>>> ConcurrentSkipListMap
> > > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > > >>>> StoreScanner - this is understandable since the rows contain many
> > > >> columns
> > > >>>> and StoreScanner iterates one KeyValue at a time.
> > > >>>>
> > > >>>> So, I was looking at the code and it seems that every single time
> > > there
> > > >>> is
> > > >>>> a reseek call on the same memstore scanner, we make a fresh call
> to
> > > >> build
> > > >>>> an iterator() on the skip list set - this means we an additional
> > skip
> > > >>> list
> > > >>>> lookup for every column retrieved. SkipList lookups are O(n) and
> not
> > > >>> O(1).
> > > >>>>
> > > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if
> that
> > > >>> number
> > > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> > > reverted
> > > >>> by
> > > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
> > skip
> > > >>> list
> > > >>>> lookup rather than being an iterator.
> > > >>>>
> > > >>>> Are there any strong reasons against having the previous behaviour
> > of
> > > >>>> scanning a small # of keys before degenerating to a skip list
> > lookup ?
> > > >>>> Seems like it would really help for sequential memstore scans and
> > for
> > > >>>> memstore gets with wide tables (even 10-20 columns).
> > > >>>>
> > > >>>> Thanks
> > > >>>> Varun
> > > >>
> > >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Ted Yu <yu...@gmail.com>.
bq. Now, clearly there will be columns above the delete marker which are
smaller than the ones below it.

This is where closer look is needed. Part of the confusion arises from
usage of > and < in your example.
(ROW, COL2, T=3) would sort before (ROW, COL1, T=1).

Here, in terms of sort order, 'above' means before. 'below it' would mean
after. So 'smaller' would mean before.

Cheers


On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Ted,
>
> Not satisfied with your answer, the document you sent does not talk about
> Delete ColumnFamily marker sort order. For the delete family marker to
> work, it has to mask *all* columns of a family. Hence it has to be above
> all the older columns. All the new columns must come above this column
> family delete marker. Now, clearly there will be columns above the delete
> marker which are smaller than the ones below it.
>
> The document talks nothing about delete marker order, could you answer the
> question by looking through the example ?
>
> Varun
>
>
> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > Varun:
> > Take a look at http://hbase.apache.org/book.html#dm.sort
> >
> > There's no contradiction.
> >
> > Cheers
> >
> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
> >
> > > Actually, I now have another question because of the way our work load
> is
> > > structured. We use a wide schema and each time we write, we delete the
> > > entire row and write a fresh set of columns - we want to make sure no
> old
> > > columns survive. So, I just want to see if my picture of the memstore
> at
> > > this point is correct or not. My understanding is that Memstore is
> > > basically a skip list of keyvalues and compares the values using
> KeyValue
> > > comparator
> > >
> > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> > >
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > > memstore has - my understanding is that we do not delete in the
> memstore
> > > but only add markers:
> > >
> > > (ROW, <DELETE>, T=2)
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > 3) Now we write our new fresh row for *T=3* - it should get inserted
> > above
> > > the delete.
> > >
> > > (ROW, COL1, T=3)
> > > (ROW, COL2, T=3)
> > > (ROW, COL3, T=3)
> > > (ROW, <DELETE>, T=2)
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > This is the ideal scenario for the data to be correctly reflected.
> > >
> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> hence,
> > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> > >
> > > But, we also know that KeyValues compare first by ROW, then by Column
> and
> > > then by timestamp in reverse order
> > >
> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> > >
> > > This seems to be contradicting and my main worry is that in a skip
> list,
> > it
> > > is quite possible for skipping to happen as you go through the high
> level
> > > express lanes and it could be possible for one of these entries to
> never
> > > actually even see the delete marker. For example consider the case
> above
> > > where entry #1 and entry #5 form the higher level of the skip list and
> > the
> > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
> and
> > it
> > > could end up in the wrong location.
> > >
> > > Obviously, if we cleanse all the row contents when a get a ROW level
> > delete
> > > marker, we are fine but I want to know if that is the case. If we are
> not
> > > really cleansing all the row contents when we get a ROW level delete
> > > marker, then I want to know why the above scenario can not lead to bugs
> > :)
> > >
> > > Varun
> > >
> > >
> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > > <vl...@gmail.com>wrote:
> > >
> > >> Varun,
> > >>
> > >> There is no need to open new JIRA - there are two already:
> > >> https://issues.apache.org/jira/browse/HBASE-9769
> > >> https://issues.apache.org/jira/browse/HBASE-9778
> > >>
> > >> Both with patches, you can grab and test them.
> > >>
> > >> -Vladimir
> > >>
> > >>
> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> > wrote:
> > >>
> > >>> Hi lars,
> > >>>
> > >>> Thanks for the background. It seems that for our case, we will have
> to
> > >>> consider some solution like the Facebook one, since the next column
> is
> > >>> always the next one - this can be a simple flag. I am going to raise
> a
> > >> JIRA
> > >>> and we can discuss there.
> > >>>
> > >>> Thanks
> > >>> Varun
> > >>>
> > >>>
> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > >>>
> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
> in
> > >>>> soon. :)
> > >>>>
> > >>>> Reseek is expensive compared to next if next would get us the KV
> we're
> > >>>> looking for. However, HBase does not know that ahead of time. There
> > >> might
> > >>>> be a 1000 versions of the previous KV to be skipped first.
> > >>>> HBase seeks in three situation:
> > >>>> 1. Seek to the next column (there might be a lot of versions to
> skip)
> > >>>> 2. Seek to the next row (there might be a lot of versions and other
> > >>>> columns to skip)
> > >>>> 3. Seek to a row via a hint
> > >>>>
> > >>>> #3 is definitely useful, with that one can implement very efficient
> > >> "skip
> > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > >>>> #2 is helpful if there are many columns and one only "selects" a few
> > >> (and
> > >>>> of course also if there are many versions of columns)
> > >>>> #1 is only helpful when we expect there to be many versions. Or of
> the
> > >>>> size of a typical KV aproaches the block size, since then we'd need
> to
> > >>> seek
> > >>>> to the find the next block anyway.
> > >>>>
> > >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
> > >> that
> > >>>> just the number of column you return?
> > >>>>
> > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> instruct
> > >>> the
> > >>>> scanner to not seek to the next column or the next row, but just
> issue
> > >>>> next()'s until the KV is found. Another suggested approach (I think
> by
> > >>> the
> > >>>> Facebook guys) was to issue next() opportunistically a few times,
> and
> > >>> only
> > >>>> when that did not get us to ther requested KV issue a reseek.
> > >>>> I have also thought of a near/far designation of seeks. For near
> seeks
> > >>>> we'd do a configurable number of next()'s first, then seek.
> > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> > >>>>
> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > >>>>
> > >>>> I'll look at the trace a bit closers.
> > >>>> So far my scan profiling has been focused on data in the blockcache
> > >> since
> > >>>> in the normal case the vast majority of all data is found there and
> > >> only
> > >>>> recent changes are in the memstore.
> > >>>>
> > >>>> -- Lars
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> ________________________________
> > >>>> From: Varun Sharma <va...@pinterest.com>
> > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > >>> dev@hbase.apache.org"
> > >>>> <de...@hbase.apache.org>
> > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> > >>>>
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> We are seeing some unfortunately low performance in the memstore -
> we
> > >>> have
> > >>>> researched some of the previous JIRA(s) and seen some inefficiencies
> > in
> > >>> the
> > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
> cpu
> > >> at
> > >>>> weird points in time - the bug is hard to reproduce and there isn't
> > >> like
> > >>> a
> > >>>> huge # of extra reads going to that region server or any substantial
> > >>>> hotspot happening. The region server recovers the moment, we flush
> the
> > >>>> memstores or restart the region server. Our queries retrieve wide
> rows
> > >>>> which are upto 10-20 columns. A stack trace shows two things:
> > >>>>
> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> > >>>> ConcurrentSkipListMap
> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > >>>> StoreScanner - this is understandable since the rows contain many
> > >> columns
> > >>>> and StoreScanner iterates one KeyValue at a time.
> > >>>>
> > >>>> So, I was looking at the code and it seems that every single time
> > there
> > >>> is
> > >>>> a reseek call on the same memstore scanner, we make a fresh call to
> > >> build
> > >>>> an iterator() on the skip list set - this means we an additional
> skip
> > >>> list
> > >>>> lookup for every column retrieved. SkipList lookups are O(n) and not
> > >>> O(1).
> > >>>>
> > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> > >>> number
> > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> > reverted
> > >>> by
> > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
> skip
> > >>> list
> > >>>> lookup rather than being an iterator.
> > >>>>
> > >>>> Are there any strong reasons against having the previous behaviour
> of
> > >>>> scanning a small # of keys before degenerating to a skip list
> lookup ?
> > >>>> Seems like it would really help for sequential memstore scans and
> for
> > >>>> memstore gets with wide tables (even 10-20 columns).
> > >>>>
> > >>>> Thanks
> > >>>> Varun
> > >>
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Ted Yu <yu...@gmail.com>.
bq. Now, clearly there will be columns above the delete marker which are
smaller than the ones below it.

This is where closer look is needed. Part of the confusion arises from
usage of > and < in your example.
(ROW, COL2, T=3) would sort before (ROW, COL1, T=1).

Here, in terms of sort order, 'above' means before. 'below it' would mean
after. So 'smaller' would mean before.

Cheers


On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Ted,
>
> Not satisfied with your answer, the document you sent does not talk about
> Delete ColumnFamily marker sort order. For the delete family marker to
> work, it has to mask *all* columns of a family. Hence it has to be above
> all the older columns. All the new columns must come above this column
> family delete marker. Now, clearly there will be columns above the delete
> marker which are smaller than the ones below it.
>
> The document talks nothing about delete marker order, could you answer the
> question by looking through the example ?
>
> Varun
>
>
> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > Varun:
> > Take a look at http://hbase.apache.org/book.html#dm.sort
> >
> > There's no contradiction.
> >
> > Cheers
> >
> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
> >
> > > Actually, I now have another question because of the way our work load
> is
> > > structured. We use a wide schema and each time we write, we delete the
> > > entire row and write a fresh set of columns - we want to make sure no
> old
> > > columns survive. So, I just want to see if my picture of the memstore
> at
> > > this point is correct or not. My understanding is that Memstore is
> > > basically a skip list of keyvalues and compares the values using
> KeyValue
> > > comparator
> > >
> > > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> > >
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > > memstore has - my understanding is that we do not delete in the
> memstore
> > > but only add markers:
> > >
> > > (ROW, <DELETE>, T=2)
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > 3) Now we write our new fresh row for *T=3* - it should get inserted
> > above
> > > the delete.
> > >
> > > (ROW, COL1, T=3)
> > > (ROW, COL2, T=3)
> > > (ROW, COL3, T=3)
> > > (ROW, <DELETE>, T=2)
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > This is the ideal scenario for the data to be correctly reflected.
> > >
> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and
> hence,
> > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> > >
> > > But, we also know that KeyValues compare first by ROW, then by Column
> and
> > > then by timestamp in reverse order
> > >
> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> > >
> > > This seems to be contradicting and my main worry is that in a skip
> list,
> > it
> > > is quite possible for skipping to happen as you go through the high
> level
> > > express lanes and it could be possible for one of these entries to
> never
> > > actually even see the delete marker. For example consider the case
> above
> > > where entry #1 and entry #5 form the higher level of the skip list and
> > the
> > > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3)
> and
> > it
> > > could end up in the wrong location.
> > >
> > > Obviously, if we cleanse all the row contents when a get a ROW level
> > delete
> > > marker, we are fine but I want to know if that is the case. If we are
> not
> > > really cleansing all the row contents when we get a ROW level delete
> > > marker, then I want to know why the above scenario can not lead to bugs
> > :)
> > >
> > > Varun
> > >
> > >
> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > > <vl...@gmail.com>wrote:
> > >
> > >> Varun,
> > >>
> > >> There is no need to open new JIRA - there are two already:
> > >> https://issues.apache.org/jira/browse/HBASE-9769
> > >> https://issues.apache.org/jira/browse/HBASE-9778
> > >>
> > >> Both with patches, you can grab and test them.
> > >>
> > >> -Vladimir
> > >>
> > >>
> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> > wrote:
> > >>
> > >>> Hi lars,
> > >>>
> > >>> Thanks for the background. It seems that for our case, we will have
> to
> > >>> consider some solution like the Facebook one, since the next column
> is
> > >>> always the next one - this can be a simple flag. I am going to raise
> a
> > >> JIRA
> > >>> and we can discuss there.
> > >>>
> > >>> Thanks
> > >>> Varun
> > >>>
> > >>>
> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > >>>
> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
> in
> > >>>> soon. :)
> > >>>>
> > >>>> Reseek is expensive compared to next if next would get us the KV
> we're
> > >>>> looking for. However, HBase does not know that ahead of time. There
> > >> might
> > >>>> be a 1000 versions of the previous KV to be skipped first.
> > >>>> HBase seeks in three situation:
> > >>>> 1. Seek to the next column (there might be a lot of versions to
> skip)
> > >>>> 2. Seek to the next row (there might be a lot of versions and other
> > >>>> columns to skip)
> > >>>> 3. Seek to a row via a hint
> > >>>>
> > >>>> #3 is definitely useful, with that one can implement very efficient
> > >> "skip
> > >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > >>>> #2 is helpful if there are many columns and one only "selects" a few
> > >> (and
> > >>>> of course also if there are many versions of columns)
> > >>>> #1 is only helpful when we expect there to be many versions. Or of
> the
> > >>>> size of a typical KV aproaches the block size, since then we'd need
> to
> > >>> seek
> > >>>> to the find the next block anyway.
> > >>>>
> > >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
> > >> that
> > >>>> just the number of column you return?
> > >>>>
> > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> instruct
> > >>> the
> > >>>> scanner to not seek to the next column or the next row, but just
> issue
> > >>>> next()'s until the KV is found. Another suggested approach (I think
> by
> > >>> the
> > >>>> Facebook guys) was to issue next() opportunistically a few times,
> and
> > >>> only
> > >>>> when that did not get us to ther requested KV issue a reseek.
> > >>>> I have also thought of a near/far designation of seeks. For near
> seeks
> > >>>> we'd do a configurable number of next()'s first, then seek.
> > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> > >>>>
> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > >>>>
> > >>>> I'll look at the trace a bit closers.
> > >>>> So far my scan profiling has been focused on data in the blockcache
> > >> since
> > >>>> in the normal case the vast majority of all data is found there and
> > >> only
> > >>>> recent changes are in the memstore.
> > >>>>
> > >>>> -- Lars
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> ________________________________
> > >>>> From: Varun Sharma <va...@pinterest.com>
> > >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > >>> dev@hbase.apache.org"
> > >>>> <de...@hbase.apache.org>
> > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> > >>>>
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> We are seeing some unfortunately low performance in the memstore -
> we
> > >>> have
> > >>>> researched some of the previous JIRA(s) and seen some inefficiencies
> > in
> > >>> the
> > >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 %
> cpu
> > >> at
> > >>>> weird points in time - the bug is hard to reproduce and there isn't
> > >> like
> > >>> a
> > >>>> huge # of extra reads going to that region server or any substantial
> > >>>> hotspot happening. The region server recovers the moment, we flush
> the
> > >>>> memstores or restart the region server. Our queries retrieve wide
> rows
> > >>>> which are upto 10-20 columns. A stack trace shows two things:
> > >>>>
> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> > >>>> ConcurrentSkipListMap
> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > >>>> StoreScanner - this is understandable since the rows contain many
> > >> columns
> > >>>> and StoreScanner iterates one KeyValue at a time.
> > >>>>
> > >>>> So, I was looking at the code and it seems that every single time
> > there
> > >>> is
> > >>>> a reseek call on the same memstore scanner, we make a fresh call to
> > >> build
> > >>>> an iterator() on the skip list set - this means we an additional
> skip
> > >>> list
> > >>>> lookup for every column retrieved. SkipList lookups are O(n) and not
> > >>> O(1).
> > >>>>
> > >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> > >>> number
> > >>>> if exceeded, do a lookup. However, it seems this behaviour was
> > reverted
> > >>> by
> > >>>> HBASE 4195 and every next row/next column is now a reseek() and a
> skip
> > >>> list
> > >>>> lookup rather than being an iterator.
> > >>>>
> > >>>> Are there any strong reasons against having the previous behaviour
> of
> > >>>> scanning a small # of keys before degenerating to a skip list
> lookup ?
> > >>>> Seems like it would really help for sequential memstore scans and
> for
> > >>>> memstore gets with wide tables (even 10-20 columns).
> > >>>>
> > >>>> Thanks
> > >>>> Varun
> > >>
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Hi Ted,

Not satisfied with your answer, the document you sent does not talk about
Delete ColumnFamily marker sort order. For the delete family marker to
work, it has to mask *all* columns of a family. Hence it has to be above
all the older columns. All the new columns must come above this column
family delete marker. Now, clearly there will be columns above the delete
marker which are smaller than the ones below it.

The document talks nothing about delete marker order, could you answer the
question by looking through the example ?

Varun


On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:

> Varun:
> Take a look at http://hbase.apache.org/book.html#dm.sort
>
> There's no contradiction.
>
> Cheers
>
> On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Actually, I now have another question because of the way our work load is
> > structured. We use a wide schema and each time we write, we delete the
> > entire row and write a fresh set of columns - we want to make sure no old
> > columns survive. So, I just want to see if my picture of the memstore at
> > this point is correct or not. My understanding is that Memstore is
> > basically a skip list of keyvalues and compares the values using KeyValue
> > comparator
> >
> > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> >
> > (ROW, COL1, T=1)
> > (ROW, COL2, T=1)
> > (ROW, COL3, T=1)
> >
> > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > memstore has - my understanding is that we do not delete in the memstore
> > but only add markers:
> >
> > (ROW, <DELETE>, T=2)
> > (ROW, COL1, T=1)
> > (ROW, COL2, T=1)
> > (ROW, COL3, T=1)
> >
> > 3) Now we write our new fresh row for *T=3* - it should get inserted
> above
> > the delete.
> >
> > (ROW, COL1, T=3)
> > (ROW, COL2, T=3)
> > (ROW, COL3, T=3)
> > (ROW, <DELETE>, T=2)
> > (ROW, COL1, T=1)
> > (ROW, COL2, T=1)
> > (ROW, COL3, T=1)
> >
> > This is the ideal scenario for the data to be correctly reflected.
> >
> > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and hence,
> > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> >
> > But, we also know that KeyValues compare first by ROW, then by Column and
> > then by timestamp in reverse order
> >
> > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> >
> > This seems to be contradicting and my main worry is that in a skip list,
> it
> > is quite possible for skipping to happen as you go through the high level
> > express lanes and it could be possible for one of these entries to never
> > actually even see the delete marker. For example consider the case above
> > where entry #1 and entry #5 form the higher level of the skip list and
> the
> > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3) and
> it
> > could end up in the wrong location.
> >
> > Obviously, if we cleanse all the row contents when a get a ROW level
> delete
> > marker, we are fine but I want to know if that is the case. If we are not
> > really cleansing all the row contents when we get a ROW level delete
> > marker, then I want to know why the above scenario can not lead to bugs
> :)
> >
> > Varun
> >
> >
> > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > <vl...@gmail.com>wrote:
> >
> >> Varun,
> >>
> >> There is no need to open new JIRA - there are two already:
> >> https://issues.apache.org/jira/browse/HBASE-9769
> >> https://issues.apache.org/jira/browse/HBASE-9778
> >>
> >> Both with patches, you can grab and test them.
> >>
> >> -Vladimir
> >>
> >>
> >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >>
> >>> Hi lars,
> >>>
> >>> Thanks for the background. It seems that for our case, we will have to
> >>> consider some solution like the Facebook one, since the next column is
> >>> always the next one - this can be a simple flag. I am going to raise a
> >> JIRA
> >>> and we can discuss there.
> >>>
> >>> Thanks
> >>> Varun
> >>>
> >>>
> >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> wrote:
> >>>
> >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime in
> >>>> soon. :)
> >>>>
> >>>> Reseek is expensive compared to next if next would get us the KV we're
> >>>> looking for. However, HBase does not know that ahead of time. There
> >> might
> >>>> be a 1000 versions of the previous KV to be skipped first.
> >>>> HBase seeks in three situation:
> >>>> 1. Seek to the next column (there might be a lot of versions to skip)
> >>>> 2. Seek to the next row (there might be a lot of versions and other
> >>>> columns to skip)
> >>>> 3. Seek to a row via a hint
> >>>>
> >>>> #3 is definitely useful, with that one can implement very efficient
> >> "skip
> >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> >>>> #2 is helpful if there are many columns and one only "selects" a few
> >> (and
> >>>> of course also if there are many versions of columns)
> >>>> #1 is only helpful when we expect there to be many versions. Or of the
> >>>> size of a typical KV aproaches the block size, since then we'd need to
> >>> seek
> >>>> to the find the next block anyway.
> >>>>
> >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
> >> that
> >>>> just the number of column you return?
> >>>>
> >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
> >>> the
> >>>> scanner to not seek to the next column or the next row, but just issue
> >>>> next()'s until the KV is found. Another suggested approach (I think by
> >>> the
> >>>> Facebook guys) was to issue next() opportunistically a few times, and
> >>> only
> >>>> when that did not get us to ther requested KV issue a reseek.
> >>>> I have also thought of a near/far designation of seeks. For near seeks
> >>>> we'd do a configurable number of next()'s first, then seek.
> >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> >>>>
> >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> >>>>
> >>>> I'll look at the trace a bit closers.
> >>>> So far my scan profiling has been focused on data in the blockcache
> >> since
> >>>> in the normal case the vast majority of all data is found there and
> >> only
> >>>> recent changes are in the memstore.
> >>>>
> >>>> -- Lars
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: Varun Sharma <va...@pinterest.com>
> >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> >>> dev@hbase.apache.org"
> >>>> <de...@hbase.apache.org>
> >>>> Sent: Sunday, January 26, 2014 1:14 PM
> >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> We are seeing some unfortunately low performance in the memstore - we
> >>> have
> >>>> researched some of the previous JIRA(s) and seen some inefficiencies
> in
> >>> the
> >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu
> >> at
> >>>> weird points in time - the bug is hard to reproduce and there isn't
> >> like
> >>> a
> >>>> huge # of extra reads going to that region server or any substantial
> >>>> hotspot happening. The region server recovers the moment, we flush the
> >>>> memstores or restart the region server. Our queries retrieve wide rows
> >>>> which are upto 10-20 columns. A stack trace shows two things:
> >>>>
> >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> >>>> ConcurrentSkipListMap
> >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> >>>> StoreScanner - this is understandable since the rows contain many
> >> columns
> >>>> and StoreScanner iterates one KeyValue at a time.
> >>>>
> >>>> So, I was looking at the code and it seems that every single time
> there
> >>> is
> >>>> a reseek call on the same memstore scanner, we make a fresh call to
> >> build
> >>>> an iterator() on the skip list set - this means we an additional skip
> >>> list
> >>>> lookup for every column retrieved. SkipList lookups are O(n) and not
> >>> O(1).
> >>>>
> >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> >>> number
> >>>> if exceeded, do a lookup. However, it seems this behaviour was
> reverted
> >>> by
> >>>> HBASE 4195 and every next row/next column is now a reseek() and a skip
> >>> list
> >>>> lookup rather than being an iterator.
> >>>>
> >>>> Are there any strong reasons against having the previous behaviour of
> >>>> scanning a small # of keys before degenerating to a skip list lookup ?
> >>>> Seems like it would really help for sequential memstore scans and for
> >>>> memstore gets with wide tables (even 10-20 columns).
> >>>>
> >>>> Thanks
> >>>> Varun
> >>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Hi Ted,

Not satisfied with your answer, the document you sent does not talk about
Delete ColumnFamily marker sort order. For the delete family marker to
work, it has to mask *all* columns of a family. Hence it has to be above
all the older columns. All the new columns must come above this column
family delete marker. Now, clearly there will be columns above the delete
marker which are smaller than the ones below it.

The document talks nothing about delete marker order, could you answer the
question by looking through the example ?

Varun


On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yu...@gmail.com> wrote:

> Varun:
> Take a look at http://hbase.apache.org/book.html#dm.sort
>
> There's no contradiction.
>
> Cheers
>
> On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Actually, I now have another question because of the way our work load is
> > structured. We use a wide schema and each time we write, we delete the
> > entire row and write a fresh set of columns - we want to make sure no old
> > columns survive. So, I just want to see if my picture of the memstore at
> > this point is correct or not. My understanding is that Memstore is
> > basically a skip list of keyvalues and compares the values using KeyValue
> > comparator
> >
> > 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> >
> > (ROW, COL1, T=1)
> > (ROW, COL2, T=1)
> > (ROW, COL3, T=1)
> >
> > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > memstore has - my understanding is that we do not delete in the memstore
> > but only add markers:
> >
> > (ROW, <DELETE>, T=2)
> > (ROW, COL1, T=1)
> > (ROW, COL2, T=1)
> > (ROW, COL3, T=1)
> >
> > 3) Now we write our new fresh row for *T=3* - it should get inserted
> above
> > the delete.
> >
> > (ROW, COL1, T=3)
> > (ROW, COL2, T=3)
> > (ROW, COL3, T=3)
> > (ROW, <DELETE>, T=2)
> > (ROW, COL1, T=1)
> > (ROW, COL2, T=1)
> > (ROW, COL3, T=1)
> >
> > This is the ideal scenario for the data to be correctly reflected.
> >
> > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and hence,
> > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> >
> > But, we also know that KeyValues compare first by ROW, then by Column and
> > then by timestamp in reverse order
> >
> > *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> >
> > This seems to be contradicting and my main worry is that in a skip list,
> it
> > is quite possible for skipping to happen as you go through the high level
> > express lanes and it could be possible for one of these entries to never
> > actually even see the delete marker. For example consider the case above
> > where entry #1 and entry #5 form the higher level of the skip list and
> the
> > skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3) and
> it
> > could end up in the wrong location.
> >
> > Obviously, if we cleanse all the row contents when a get a ROW level
> delete
> > marker, we are fine but I want to know if that is the case. If we are not
> > really cleansing all the row contents when we get a ROW level delete
> > marker, then I want to know why the above scenario can not lead to bugs
> :)
> >
> > Varun
> >
> >
> > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > <vl...@gmail.com>wrote:
> >
> >> Varun,
> >>
> >> There is no need to open new JIRA - there are two already:
> >> https://issues.apache.org/jira/browse/HBASE-9769
> >> https://issues.apache.org/jira/browse/HBASE-9778
> >>
> >> Both with patches, you can grab and test them.
> >>
> >> -Vladimir
> >>
> >>
> >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >>
> >>> Hi lars,
> >>>
> >>> Thanks for the background. It seems that for our case, we will have to
> >>> consider some solution like the Facebook one, since the next column is
> >>> always the next one - this can be a simple flag. I am going to raise a
> >> JIRA
> >>> and we can discuss there.
> >>>
> >>> Thanks
> >>> Varun
> >>>
> >>>
> >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org>
> wrote:
> >>>
> >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime in
> >>>> soon. :)
> >>>>
> >>>> Reseek is expensive compared to next if next would get us the KV we're
> >>>> looking for. However, HBase does not know that ahead of time. There
> >> might
> >>>> be a 1000 versions of the previous KV to be skipped first.
> >>>> HBase seeks in three situation:
> >>>> 1. Seek to the next column (there might be a lot of versions to skip)
> >>>> 2. Seek to the next row (there might be a lot of versions and other
> >>>> columns to skip)
> >>>> 3. Seek to a row via a hint
> >>>>
> >>>> #3 is definitely useful, with that one can implement very efficient
> >> "skip
> >>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> >>>> #2 is helpful if there are many columns and one only "selects" a few
> >> (and
> >>>> of course also if there are many versions of columns)
> >>>> #1 is only helpful when we expect there to be many versions. Or of the
> >>>> size of a typical KV aproaches the block size, since then we'd need to
> >>> seek
> >>>> to the find the next block anyway.
> >>>>
> >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
> >> that
> >>>> just the number of column you return?
> >>>>
> >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
> >>> the
> >>>> scanner to not seek to the next column or the next row, but just issue
> >>>> next()'s until the KV is found. Another suggested approach (I think by
> >>> the
> >>>> Facebook guys) was to issue next() opportunistically a few times, and
> >>> only
> >>>> when that did not get us to ther requested KV issue a reseek.
> >>>> I have also thought of a near/far designation of seeks. For near seeks
> >>>> we'd do a configurable number of next()'s first, then seek.
> >>>> "near" seeks would be those of category #1 (and maybe #2) above.
> >>>>
> >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> >>>>
> >>>> I'll look at the trace a bit closers.
> >>>> So far my scan profiling has been focused on data in the blockcache
> >> since
> >>>> in the normal case the vast majority of all data is found there and
> >> only
> >>>> recent changes are in the memstore.
> >>>>
> >>>> -- Lars
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: Varun Sharma <va...@pinterest.com>
> >>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> >>> dev@hbase.apache.org"
> >>>> <de...@hbase.apache.org>
> >>>> Sent: Sunday, January 26, 2014 1:14 PM
> >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> We are seeing some unfortunately low performance in the memstore - we
> >>> have
> >>>> researched some of the previous JIRA(s) and seen some inefficiencies
> in
> >>> the
> >>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu
> >> at
> >>>> weird points in time - the bug is hard to reproduce and there isn't
> >> like
> >>> a
> >>>> huge # of extra reads going to that region server or any substantial
> >>>> hotspot happening. The region server recovers the moment, we flush the
> >>>> memstores or restart the region server. Our queries retrieve wide rows
> >>>> which are upto 10-20 columns. A stack trace shows two things:
> >>>>
> >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> >>>> ConcurrentSkipListMap
> >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> >>>> StoreScanner - this is understandable since the rows contain many
> >> columns
> >>>> and StoreScanner iterates one KeyValue at a time.
> >>>>
> >>>> So, I was looking at the code and it seems that every single time
> there
> >>> is
> >>>> a reseek call on the same memstore scanner, we make a fresh call to
> >> build
> >>>> an iterator() on the skip list set - this means we an additional skip
> >>> list
> >>>> lookup for every column retrieved. SkipList lookups are O(n) and not
> >>> O(1).
> >>>>
> >>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> >>> number
> >>>> if exceeded, do a lookup. However, it seems this behaviour was
> reverted
> >>> by
> >>>> HBASE 4195 and every next row/next column is now a reseek() and a skip
> >>> list
> >>>> lookup rather than being an iterator.
> >>>>
> >>>> Are there any strong reasons against having the previous behaviour of
> >>>> scanning a small # of keys before degenerating to a skip list lookup ?
> >>>> Seems like it would really help for sequential memstore scans and for
> >>>> memstore gets with wide tables (even 10-20 columns).
> >>>>
> >>>> Thanks
> >>>> Varun
> >>
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Ted Yu <yu...@gmail.com>.
Varun:
Take a look at http://hbase.apache.org/book.html#dm.sort

There's no contradiction. 

Cheers

On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:

> Actually, I now have another question because of the way our work load is
> structured. We use a wide schema and each time we write, we delete the
> entire row and write a fresh set of columns - we want to make sure no old
> columns survive. So, I just want to see if my picture of the memstore at
> this point is correct or not. My understanding is that Memstore is
> basically a skip list of keyvalues and compares the values using KeyValue
> comparator
> 
> 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> 
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
> 
> 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> memstore has - my understanding is that we do not delete in the memstore
> but only add markers:
> 
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
> 
> 3) Now we write our new fresh row for *T=3* - it should get inserted above
> the delete.
> 
> (ROW, COL1, T=3)
> (ROW, COL2, T=3)
> (ROW, COL3, T=3)
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
> 
> This is the ideal scenario for the data to be correctly reflected.
> 
> (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and hence,
> *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> 
> But, we also know that KeyValues compare first by ROW, then by Column and
> then by timestamp in reverse order
> 
> *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> 
> This seems to be contradicting and my main worry is that in a skip list, it
> is quite possible for skipping to happen as you go through the high level
> express lanes and it could be possible for one of these entries to never
> actually even see the delete marker. For example consider the case above
> where entry #1 and entry #5 form the higher level of the skip list and the
> skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3) and it
> could end up in the wrong location.
> 
> Obviously, if we cleanse all the row contents when a get a ROW level delete
> marker, we are fine but I want to know if that is the case. If we are not
> really cleansing all the row contents when we get a ROW level delete
> marker, then I want to know why the above scenario can not lead to bugs :)
> 
> Varun
> 
> 
> On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
> 
>> Varun,
>> 
>> There is no need to open new JIRA - there are two already:
>> https://issues.apache.org/jira/browse/HBASE-9769
>> https://issues.apache.org/jira/browse/HBASE-9778
>> 
>> Both with patches, you can grab and test them.
>> 
>> -Vladimir
>> 
>> 
>> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com> wrote:
>> 
>>> Hi lars,
>>> 
>>> Thanks for the background. It seems that for our case, we will have to
>>> consider some solution like the Facebook one, since the next column is
>>> always the next one - this can be a simple flag. I am going to raise a
>> JIRA
>>> and we can discuss there.
>>> 
>>> Thanks
>>> Varun
>>> 
>>> 
>>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> This is somewhat of a known issue, and I'm sure Vladimir will chime in
>>>> soon. :)
>>>> 
>>>> Reseek is expensive compared to next if next would get us the KV we're
>>>> looking for. However, HBase does not know that ahead of time. There
>> might
>>>> be a 1000 versions of the previous KV to be skipped first.
>>>> HBase seeks in three situation:
>>>> 1. Seek to the next column (there might be a lot of versions to skip)
>>>> 2. Seek to the next row (there might be a lot of versions and other
>>>> columns to skip)
>>>> 3. Seek to a row via a hint
>>>> 
>>>> #3 is definitely useful, with that one can implement very efficient
>> "skip
>>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>>> #2 is helpful if there are many columns and one only "selects" a few
>> (and
>>>> of course also if there are many versions of columns)
>>>> #1 is only helpful when we expect there to be many versions. Or of the
>>>> size of a typical KV aproaches the block size, since then we'd need to
>>> seek
>>>> to the find the next block anyway.
>>>> 
>>>> You might well be a victim of #1. Are your rows 10-20 columns or is
>> that
>>>> just the number of column you return?
>>>> 
>>>> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
>>> the
>>>> scanner to not seek to the next column or the next row, but just issue
>>>> next()'s until the KV is found. Another suggested approach (I think by
>>> the
>>>> Facebook guys) was to issue next() opportunistically a few times, and
>>> only
>>>> when that did not get us to ther requested KV issue a reseek.
>>>> I have also thought of a near/far designation of seeks. For near seeks
>>>> we'd do a configurable number of next()'s first, then seek.
>>>> "near" seeks would be those of category #1 (and maybe #2) above.
>>>> 
>>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>>>> 
>>>> I'll look at the trace a bit closers.
>>>> So far my scan profiling has been focused on data in the blockcache
>> since
>>>> in the normal case the vast majority of all data is found there and
>> only
>>>> recent changes are in the memstore.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Varun Sharma <va...@pinterest.com>
>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
>>> dev@hbase.apache.org"
>>>> <de...@hbase.apache.org>
>>>> Sent: Sunday, January 26, 2014 1:14 PM
>>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> We are seeing some unfortunately low performance in the memstore - we
>>> have
>>>> researched some of the previous JIRA(s) and seen some inefficiencies in
>>> the
>>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu
>> at
>>>> weird points in time - the bug is hard to reproduce and there isn't
>> like
>>> a
>>>> huge # of extra reads going to that region server or any substantial
>>>> hotspot happening. The region server recovers the moment, we flush the
>>>> memstores or restart the region server. Our queries retrieve wide rows
>>>> which are upto 10-20 columns. A stack trace shows two things:
>>>> 
>>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>>> ConcurrentSkipListMap
>>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>>> StoreScanner - this is understandable since the rows contain many
>> columns
>>>> and StoreScanner iterates one KeyValue at a time.
>>>> 
>>>> So, I was looking at the code and it seems that every single time there
>>> is
>>>> a reseek call on the same memstore scanner, we make a fresh call to
>> build
>>>> an iterator() on the skip list set - this means we an additional skip
>>> list
>>>> lookup for every column retrieved. SkipList lookups are O(n) and not
>>> O(1).
>>>> 
>>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
>>> number
>>>> if exceeded, do a lookup. However, it seems this behaviour was reverted
>>> by
>>>> HBASE 4195 and every next row/next column is now a reseek() and a skip
>>> list
>>>> lookup rather than being an iterator.
>>>> 
>>>> Are there any strong reasons against having the previous behaviour of
>>>> scanning a small # of keys before degenerating to a skip list lookup ?
>>>> Seems like it would really help for sequential memstore scans and for
>>>> memstore gets with wide tables (even 10-20 columns).
>>>> 
>>>> Thanks
>>>> Varun
>> 

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Ted Yu <yu...@gmail.com>.
Varun:
Take a look at http://hbase.apache.org/book.html#dm.sort

There's no contradiction. 

Cheers

On Jan 27, 2014, at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:

> Actually, I now have another question because of the way our work load is
> structured. We use a wide schema and each time we write, we delete the
> entire row and write a fresh set of columns - we want to make sure no old
> columns survive. So, I just want to see if my picture of the memstore at
> this point is correct or not. My understanding is that Memstore is
> basically a skip list of keyvalues and compares the values using KeyValue
> comparator
> 
> 1) *T=1, *We write 3 columns for "ROW". So memstore has:
> 
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
> 
> 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> memstore has - my understanding is that we do not delete in the memstore
> but only add markers:
> 
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
> 
> 3) Now we write our new fresh row for *T=3* - it should get inserted above
> the delete.
> 
> (ROW, COL1, T=3)
> (ROW, COL2, T=3)
> (ROW, COL3, T=3)
> (ROW, <DELETE>, T=2)
> (ROW, COL1, T=1)
> (ROW, COL2, T=1)
> (ROW, COL3, T=1)
> 
> This is the ideal scenario for the data to be correctly reflected.
> 
> (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and hence,
> *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> 
> But, we also know that KeyValues compare first by ROW, then by Column and
> then by timestamp in reverse order
> 
> *(ROW, COL2, T=3) < (ROW, COL1, T=1) *
> 
> This seems to be contradicting and my main worry is that in a skip list, it
> is quite possible for skipping to happen as you go through the high level
> express lanes and it could be possible for one of these entries to never
> actually even see the delete marker. For example consider the case above
> where entry #1 and entry #5 form the higher level of the skip list and the
> skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3) and it
> could end up in the wrong location.
> 
> Obviously, if we cleanse all the row contents when a get a ROW level delete
> marker, we are fine but I want to know if that is the case. If we are not
> really cleansing all the row contents when we get a ROW level delete
> marker, then I want to know why the above scenario can not lead to bugs :)
> 
> Varun
> 
> 
> On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
> 
>> Varun,
>> 
>> There is no need to open new JIRA - there are two already:
>> https://issues.apache.org/jira/browse/HBASE-9769
>> https://issues.apache.org/jira/browse/HBASE-9778
>> 
>> Both with patches, you can grab and test them.
>> 
>> -Vladimir
>> 
>> 
>> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com> wrote:
>> 
>>> Hi lars,
>>> 
>>> Thanks for the background. It seems that for our case, we will have to
>>> consider some solution like the Facebook one, since the next column is
>>> always the next one - this can be a simple flag. I am going to raise a
>> JIRA
>>> and we can discuss there.
>>> 
>>> Thanks
>>> Varun
>>> 
>>> 
>>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:
>>> 
>>>> This is somewhat of a known issue, and I'm sure Vladimir will chime in
>>>> soon. :)
>>>> 
>>>> Reseek is expensive compared to next if next would get us the KV we're
>>>> looking for. However, HBase does not know that ahead of time. There
>> might
>>>> be a 1000 versions of the previous KV to be skipped first.
>>>> HBase seeks in three situation:
>>>> 1. Seek to the next column (there might be a lot of versions to skip)
>>>> 2. Seek to the next row (there might be a lot of versions and other
>>>> columns to skip)
>>>> 3. Seek to a row via a hint
>>>> 
>>>> #3 is definitely useful, with that one can implement very efficient
>> "skip
>>>> scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>>> #2 is helpful if there are many columns and one only "selects" a few
>> (and
>>>> of course also if there are many versions of columns)
>>>> #1 is only helpful when we expect there to be many versions. Or of the
>>>> size of a typical KV aproaches the block size, since then we'd need to
>>> seek
>>>> to the find the next block anyway.
>>>> 
>>>> You might well be a victim of #1. Are your rows 10-20 columns or is
>> that
>>>> just the number of column you return?
>>>> 
>>>> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
>>> the
>>>> scanner to not seek to the next column or the next row, but just issue
>>>> next()'s until the KV is found. Another suggested approach (I think by
>>> the
>>>> Facebook guys) was to issue next() opportunistically a few times, and
>>> only
>>>> when that did not get us to ther requested KV issue a reseek.
>>>> I have also thought of a near/far designation of seeks. For near seeks
>>>> we'd do a configurable number of next()'s first, then seek.
>>>> "near" seeks would be those of category #1 (and maybe #2) above.
>>>> 
>>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>>>> 
>>>> I'll look at the trace a bit closers.
>>>> So far my scan profiling has been focused on data in the blockcache
>> since
>>>> in the normal case the vast majority of all data is found there and
>> only
>>>> recent changes are in the memstore.
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Varun Sharma <va...@pinterest.com>
>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
>>> dev@hbase.apache.org"
>>>> <de...@hbase.apache.org>
>>>> Sent: Sunday, January 26, 2014 1:14 PM
>>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> We are seeing some unfortunately low performance in the memstore - we
>>> have
>>>> researched some of the previous JIRA(s) and seen some inefficiencies in
>>> the
>>>> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu
>> at
>>>> weird points in time - the bug is hard to reproduce and there isn't
>> like
>>> a
>>>> huge # of extra reads going to that region server or any substantial
>>>> hotspot happening. The region server recovers the moment, we flush the
>>>> memstores or restart the region server. Our queries retrieve wide rows
>>>> which are upto 10-20 columns. A stack trace shows two things:
>>>> 
>>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>>> ConcurrentSkipListMap
>>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>>> StoreScanner - this is understandable since the rows contain many
>> columns
>>>> and StoreScanner iterates one KeyValue at a time.
>>>> 
>>>> So, I was looking at the code and it seems that every single time there
>>> is
>>>> a reseek call on the same memstore scanner, we make a fresh call to
>> build
>>>> an iterator() on the skip list set - this means we an additional skip
>>> list
>>>> lookup for every column retrieved. SkipList lookups are O(n) and not
>>> O(1).
>>>> 
>>>> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
>>> number
>>>> if exceeded, do a lookup. However, it seems this behaviour was reverted
>>> by
>>>> HBASE 4195 and every next row/next column is now a reseek() and a skip
>>> list
>>>> lookup rather than being an iterator.
>>>> 
>>>> Are there any strong reasons against having the previous behaviour of
>>>> scanning a small # of keys before degenerating to a skip list lookup ?
>>>> Seems like it would really help for sequential memstore scans and for
>>>> memstore gets with wide tables (even 10-20 columns).
>>>> 
>>>> Thanks
>>>> Varun
>> 

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Actually, I now have another question because of the way our work load is
structured. We use a wide schema and each time we write, we delete the
entire row and write a fresh set of columns - we want to make sure no old
columns survive. So, I just want to see if my picture of the memstore at
this point is correct or not. My understanding is that Memstore is
basically a skip list of keyvalues and compares the values using KeyValue
comparator

1) *T=1, *We write 3 columns for "ROW". So memstore has:

(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
memstore has - my understanding is that we do not delete in the memstore
but only add markers:

(ROW, <DELETE>, T=2)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

3) Now we write our new fresh row for *T=3* - it should get inserted above
the delete.

(ROW, COL1, T=3)
(ROW, COL2, T=3)
(ROW, COL3, T=3)
(ROW, <DELETE>, T=2)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

This is the ideal scenario for the data to be correctly reflected.

(ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and hence,
*(ROW, COL2, T=3) > (ROW, COL1, T=1)*

But, we also know that KeyValues compare first by ROW, then by Column and
then by timestamp in reverse order

*(ROW, COL2, T=3) < (ROW, COL1, T=1) *

This seems to be contradicting and my main worry is that in a skip list, it
is quite possible for skipping to happen as you go through the high level
express lanes and it could be possible for one of these entries to never
actually even see the delete marker. For example consider the case above
where entry #1 and entry #5 form the higher level of the skip list and the
skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3) and it
could end up in the wrong location.

Obviously, if we cleanse all the row contents when a get a ROW level delete
marker, we are fine but I want to know if that is the case. If we are not
really cleansing all the row contents when we get a ROW level delete
marker, then I want to know why the above scenario can not lead to bugs :)

Varun


On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> Varun,
>
> There is no need to open new JIRA - there are two already:
> https://issues.apache.org/jira/browse/HBASE-9769
> https://issues.apache.org/jira/browse/HBASE-9778
>
> Both with patches, you can grab and test them.
>
> -Vladimir
>
>
> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi lars,
> >
> > Thanks for the background. It seems that for our case, we will have to
> > consider some solution like the Facebook one, since the next column is
> > always the next one - this can be a simple flag. I am going to raise a
> JIRA
> > and we can discuss there.
> >
> > Thanks
> > Varun
> >
> >
> > On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > This is somewhat of a known issue, and I'm sure Vladimir will chime in
> > > soon. :)
> > >
> > > Reseek is expensive compared to next if next would get us the KV we're
> > > looking for. However, HBase does not know that ahead of time. There
> might
> > > be a 1000 versions of the previous KV to be skipped first.
> > > HBase seeks in three situation:
> > > 1. Seek to the next column (there might be a lot of versions to skip)
> > > 2. Seek to the next row (there might be a lot of versions and other
> > > columns to skip)
> > > 3. Seek to a row via a hint
> > >
> > > #3 is definitely useful, with that one can implement very efficient
> "skip
> > > scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > > #2 is helpful if there are many columns and one only "selects" a few
> (and
> > > of course also if there are many versions of columns)
> > > #1 is only helpful when we expect there to be many versions. Or of the
> > > size of a typical KV aproaches the block size, since then we'd need to
> > seek
> > > to the find the next block anyway.
> > >
> > > You might well be a victim of #1. Are your rows 10-20 columns or is
> that
> > > just the number of column you return?
> > >
> > > Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
> > the
> > > scanner to not seek to the next column or the next row, but just issue
> > > next()'s until the KV is found. Another suggested approach (I think by
> > the
> > > Facebook guys) was to issue next() opportunistically a few times, and
> > only
> > > when that did not get us to ther requested KV issue a reseek.
> > > I have also thought of a near/far designation of seeks. For near seeks
> > > we'd do a configurable number of next()'s first, then seek.
> > > "near" seeks would be those of category #1 (and maybe #2) above.
> > >
> > > See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > >
> > > I'll look at the trace a bit closers.
> > > So far my scan profiling has been focused on data in the blockcache
> since
> > > in the normal case the vast majority of all data is found there and
> only
> > > recent changes are in the memstore.
> > >
> > > -- Lars
> > >
> > >
> > >
> > >
> > > ________________________________
> > >  From: Varun Sharma <va...@pinterest.com>
> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > dev@hbase.apache.org"
> > > <de...@hbase.apache.org>
> > > Sent: Sunday, January 26, 2014 1:14 PM
> > > Subject: Sporadic memstore slowness for Read Heavy workloads
> > >
> > >
> > > Hi,
> > >
> > > We are seeing some unfortunately low performance in the memstore - we
> > have
> > > researched some of the previous JIRA(s) and seen some inefficiencies in
> > the
> > > ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu
> at
> > > weird points in time - the bug is hard to reproduce and there isn't
> like
> > a
> > > huge # of extra reads going to that region server or any substantial
> > > hotspot happening. The region server recovers the moment, we flush the
> > > memstores or restart the region server. Our queries retrieve wide rows
> > > which are upto 10-20 columns. A stack trace shows two things:
> > >
> > > 1) Time spent inside MemstoreScanner.reseek() and inside the
> > > ConcurrentSkipListMap
> > > 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > > StoreScanner - this is understandable since the rows contain many
> columns
> > > and StoreScanner iterates one KeyValue at a time.
> > >
> > > So, I was looking at the code and it seems that every single time there
> > is
> > > a reseek call on the same memstore scanner, we make a fresh call to
> build
> > > an iterator() on the skip list set - this means we an additional skip
> > list
> > > lookup for every column retrieved. SkipList lookups are O(n) and not
> > O(1).
> > >
> > > Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> > number
> > > if exceeded, do a lookup. However, it seems this behaviour was reverted
> > by
> > > HBASE 4195 and every next row/next column is now a reseek() and a skip
> > list
> > > lookup rather than being an iterator.
> > >
> > > Are there any strong reasons against having the previous behaviour of
> > > scanning a small # of keys before degenerating to a skip list lookup ?
> > > Seems like it would really help for sequential memstore scans and for
> > > memstore gets with wide tables (even 10-20 columns).
> > >
> > > Thanks
> > > Varun
> > >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Actually, I now have another question because of the way our work load is
structured. We use a wide schema and each time we write, we delete the
entire row and write a fresh set of columns - we want to make sure no old
columns survive. So, I just want to see if my picture of the memstore at
this point is correct or not. My understanding is that Memstore is
basically a skip list of keyvalues and compares the values using KeyValue
comparator

1) *T=1, *We write 3 columns for "ROW". So memstore has:

(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
memstore has - my understanding is that we do not delete in the memstore
but only add markers:

(ROW, <DELETE>, T=2)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

3) Now we write our new fresh row for *T=3* - it should get inserted above
the delete.

(ROW, COL1, T=3)
(ROW, COL2, T=3)
(ROW, COL3, T=3)
(ROW, <DELETE>, T=2)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

This is the ideal scenario for the data to be correctly reflected.

(ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *> *(ROW, COL1, T=1) and hence,
*(ROW, COL2, T=3) > (ROW, COL1, T=1)*

But, we also know that KeyValues compare first by ROW, then by Column and
then by timestamp in reverse order

*(ROW, COL2, T=3) < (ROW, COL1, T=1) *

This seems to be contradicting and my main worry is that in a skip list, it
is quite possible for skipping to happen as you go through the high level
express lanes and it could be possible for one of these entries to never
actually even see the delete marker. For example consider the case above
where entry #1 and entry #5 form the higher level of the skip list and the
skip list has 2 levels. Now someone tries to insert (ROW, COL4, T=3) and it
could end up in the wrong location.

Obviously, if we cleanse all the row contents when a get a ROW level delete
marker, we are fine but I want to know if that is the case. If we are not
really cleansing all the row contents when we get a ROW level delete
marker, then I want to know why the above scenario can not lead to bugs :)

Varun


On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> Varun,
>
> There is no need to open new JIRA - there are two already:
> https://issues.apache.org/jira/browse/HBASE-9769
> https://issues.apache.org/jira/browse/HBASE-9778
>
> Both with patches, you can grab and test them.
>
> -Vladimir
>
>
> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi lars,
> >
> > Thanks for the background. It seems that for our case, we will have to
> > consider some solution like the Facebook one, since the next column is
> > always the next one - this can be a simple flag. I am going to raise a
> JIRA
> > and we can discuss there.
> >
> > Thanks
> > Varun
> >
> >
> > On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > This is somewhat of a known issue, and I'm sure Vladimir will chime in
> > > soon. :)
> > >
> > > Reseek is expensive compared to next if next would get us the KV we're
> > > looking for. However, HBase does not know that ahead of time. There
> might
> > > be a 1000 versions of the previous KV to be skipped first.
> > > HBase seeks in three situation:
> > > 1. Seek to the next column (there might be a lot of versions to skip)
> > > 2. Seek to the next row (there might be a lot of versions and other
> > > columns to skip)
> > > 3. Seek to a row via a hint
> > >
> > > #3 is definitely useful, with that one can implement very efficient
> "skip
> > > scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > > #2 is helpful if there are many columns and one only "selects" a few
> (and
> > > of course also if there are many versions of columns)
> > > #1 is only helpful when we expect there to be many versions. Or of the
> > > size of a typical KV aproaches the block size, since then we'd need to
> > seek
> > > to the find the next block anyway.
> > >
> > > You might well be a victim of #1. Are your rows 10-20 columns or is
> that
> > > just the number of column you return?
> > >
> > > Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
> > the
> > > scanner to not seek to the next column or the next row, but just issue
> > > next()'s until the KV is found. Another suggested approach (I think by
> > the
> > > Facebook guys) was to issue next() opportunistically a few times, and
> > only
> > > when that did not get us to ther requested KV issue a reseek.
> > > I have also thought of a near/far designation of seeks. For near seeks
> > > we'd do a configurable number of next()'s first, then seek.
> > > "near" seeks would be those of category #1 (and maybe #2) above.
> > >
> > > See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > >
> > > I'll look at the trace a bit closers.
> > > So far my scan profiling has been focused on data in the blockcache
> since
> > > in the normal case the vast majority of all data is found there and
> only
> > > recent changes are in the memstore.
> > >
> > > -- Lars
> > >
> > >
> > >
> > >
> > > ________________________________
> > >  From: Varun Sharma <va...@pinterest.com>
> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> > dev@hbase.apache.org"
> > > <de...@hbase.apache.org>
> > > Sent: Sunday, January 26, 2014 1:14 PM
> > > Subject: Sporadic memstore slowness for Read Heavy workloads
> > >
> > >
> > > Hi,
> > >
> > > We are seeing some unfortunately low performance in the memstore - we
> > have
> > > researched some of the previous JIRA(s) and seen some inefficiencies in
> > the
> > > ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu
> at
> > > weird points in time - the bug is hard to reproduce and there isn't
> like
> > a
> > > huge # of extra reads going to that region server or any substantial
> > > hotspot happening. The region server recovers the moment, we flush the
> > > memstores or restart the region server. Our queries retrieve wide rows
> > > which are upto 10-20 columns. A stack trace shows two things:
> > >
> > > 1) Time spent inside MemstoreScanner.reseek() and inside the
> > > ConcurrentSkipListMap
> > > 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > > StoreScanner - this is understandable since the rows contain many
> columns
> > > and StoreScanner iterates one KeyValue at a time.
> > >
> > > So, I was looking at the code and it seems that every single time there
> > is
> > > a reseek call on the same memstore scanner, we make a fresh call to
> build
> > > an iterator() on the skip list set - this means we an additional skip
> > list
> > > lookup for every column retrieved. SkipList lookups are O(n) and not
> > O(1).
> > >
> > > Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> > number
> > > if exceeded, do a lookup. However, it seems this behaviour was reverted
> > by
> > > HBASE 4195 and every next row/next column is now a reseek() and a skip
> > list
> > > lookup rather than being an iterator.
> > >
> > > Are there any strong reasons against having the previous behaviour of
> > > scanning a small # of keys before degenerating to a skip list lookup ?
> > > Seems like it would really help for sequential memstore scans and for
> > > memstore gets with wide tables (even 10-20 columns).
> > >
> > > Thanks
> > > Varun
> > >
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Vladimir Rodionov <vl...@gmail.com>.
Varun,

There is no need to open new JIRA - there are two already:
https://issues.apache.org/jira/browse/HBASE-9769
https://issues.apache.org/jira/browse/HBASE-9778

Both with patches, you can grab and test them.

-Vladimir


On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi lars,
>
> Thanks for the background. It seems that for our case, we will have to
> consider some solution like the Facebook one, since the next column is
> always the next one - this can be a simple flag. I am going to raise a JIRA
> and we can discuss there.
>
> Thanks
> Varun
>
>
> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:
>
> > This is somewhat of a known issue, and I'm sure Vladimir will chime in
> > soon. :)
> >
> > Reseek is expensive compared to next if next would get us the KV we're
> > looking for. However, HBase does not know that ahead of time. There might
> > be a 1000 versions of the previous KV to be skipped first.
> > HBase seeks in three situation:
> > 1. Seek to the next column (there might be a lot of versions to skip)
> > 2. Seek to the next row (there might be a lot of versions and other
> > columns to skip)
> > 3. Seek to a row via a hint
> >
> > #3 is definitely useful, with that one can implement very efficient "skip
> > scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > #2 is helpful if there are many columns and one only "selects" a few (and
> > of course also if there are many versions of columns)
> > #1 is only helpful when we expect there to be many versions. Or of the
> > size of a typical KV aproaches the block size, since then we'd need to
> seek
> > to the find the next block anyway.
> >
> > You might well be a victim of #1. Are your rows 10-20 columns or is that
> > just the number of column you return?
> >
> > Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
> the
> > scanner to not seek to the next column or the next row, but just issue
> > next()'s until the KV is found. Another suggested approach (I think by
> the
> > Facebook guys) was to issue next() opportunistically a few times, and
> only
> > when that did not get us to ther requested KV issue a reseek.
> > I have also thought of a near/far designation of seeks. For near seeks
> > we'd do a configurable number of next()'s first, then seek.
> > "near" seeks would be those of category #1 (and maybe #2) above.
> >
> > See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> >
> > I'll look at the trace a bit closers.
> > So far my scan profiling has been focused on data in the blockcache since
> > in the normal case the vast majority of all data is found there and only
> > recent changes are in the memstore.
> >
> > -- Lars
> >
> >
> >
> >
> > ________________________________
> >  From: Varun Sharma <va...@pinterest.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> dev@hbase.apache.org"
> > <de...@hbase.apache.org>
> > Sent: Sunday, January 26, 2014 1:14 PM
> > Subject: Sporadic memstore slowness for Read Heavy workloads
> >
> >
> > Hi,
> >
> > We are seeing some unfortunately low performance in the memstore - we
> have
> > researched some of the previous JIRA(s) and seen some inefficiencies in
> the
> > ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
> > weird points in time - the bug is hard to reproduce and there isn't like
> a
> > huge # of extra reads going to that region server or any substantial
> > hotspot happening. The region server recovers the moment, we flush the
> > memstores or restart the region server. Our queries retrieve wide rows
> > which are upto 10-20 columns. A stack trace shows two things:
> >
> > 1) Time spent inside MemstoreScanner.reseek() and inside the
> > ConcurrentSkipListMap
> > 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > StoreScanner - this is understandable since the rows contain many columns
> > and StoreScanner iterates one KeyValue at a time.
> >
> > So, I was looking at the code and it seems that every single time there
> is
> > a reseek call on the same memstore scanner, we make a fresh call to build
> > an iterator() on the skip list set - this means we an additional skip
> list
> > lookup for every column retrieved. SkipList lookups are O(n) and not
> O(1).
> >
> > Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> number
> > if exceeded, do a lookup. However, it seems this behaviour was reverted
> by
> > HBASE 4195 and every next row/next column is now a reseek() and a skip
> list
> > lookup rather than being an iterator.
> >
> > Are there any strong reasons against having the previous behaviour of
> > scanning a small # of keys before degenerating to a skip list lookup ?
> > Seems like it would really help for sequential memstore scans and for
> > memstore gets with wide tables (even 10-20 columns).
> >
> > Thanks
> > Varun
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Vladimir Rodionov <vl...@gmail.com>.
Varun,

There is no need to open new JIRA - there are two already:
https://issues.apache.org/jira/browse/HBASE-9769
https://issues.apache.org/jira/browse/HBASE-9778

Both with patches, you can grab and test them.

-Vladimir


On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi lars,
>
> Thanks for the background. It seems that for our case, we will have to
> consider some solution like the Facebook one, since the next column is
> always the next one - this can be a simple flag. I am going to raise a JIRA
> and we can discuss there.
>
> Thanks
> Varun
>
>
> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:
>
> > This is somewhat of a known issue, and I'm sure Vladimir will chime in
> > soon. :)
> >
> > Reseek is expensive compared to next if next would get us the KV we're
> > looking for. However, HBase does not know that ahead of time. There might
> > be a 1000 versions of the previous KV to be skipped first.
> > HBase seeks in three situation:
> > 1. Seek to the next column (there might be a lot of versions to skip)
> > 2. Seek to the next row (there might be a lot of versions and other
> > columns to skip)
> > 3. Seek to a row via a hint
> >
> > #3 is definitely useful, with that one can implement very efficient "skip
> > scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > #2 is helpful if there are many columns and one only "selects" a few (and
> > of course also if there are many versions of columns)
> > #1 is only helpful when we expect there to be many versions. Or of the
> > size of a typical KV aproaches the block size, since then we'd need to
> seek
> > to the find the next block anyway.
> >
> > You might well be a victim of #1. Are your rows 10-20 columns or is that
> > just the number of column you return?
> >
> > Vladimir and myself have suggested a SMALL_ROW hint, where we instruct
> the
> > scanner to not seek to the next column or the next row, but just issue
> > next()'s until the KV is found. Another suggested approach (I think by
> the
> > Facebook guys) was to issue next() opportunistically a few times, and
> only
> > when that did not get us to ther requested KV issue a reseek.
> > I have also thought of a near/far designation of seeks. For near seeks
> > we'd do a configurable number of next()'s first, then seek.
> > "near" seeks would be those of category #1 (and maybe #2) above.
> >
> > See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> >
> > I'll look at the trace a bit closers.
> > So far my scan profiling has been focused on data in the blockcache since
> > in the normal case the vast majority of all data is found there and only
> > recent changes are in the memstore.
> >
> > -- Lars
> >
> >
> >
> >
> > ________________________________
> >  From: Varun Sharma <va...@pinterest.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>; "
> dev@hbase.apache.org"
> > <de...@hbase.apache.org>
> > Sent: Sunday, January 26, 2014 1:14 PM
> > Subject: Sporadic memstore slowness for Read Heavy workloads
> >
> >
> > Hi,
> >
> > We are seeing some unfortunately low performance in the memstore - we
> have
> > researched some of the previous JIRA(s) and seen some inefficiencies in
> the
> > ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
> > weird points in time - the bug is hard to reproduce and there isn't like
> a
> > huge # of extra reads going to that region server or any substantial
> > hotspot happening. The region server recovers the moment, we flush the
> > memstores or restart the region server. Our queries retrieve wide rows
> > which are upto 10-20 columns. A stack trace shows two things:
> >
> > 1) Time spent inside MemstoreScanner.reseek() and inside the
> > ConcurrentSkipListMap
> > 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > StoreScanner - this is understandable since the rows contain many columns
> > and StoreScanner iterates one KeyValue at a time.
> >
> > So, I was looking at the code and it seems that every single time there
> is
> > a reseek call on the same memstore scanner, we make a fresh call to build
> > an iterator() on the skip list set - this means we an additional skip
> list
> > lookup for every column retrieved. SkipList lookups are O(n) and not
> O(1).
> >
> > Related JIRA HBASE 3855 made the reseek() scan some KVs and if that
> number
> > if exceeded, do a lookup. However, it seems this behaviour was reverted
> by
> > HBASE 4195 and every next row/next column is now a reseek() and a skip
> list
> > lookup rather than being an iterator.
> >
> > Are there any strong reasons against having the previous behaviour of
> > scanning a small # of keys before degenerating to a skip list lookup ?
> > Seems like it would really help for sequential memstore scans and for
> > memstore gets with wide tables (even 10-20 columns).
> >
> > Thanks
> > Varun
> >
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Hi lars,

Thanks for the background. It seems that for our case, we will have to
consider some solution like the Facebook one, since the next column is
always the next one - this can be a simple flag. I am going to raise a JIRA
and we can discuss there.

Thanks
Varun


On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:

> This is somewhat of a known issue, and I'm sure Vladimir will chime in
> soon. :)
>
> Reseek is expensive compared to next if next would get us the KV we're
> looking for. However, HBase does not know that ahead of time. There might
> be a 1000 versions of the previous KV to be skipped first.
> HBase seeks in three situation:
> 1. Seek to the next column (there might be a lot of versions to skip)
> 2. Seek to the next row (there might be a lot of versions and other
> columns to skip)
> 3. Seek to a row via a hint
>
> #3 is definitely useful, with that one can implement very efficient "skip
> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> #2 is helpful if there are many columns and one only "selects" a few (and
> of course also if there are many versions of columns)
> #1 is only helpful when we expect there to be many versions. Or of the
> size of a typical KV aproaches the block size, since then we'd need to seek
> to the find the next block anyway.
>
> You might well be a victim of #1. Are your rows 10-20 columns or is that
> just the number of column you return?
>
> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct the
> scanner to not seek to the next column or the next row, but just issue
> next()'s until the KV is found. Another suggested approach (I think by the
> Facebook guys) was to issue next() opportunistically a few times, and only
> when that did not get us to ther requested KV issue a reseek.
> I have also thought of a near/far designation of seeks. For near seeks
> we'd do a configurable number of next()'s first, then seek.
> "near" seeks would be those of category #1 (and maybe #2) above.
>
> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>
> I'll look at the trace a bit closers.
> So far my scan profiling has been focused on data in the blockcache since
> in the normal case the vast majority of all data is found there and only
> recent changes are in the memstore.
>
> -- Lars
>
>
>
>
> ________________________________
>  From: Varun Sharma <va...@pinterest.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "dev@hbase.apache.org"
> <de...@hbase.apache.org>
> Sent: Sunday, January 26, 2014 1:14 PM
> Subject: Sporadic memstore slowness for Read Heavy workloads
>
>
> Hi,
>
> We are seeing some unfortunately low performance in the memstore - we have
> researched some of the previous JIRA(s) and seen some inefficiencies in the
> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
> weird points in time - the bug is hard to reproduce and there isn't like a
> huge # of extra reads going to that region server or any substantial
> hotspot happening. The region server recovers the moment, we flush the
> memstores or restart the region server. Our queries retrieve wide rows
> which are upto 10-20 columns. A stack trace shows two things:
>
> 1) Time spent inside MemstoreScanner.reseek() and inside the
> ConcurrentSkipListMap
> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> StoreScanner - this is understandable since the rows contain many columns
> and StoreScanner iterates one KeyValue at a time.
>
> So, I was looking at the code and it seems that every single time there is
> a reseek call on the same memstore scanner, we make a fresh call to build
> an iterator() on the skip list set - this means we an additional skip list
> lookup for every column retrieved. SkipList lookups are O(n) and not O(1).
>
> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that number
> if exceeded, do a lookup. However, it seems this behaviour was reverted by
> HBASE 4195 and every next row/next column is now a reseek() and a skip list
> lookup rather than being an iterator.
>
> Are there any strong reasons against having the previous behaviour of
> scanning a small # of keys before degenerating to a skip list lookup ?
> Seems like it would really help for sequential memstore scans and for
> memstore gets with wide tables (even 10-20 columns).
>
> Thanks
> Varun
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by Varun Sharma <va...@pinterest.com>.
Hi lars,

Thanks for the background. It seems that for our case, we will have to
consider some solution like the Facebook one, since the next column is
always the next one - this can be a simple flag. I am going to raise a JIRA
and we can discuss there.

Thanks
Varun


On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <la...@apache.org> wrote:

> This is somewhat of a known issue, and I'm sure Vladimir will chime in
> soon. :)
>
> Reseek is expensive compared to next if next would get us the KV we're
> looking for. However, HBase does not know that ahead of time. There might
> be a 1000 versions of the previous KV to be skipped first.
> HBase seeks in three situation:
> 1. Seek to the next column (there might be a lot of versions to skip)
> 2. Seek to the next row (there might be a lot of versions and other
> columns to skip)
> 3. Seek to a row via a hint
>
> #3 is definitely useful, with that one can implement very efficient "skip
> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> #2 is helpful if there are many columns and one only "selects" a few (and
> of course also if there are many versions of columns)
> #1 is only helpful when we expect there to be many versions. Or of the
> size of a typical KV aproaches the block size, since then we'd need to seek
> to the find the next block anyway.
>
> You might well be a victim of #1. Are your rows 10-20 columns or is that
> just the number of column you return?
>
> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct the
> scanner to not seek to the next column or the next row, but just issue
> next()'s until the KV is found. Another suggested approach (I think by the
> Facebook guys) was to issue next() opportunistically a few times, and only
> when that did not get us to ther requested KV issue a reseek.
> I have also thought of a near/far designation of seeks. For near seeks
> we'd do a configurable number of next()'s first, then seek.
> "near" seeks would be those of category #1 (and maybe #2) above.
>
> See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
>
> I'll look at the trace a bit closers.
> So far my scan profiling has been focused on data in the blockcache since
> in the normal case the vast majority of all data is found there and only
> recent changes are in the memstore.
>
> -- Lars
>
>
>
>
> ________________________________
>  From: Varun Sharma <va...@pinterest.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>; "dev@hbase.apache.org"
> <de...@hbase.apache.org>
> Sent: Sunday, January 26, 2014 1:14 PM
> Subject: Sporadic memstore slowness for Read Heavy workloads
>
>
> Hi,
>
> We are seeing some unfortunately low performance in the memstore - we have
> researched some of the previous JIRA(s) and seen some inefficiencies in the
> ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
> weird points in time - the bug is hard to reproduce and there isn't like a
> huge # of extra reads going to that region server or any substantial
> hotspot happening. The region server recovers the moment, we flush the
> memstores or restart the region server. Our queries retrieve wide rows
> which are upto 10-20 columns. A stack trace shows two things:
>
> 1) Time spent inside MemstoreScanner.reseek() and inside the
> ConcurrentSkipListMap
> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> StoreScanner - this is understandable since the rows contain many columns
> and StoreScanner iterates one KeyValue at a time.
>
> So, I was looking at the code and it seems that every single time there is
> a reseek call on the same memstore scanner, we make a fresh call to build
> an iterator() on the skip list set - this means we an additional skip list
> lookup for every column retrieved. SkipList lookups are O(n) and not O(1).
>
> Related JIRA HBASE 3855 made the reseek() scan some KVs and if that number
> if exceeded, do a lookup. However, it seems this behaviour was reverted by
> HBASE 4195 and every next row/next column is now a reseek() and a skip list
> lookup rather than being an iterator.
>
> Are there any strong reasons against having the previous behaviour of
> scanning a small # of keys before degenerating to a skip list lookup ?
> Seems like it would really help for sequential memstore scans and for
> memstore gets with wide tables (even 10-20 columns).
>
> Thanks
> Varun
>

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by lars hofhansl <la...@apache.org>.
This is somewhat of a known issue, and I'm sure Vladimir will chime in soon. :)

Reseek is expensive compared to next if next would get us the KV we're looking for. However, HBase does not know that ahead of time. There might be a 1000 versions of the previous KV to be skipped first.
HBase seeks in three situation:
1. Seek to the next column (there might be a lot of versions to skip)
2. Seek to the next row (there might be a lot of versions and other columns to skip)
3. Seek to a row via a hint

#3 is definitely useful, with that one can implement very efficient "skip scans" (see the FuzzyRowFilter and what Phoenix is doing).
#2 is helpful if there are many columns and one only "selects" a few (and of course also if there are many versions of columns)
#1 is only helpful when we expect there to be many versions. Or of the size of a typical KV aproaches the block size, since then we'd need to seek to the find the next block anyway.

You might well be a victim of #1. Are your rows 10-20 columns or is that just the number of column you return?

Vladimir and myself have suggested a SMALL_ROW hint, where we instruct the scanner to not seek to the next column or the next row, but just issue next()'s until the KV is found. Another suggested approach (I think by the Facebook guys) was to issue next() opportunistically a few times, and only when that did not get us to ther requested KV issue a reseek.
I have also thought of a near/far designation of seeks. For near seeks we'd do a configurable number of next()'s first, then seek.
"near" seeks would be those of category #1 (and maybe #2) above.

See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)

I'll look at the trace a bit closers.
So far my scan profiling has been focused on data in the blockcache since in the normal case the vast majority of all data is found there and only recent changes are in the memstore.

-- Lars




________________________________
 From: Varun Sharma <va...@pinterest.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>; "dev@hbase.apache.org" <de...@hbase.apache.org> 
Sent: Sunday, January 26, 2014 1:14 PM
Subject: Sporadic memstore slowness for Read Heavy workloads
 

Hi,

We are seeing some unfortunately low performance in the memstore - we have
researched some of the previous JIRA(s) and seen some inefficiencies in the
ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
weird points in time - the bug is hard to reproduce and there isn't like a
huge # of extra reads going to that region server or any substantial
hotspot happening. The region server recovers the moment, we flush the
memstores or restart the region server. Our queries retrieve wide rows
which are upto 10-20 columns. A stack trace shows two things:

1) Time spent inside MemstoreScanner.reseek() and inside the
ConcurrentSkipListMap
2) The reseek() is being called at the "SEEK_NEXT" column inside
StoreScanner - this is understandable since the rows contain many columns
and StoreScanner iterates one KeyValue at a time.

So, I was looking at the code and it seems that every single time there is
a reseek call on the same memstore scanner, we make a fresh call to build
an iterator() on the skip list set - this means we an additional skip list
lookup for every column retrieved. SkipList lookups are O(n) and not O(1).

Related JIRA HBASE 3855 made the reseek() scan some KVs and if that number
if exceeded, do a lookup. However, it seems this behaviour was reverted by
HBASE 4195 and every next row/next column is now a reseek() and a skip list
lookup rather than being an iterator.

Are there any strong reasons against having the previous behaviour of
scanning a small # of keys before degenerating to a skip list lookup ?
Seems like it would really help for sequential memstore scans and for
memstore gets with wide tables (even 10-20 columns).

Thanks
Varun

Re: Sporadic memstore slowness for Read Heavy workloads

Posted by lars hofhansl <la...@apache.org>.
This is somewhat of a known issue, and I'm sure Vladimir will chime in soon. :)

Reseek is expensive compared to next if next would get us the KV we're looking for. However, HBase does not know that ahead of time. There might be a 1000 versions of the previous KV to be skipped first.
HBase seeks in three situation:
1. Seek to the next column (there might be a lot of versions to skip)
2. Seek to the next row (there might be a lot of versions and other columns to skip)
3. Seek to a row via a hint

#3 is definitely useful, with that one can implement very efficient "skip scans" (see the FuzzyRowFilter and what Phoenix is doing).
#2 is helpful if there are many columns and one only "selects" a few (and of course also if there are many versions of columns)
#1 is only helpful when we expect there to be many versions. Or of the size of a typical KV aproaches the block size, since then we'd need to seek to the find the next block anyway.

You might well be a victim of #1. Are your rows 10-20 columns or is that just the number of column you return?

Vladimir and myself have suggested a SMALL_ROW hint, where we instruct the scanner to not seek to the next column or the next row, but just issue next()'s until the KV is found. Another suggested approach (I think by the Facebook guys) was to issue next() opportunistically a few times, and only when that did not get us to ther requested KV issue a reseek.
I have also thought of a near/far designation of seeks. For near seeks we'd do a configurable number of next()'s first, then seek.
"near" seeks would be those of category #1 (and maybe #2) above.

See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)

I'll look at the trace a bit closers.
So far my scan profiling has been focused on data in the blockcache since in the normal case the vast majority of all data is found there and only recent changes are in the memstore.

-- Lars




________________________________
 From: Varun Sharma <va...@pinterest.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>; "dev@hbase.apache.org" <de...@hbase.apache.org> 
Sent: Sunday, January 26, 2014 1:14 PM
Subject: Sporadic memstore slowness for Read Heavy workloads
 

Hi,

We are seeing some unfortunately low performance in the memstore - we have
researched some of the previous JIRA(s) and seen some inefficiencies in the
ConcurrentSkipListMap. The symptom is a RegionServer hitting 100 % cpu at
weird points in time - the bug is hard to reproduce and there isn't like a
huge # of extra reads going to that region server or any substantial
hotspot happening. The region server recovers the moment, we flush the
memstores or restart the region server. Our queries retrieve wide rows
which are upto 10-20 columns. A stack trace shows two things:

1) Time spent inside MemstoreScanner.reseek() and inside the
ConcurrentSkipListMap
2) The reseek() is being called at the "SEEK_NEXT" column inside
StoreScanner - this is understandable since the rows contain many columns
and StoreScanner iterates one KeyValue at a time.

So, I was looking at the code and it seems that every single time there is
a reseek call on the same memstore scanner, we make a fresh call to build
an iterator() on the skip list set - this means we an additional skip list
lookup for every column retrieved. SkipList lookups are O(n) and not O(1).

Related JIRA HBASE 3855 made the reseek() scan some KVs and if that number
if exceeded, do a lookup. However, it seems this behaviour was reverted by
HBASE 4195 and every next row/next column is now a reseek() and a skip list
lookup rather than being an iterator.

Are there any strong reasons against having the previous behaviour of
scanning a small # of keys before degenerating to a skip list lookup ?
Seems like it would really help for sequential memstore scans and for
memstore gets with wide tables (even 10-20 columns).

Thanks
Varun