Posted to user@hbase.apache.org by Solvannan R M <so...@zoho.com.INVALID> on 2019/09/10 15:35:43 UTC

HBase Scan consumes high cpu

Hi,

   We have been using HBase (1.4.9) for a use case where time-series data is continuously inserted and deleted (high churn) against a single rowkey. The column qualifiers more or less represent timestamps. When we scan this data using ColumnRangeFilter for a recent time range, the scanner for the stores (memstore & storefiles) has to go through contiguous deletes before it reaches the data in the requested time range. While running this scan, we noticed 100% CPU usage on a single core by the regionserver process.

So, for our case, most of the cells with older timestamps will be in a deleted state. While traversing these deleted cells, the regionserver process uses 100% CPU on a single core.

We tried to trace the scan code and observed the following behaviour:

1. When the scanner is initialized, it seeks all the store scanners to the start of the rowkey.
2. It then traverses the deleted cells and discards them one by one (as they are deleted).
3. When it encounters a valid cell (put type), it applies the filter, which returns SEEK_NEXT_USING_HINT.
4. The scanner then seeks directly to the required key and returns the results quickly.

To confirm this behaviour, we ran a test:
1. We populated a single rowkey with random data, using integer column qualifiers in the range 0 to 1500000.
2. We then deleted the column qualifiers in the range 0 to 1499000.
3. At this point the data is only in the memstore; no store file exists.
4. We scanned the rowkey with ColumnRangeFilter[1499000, 1499010).
5. The query took 12 seconds to execute. During the query, a single core was fully utilized.
6. We then put a new cell with qualifier 10.
7. We executed the same query again; it took 0.018 seconds.
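
The traversal described above can be sketched with a toy model, which may help show why one live cell near the start of the row changes the cost so much. This is not HBase code: the class and method names are hypothetical, qualifiers are modelled as plain array indexes, the delete range is assumed to exclude 1499000, and the new put at qualifier 10 is modelled as simply making that qualifier live again.

```java
// Toy model only: no HBase classes are used and all names are hypothetical.
// live[q] == true means the newest cell for qualifier q is a put;
// false means it is covered by a delete. Array order stands in for the
// sorted order of qualifiers within the row.
class DeleteScanModel {

    // Builds a row with qualifiers 0..total-1 where 0..deletedBelow-1 are deleted.
    static boolean[] buildRow(int total, int deletedBelow) {
        boolean[] live = new boolean[total];
        for (int q = deletedBelow; q < total; q++) {
            live[q] = true;
        }
        return live;
    }

    // Counts the cells the scanner looks at before it can seek to rangeStart:
    // it starts at the beginning of the row, discards deleted cells one at a
    // time, and stops at the first live cell, where the filter can return a
    // seek hint (or when it has reached the requested range anyway).
    static int cellsTouchedBeforeSeek(boolean[] live, int rangeStart) {
        int touched = 0;
        for (int q = 0; q < live.length; q++) {
            touched++;
            if (live[q] || q >= rangeStart) {
                break;
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        boolean[] row = buildRow(1_500_000, 1_499_000);
        System.out.println("before put: " + cellsTouchedBeforeSeek(row, 1_499_000));
        row[10] = true; // models the new put with qualifier 10
        System.out.println("after put:  " + cellsTouchedBeforeSeek(row, 1_499_000));
    }
}
```

In this model the first scan steps over roughly 1.5 million cells before it can seek, while the second stops after 11, which matches the direction of the 12 seconds versus 0.018 seconds we measured.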

Kindly check this and advise.

Regards,
Solvannan R M


Re: HBase Scan consumes high cpu

Posted by Josh Elser <el...@apache.org>.
Deletes are held in memory. They represent data you have to traverse 
until that data is flushed out to disk. When you write a new cell with a 
qualifier of 10, that sorts, lexicographically, "early" with respect to 
the other qualifiers you've written.
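
To illustrate the ordering point, here is a small sketch of an unsigned byte-by-byte comparison, which is how HBase orders qualifiers. I don't know how the qualifiers were encoded in your test, so both encodings below (decimal strings and 4-byte big-endian ints) are assumptions, and the class and helper names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: the class, helpers and both encodings are assumptions.
class QualifierOrder {

    // Unsigned lexicographic comparison; an array that is a prefix of the
    // other sorts first.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Qualifier written as a decimal string, e.g. "10".
    static byte[] asString(int q) {
        return Integer.toString(q).getBytes(StandardCharsets.UTF_8);
    }

    // Qualifier written as a 4-byte big-endian int.
    static byte[] asInt(int q) {
        return ByteBuffer.allocate(4).putInt(q).array();
    }

    public static void main(String[] args) {
        // 10 sorts before 1499000 under either encoding.
        System.out.println(compareBytes(asString(10), asString(1499000)) < 0);
        System.out.println(compareBytes(asInt(10), asInt(1499000)) < 0);
        // String encoding does not follow numeric order: "9" sorts after "10".
        System.out.println(compareBytes(asString(9), asString(10)) > 0);
    }
}
```

Either way, the new cell lands close to the start of the row, well before the range being scanned.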

By that measure, if you are only scanning for the first column in this 
row which you've loaded with deletes, it would make total sense to me 
that the first case is slow and the second case is fast.

Can you please share exactly how you execute your "query" for both (all) 
scenarios?
