Posted to issues@hbase.apache.org by "Chao Shi (JIRA)" <ji...@apache.org> on 2013/10/29 04:13:30 UTC

[jira] [Commented] (HBASE-9811) ColumnPaginationFilter is slow when offset is large

    [ https://issues.apache.org/jira/browse/HBASE-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807622#comment-13807622 ] 

Chao Shi commented on HBASE-9811:
---------------------------------

Here are some benchmark results.

Setup: 1 row with 1M columns, uniformly distributed across N hfiles.

Scan with a ColumnPaginationFilter, offset = 1M:
hfiles=1 993.71 ms
hfiles=2 2251.69 ms
hfiles=3 4090.0 ms
hfiles=4 5770.72 ms

With ColumnPaginationFilter changed to return SKIP rather than NEXT_COL:
hfiles=1 243.88 ms
hfiles=2 1833.41 ms
hfiles=3 3691.35 ms
hfiles=4 5498.54 ms
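
To make the comparison easier to read, the two runs can be reduced to a speedup ratio per hfile count. The sketch below only divides the latencies posted above; the class and method names are made up for illustration. The ratio comes out at roughly 4.1x for a single hfile and shrinks to about 1.05x for four.

```java
import java.util.Locale;

final class SpeedupTable {
    // Latencies in ms copied from the two runs above, indexed by hfiles = 1..4.
    static final double[] NEXT_COL_MS = {993.71, 2251.69, 4090.0, 5770.72};
    static final double[] SKIP_MS = {243.88, 1833.41, 3691.35, 5498.54};

    /** Ratio of the original latency to the SKIP latency for a 1-based hfile count. */
    static double speedup(int hfiles) {
        return NEXT_COL_MS[hfiles - 1] / SKIP_MS[hfiles - 1];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= NEXT_COL_MS.length; n++) {
            System.out.printf(Locale.ROOT, "hfiles=%d speedup=%.2fx%n", n, speedup(n));
        }
    }
}
```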

I think the figures above show 2 problems:
1) There is a huge improvement when there is only 1 hfile. This is the benefit of next over reseek.
2) There is not much improvement when there is more than one hfile. This may be due to the use of KeyValueHeap, as performance drops sharply as the number of hfiles grows.
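
As a rough illustration of point 2), here is a toy model of a heap-based merge over several sorted sources. It is not HBase code: java.util.PriorityQueue stands in for KeyValueHeap, plain ints stand in for column qualifiers, and the round-robin layout is an assumed worst case in which consecutive columns always live in different files. It only counts comparator calls, so it shows the trend (more sources, more comparisons per cell), not real latencies.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class MergeCostSketch {
    /** Cursor over one sorted run of qualifiers (a stand-in for one hfile scanner). */
    static final class Cursor {
        final int[] cells;
        int pos;
        Cursor(int[] cells) { this.cells = cells; }
        int peek() { return cells[pos]; }
        boolean advance() { return ++pos < cells.length; }
    }

    /** Merges `files` sorted runs holding `totalCells` cells; returns comparator calls made. */
    static long countComparisons(int totalCells, int files) {
        if (files < 1 || totalCells < files) {
            throw new IllegalArgumentException("need 1 <= files <= totalCells");
        }
        long[] comparisons = {0};
        Comparator<Cursor> byHead = (a, b) -> {
            comparisons[0]++;
            return Integer.compare(a.peek(), b.peek());
        };
        PriorityQueue<Cursor> heap = new PriorityQueue<>(byHead);
        // Round-robin layout: file f holds qualifiers f, f + files, f + 2*files, ...
        for (int f = 0; f < files; f++) {
            int size = (totalCells - f + files - 1) / files;
            int[] cells = new int[size];
            for (int i = 0; i < size; i++) {
                cells[i] = f + i * files;
            }
            heap.add(new Cursor(cells));
        }
        long merged = 0;
        while (!heap.isEmpty()) {
            Cursor top = heap.poll();   // smallest head across all sources
            merged++;
            if (top.advance()) {
                heap.add(top);          // re-insert with its new head
            }
        }
        if (merged != totalCells) {
            throw new IllegalStateException("merge lost cells");
        }
        return comparisons[0];
    }

    public static void main(String[] args) {
        for (int files = 1; files <= 4; files++) {
            System.out.println("files=" + files + " comparisons="
                    + countComparisons(1_000_000, files));
        }
    }
}
```

With a single source the heap never has to compare anything, which fits the observation that the 1-hfile case benefits most once reseek is out of the way.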

For problem 1), we can use a trick similar to the one in HBASE-9769.
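
To see why next can beat reseek when the target is the adjacent cell, here is a deliberately simplified cost model. It is an assumption-laden sketch, not a description of HBase's scanner internals or of the HBASE-9769 patch: "next" is modeled as stepping to the neighbouring array slot, and "reseek" as a fresh binary search for the first qualifier greater than the current one. All names are hypothetical.

```java
final class NextVsReseekSketch {
    /** Cell reads needed to skip `offset` columns by stepping to the adjacent cell. */
    static long probesWithNext(int[] sortedQualifiers, int offset) {
        long probes = 0;
        int pos = 0;
        for (int i = 0; i < offset && pos < sortedQualifiers.length; i++) {
            pos++;      // the next column is simply the following cell
            probes++;
        }
        return probes;
    }

    /** Cell reads needed when every skip searches for the first qualifier > current. */
    static long probesWithReseek(int[] sortedQualifiers, int offset) {
        long probes = 0;
        int pos = 0;
        for (int i = 0; i < offset && pos < sortedQualifiers.length; i++) {
            int current = sortedQualifiers[pos];
            int lo = pos + 1;
            int hi = sortedQualifiers.length;   // search range is [lo, hi)
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                probes++;
                if (sortedQualifiers[mid] > current) {
                    hi = mid;
                } else {
                    lo = mid + 1;
                }
            }
            pos = lo;   // lands on the same cell that next would have reached
        }
        return probes;
    }

    public static void main(String[] args) {
        int[] qualifiers = new int[1 << 20];
        for (int i = 0; i < qualifiers.length; i++) {
            qualifiers[i] = i;
        }
        System.out.println("next probes:   " + probesWithNext(qualifiers, 1_000_000));
        System.out.println("reseek probes: " + probesWithReseek(qualifiers, 1_000_000));
    }
}
```

Both loops end on the same cell; only the work per skipped column differs, which is the gap a "try next before seeking" heuristic would aim to close.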

> ColumnPaginationFilter is slow when offset is large
> ---------------------------------------------------
>
>                 Key: HBASE-9811
>                 URL: https://issues.apache.org/jira/browse/HBASE-9811
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Chao Shi
>
> Hi there, we are trying to migrate an app from MySQL to HBase. One kind of query is pagination with a large offset and a small limit. We don't have many such queries, so both MySQL and HBase should survive. (MySQL has no index for offset either.)
> When comparing performance on the two systems, we found something interesting: write ~1M values in a single row, and query with offset = 1M, so all values have to be scanned on the RS side.
> When running the query on MySQL, the first run is pretty slow (more than 1 second), but repeating the same query gives very low latency.
> On HBase, on the other hand, repeating the query does not help much (~1s every time). I can confirm that all data is in the block cache and all the time is spent on in-memory data processing. (We have flushed data to disk.)
> I found that "reseek" is the hot spot. It is caused by ColumnPaginationFilter returning NEXT_COL. If I change this line to return SKIP (which causes next to be called rather than reseek), the latency drops to ~100ms.
> So I think there must be some room for optimization.


