Posted to dev@phoenix.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2014/03/03 00:00:20 UTC

[jira] [Comment Edited] (PHOENIX-76) Fix perf regression due to PHOENIX-29

    [ https://issues.apache.org/jira/browse/PHOENIX-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917593#comment-13917593 ] 

Lars Hofhansl edited comment on PHOENIX-76 at 3/2/14 10:59 PM:
---------------------------------------------------------------

I did some informal performance tests and found that seeking is about 5-10x as expensive as calling next() on the scanner. I tested with very small column values; with larger values next() becomes proportionally more expensive.
(Informal tests, because the exact outcome depends on the ratio of value size to key size and of KeyValue size to HFile block size.)

In addition, what counts is the number of *gaps*: consecutive selected columns have no extra cost beyond a call to next(), for example when the 3rd and 4th columns are selected together.

So seeking is preferable if each consecutive range of selected columns is followed by a gap of 5-10 columns and/or versions - for example selecting a single column out of a row where we expect 10 columns, selecting 2 out of 20, skipping a single column with 10 versions, or selecting columns 1, 2, 3 out of 15.
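As a minimal sketch of that rule of thumb (the class, method, and the threshold of 8 next()-equivalents are illustrative only, not Phoenix or HBase code):

{code:java}
/**
 * Illustrative only: decide whether to seek past a gap or keep calling next(),
 * based on the informal 5-10x seek-vs-next cost ratio measured above.
 */
public final class SeekHeuristic {

    // Rough cost of one seek expressed in next() calls (small values, data in blockcache).
    private static final int SEEK_COST_IN_NEXTS = 8;

    /**
     * @param kvsToSkip KeyValues (columns and/or versions) between the current
     *                  position and the next selected column
     * @return true if a seek is expected to be cheaper than next()-ing over the gap
     */
    public static boolean shouldSeek(int kvsToSkip) {
        return kvsToSkip > SEEK_COST_IN_NEXTS;
    }

    public static void main(String[] args) {
        System.out.println(shouldSeek(2));   // false: small gap, just next()
        System.out.println(shouldSeek(15));  // true: large gap, seek
    }
}
{code}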

Some interesting data:
* selecting the first columns consecutively is cheap (i.e. 1st, 2nd, 3rd, 4th columns)... just as fast as the wildcard column tracker
* selecting the 3rd, 4th, and 5th columns is hardly more expensive than just selecting the 3rd column alone
* (as said above) if a seek skips 5-10 KVs (i.e. columns or versions) we should seek - see the filter sketch below for how a filter expresses that choice per KV
* when column values are large (approaching the HFile block size of 64K) we should definitely seek
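This is not the filter that PHOENIX-29 actually added, just a minimal sketch against the HBase 1.x Filter API of how a filter expresses the step-vs-skip choice: INCLUDE means keep stepping with next(), NEXT_COL lets the scanner skip the rest of the current column (the class name and qualifier set here are made up):

{code:java}
import java.util.Arrays;
import java.util.Set;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.filter.FilterBase;

/**
 * Illustrative sketch, not the Phoenix filter: include KVs of selected qualifiers,
 * tell the scanner to jump past everything else. (A real filter also needs
 * toByteArray()/parseFrom() so it can be shipped to the region server.)
 */
public class SelectedColumnsFilter extends FilterBase {

    private final Set<byte[]> qualifiers;

    public SelectedColumnsFilter(Set<byte[]> qualifiers) {
        this.qualifiers = qualifiers;
    }

    @Override
    public ReturnCode filterKeyValue(Cell cell) {
        byte[] q = CellUtil.cloneQualifier(cell);
        for (byte[] wanted : qualifiers) {
            if (Arrays.equals(q, wanted)) {
                return ReturnCode.INCLUDE;  // selected column: emit and keep stepping
            }
        }
        return ReturnCode.NEXT_COL;         // not selected: skip remaining versions of this column
    }
}
{code}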

So the exact optimal cutoff point is a bit hard to determine. I worry that this (and PHOENIX-29) might be a bit of premature optimization based on too few performance tests. We might see terrible performance in many scenarios we have not tested, as outlined above.

What would really help is if we could make sure that the canonical column (right now it's "_", which sorts after capital letters) always sorts first... i.e. call it "$" or "!" or something. That should double the performance of count(1), for example.
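The sort-order point is plain unsigned byte comparison of the qualifier; a quick illustration (using HBase's Bytes utility):

{code:java}
import org.apache.hadoop.hbase.util.Bytes;

// "_" is 0x5F and sorts after "A"-"Z" (0x41-0x5A), so the canonical column comes
// after all upper-case qualifiers; "!" (0x21) or "$" (0x24) would sort before them.
public class QualifierOrder {
    public static void main(String[] args) {
        System.out.println(Bytes.compareTo(Bytes.toBytes("_"), Bytes.toBytes("A"))); // > 0
        System.out.println(Bytes.compareTo(Bytes.toBytes("!"), Bytes.toBytes("A"))); // < 0
        System.out.println(Bytes.compareTo(Bytes.toBytes("$"), Bytes.toBytes("A"))); // < 0
    }
}
{code}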

Numbers (not with Phoenix, but against HBase directly):
10m rows, 10 cols each, 8 bytes values, 10 bytes values, encoding = FAST_DIFF, exactly one version of each column, everything in the blockcache:
||Columns selected||none||1||1,2||1,2,3||2||2,3||2,3,4||2,4,6||1,2,3,4,6,7,8,9,10||1,2,3,4,5,6,7,8,9,10||
|Scan time/s|19.5|13.0|14.5|21.1|18.2|19.8|21.1|31.7|25.8|22.0|

We should do more tests with (1) more versions, (2) longer values, and (3) longer keys.
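For anyone who wants to repeat the measurement, this is roughly what such a run looks like with the plain HBase client API (table, family, and qualifier names are placeholders; the actual test harness may have differed):

{code:java}
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTiming {
    public static void main(String[] args) throws Exception {
        byte[] family = Bytes.toBytes("f");
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("perf_test"))) {

            Scan scan = new Scan();
            scan.setCaching(10000);  // large scanner caching so RPC round trips don't dominate
            // the "Columns selected" set, e.g. 2,4,6 from the table above
            for (String col : new String[] { "c2", "c4", "c6" }) {
                scan.addColumn(family, Bytes.toBytes(col));
            }

            long start = System.currentTimeMillis();
            long rows = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    rows++;  // just drain the scanner, we only care about the time
                }
            }
            System.out.println(rows + " rows in " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
{code}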



> Fix perf regression due to PHOENIX-29
> -------------------------------------
>
>                 Key: PHOENIX-76
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-76
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: James Taylor
>            Assignee: Anoop Sam John
>             Fix For: 3.0.0
>
>         Attachments: PHOENIX-76.patch
>
>
> Many queries got slower as a result of PHOENIX-29. There are a few simple checks we can do to prevent adding the new filter:
> - if the query is an aggregate query: we don't return KVs in that case, so the filter only adds processing we don't need. For this, you can check statement.isAggregate().
> - if there are multiple column families referenced in the where clause: the seek that gets done is better in that case, because we'd potentially be seeking over an entire store's worth of data into a different store.
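Not the actual patch, just a minimal sketch of those two checks from the description (statement.isAggregate() is the real Phoenix call mentioned above; everything else here, including the class and the column-family set, is a made-up stand-in):

{code:java}
import java.util.Set;

/**
 * Illustrative only: the two cheap guards described above, deciding whether the
 * PHOENIX-29 projection filter should be attached to the scan at all.
 */
public final class ProjectionFilterGuard {

    public static boolean shouldAddProjectionFilter(boolean isAggregate,
                                                    Set<byte[]> whereClauseFamilies) {
        if (isAggregate) {
            return false;  // aggregate queries don't return KVs, so the filter is wasted work
        }
        if (whereClauseFamilies.size() > 1) {
            return false;  // multiple CFs in the WHERE clause: seeking across stores is the better plan
        }
        return true;
    }
}
{code}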



--
This message was sent by Atlassian JIRA
(v6.2#6252)