Posted to issues@phoenix.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2019/02/27 06:33:00 UTC

[jira] [Commented] (PHOENIX-5157) Why is a Row timestamp column table slower than a regular table?

    [ https://issues.apache.org/jira/browse/PHOENIX-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778959#comment-16778959 ] 

Lars Hofhansl commented on PHOENIX-5157:
----------------------------------------

Could you share your code to load the data?
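Even a loader along these lines would be enough to reproduce. This is only a minimal sketch assuming plain JDBC batch upserts through the Phoenix driver; the connection URL, batch size, and generated values are placeholders, not your actual code:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class Loader {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZK quorum; replace with the real cluster address.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            conn.setAutoCommit(false); // Phoenix batches mutations until commit()
            String sql = "UPSERT INTO phoenix_dev.test_table2 "
                    + "(event_time, \"_KEY\", col1, col2, col3, col4, col5) "
                    + "VALUES (?, ?, ?, ?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                long start = Timestamp.valueOf("2018-10-21 00:00:00").getTime();
                for (long i = 0; i < 20_000_000L; i++) {
                    // On the ROW_TIMESTAMP table this value also becomes the HBase cell timestamp.
                    ps.setTimestamp(1, new Timestamp(start + i));
                    ps.setLong(2, i);
                    ps.setLong(3, i % 100);
                    ps.setLong(4, i % 1000);
                    ps.setLong(5, i % 10000);
                    ps.setLong(6, i);
                    ps.setString(7, "val-" + (i % 50));
                    ps.execute();
                    if (i % 1000 == 0) {
                        conn.commit(); // send the batch to the region servers
                    }
                }
                conn.commit();
            }
        }
    }
}
{code}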

Questions/Comments:
# Does it make a difference whether the table is salted or not?
# Did you flush the table? (If not, much of the data might still reside in the memstore.)
# Did you major compact the table with the default compaction settings? (If so, the data is in a single HFile per region anyway, and there are no store files for HBase to skip.) A quick way to do both from the HBase Admin API is sketched after this list.
# You're not actually filtering on the timestamp in the count(*) query, so there is no time range for HBase to prune by.
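For points 2 and 3, something like this would flush and major-compact both tables (a minimal sketch against the HBase 1.2 client API, assuming the Phoenix-generated physical table names):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FlushAndCompact {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            for (String name : new String[] {"PHOENIX_DEV:TEST_TABLE1", "PHOENIX_DEV:TEST_TABLE2"}) {
                TableName table = TableName.valueOf(name);
                admin.flush(table);        // force memstore contents into HFiles
                admin.majorCompact(table); // asynchronous; rewrites all HFiles into one per store
            }
        }
    }
}
{code}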

It's still disconcerting that a flat scan is so much slower.
If you can share the loaded data (or your loader), I'll commit some time to run this through a profiler.

> Why is a Row timestamp column table slower than a regular table?
> ----------------------------------------------------------------
>
>                 Key: PHOENIX-5157
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5157
>             Project: Phoenix
>          Issue Type: Test
>    Affects Versions: 4.14.0
>         Environment: ||Software||Version||
> |Phoenix|4.14.0-cdh5.11.2|
> |HBase|1.2.0-cdh5.11.1|
> |Java|1.8.0_121|
> |Region Servers Number|4|
>            Reporter: wxmimperio
>            Priority: Major
>              Labels: test
>
> create regular table like this:
> {code:java}
> CREATE TABLE phoenix_dev.test_table1 (
> 	event_time TIMESTAMP NOT NULL /*event_time*/,
> 	"_KEY" BIGINT NOT NULL /*_key*/,
> 	col1 BIGINT /*col1*/,
> 	col2 BIGINT /*col2*/,
> 	col3 BIGINT /*col3*/,
> 	col4 BIGINT /*col4*/,
> 	col5 VARCHAR /*col5*/,
> 	CONSTRAINT pk PRIMARY KEY (event_time ASC, "_KEY" ASC)
> ) COMPRESSION = SNAPPY,
>  SALT_BUCKETS = 4,
>  VERSIONS = 1
> {code}
> create Row timestamp table like this:
> {code:java}
> CREATE TABLE phoenix_dev.test_table2 (
> 	event_time TIMESTAMP NOT NULL /*event_time*/,
> 	"_KEY" BIGINT NOT NULL /*_key*/,
> 	col1 BIGINT /*col1*/,
> 	col2 BIGINT /*col2*/,
> 	col3 BIGINT /*col3*/,
> 	col4 BIGINT /*col4*/,
> 	col5 VARCHAR /*col5*/,
> 	CONSTRAINT pk PRIMARY KEY (event_time ASC ROW_TIMESTAMP, "_KEY" ASC)
> ) COMPRESSION = SNAPPY,
>  SALT_BUCKETS = 4,
>  VERSIONS = 1
> {code}
> Except for `event_time ASC ROW_TIMESTAMP` in the primary key, the two tables are identical, and the same 20 million rows were inserted into each.
> h2. *{color:#ff0000}Test SQL1:{color}*
> *Regular*
> {code:java}
> explain select count(*) from PHOENIX_DEV.test_table1;
> +-------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> |                                                       PLAN                                                        | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO_TS   |
> +-------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | CLIENT 72-CHUNK 18453404 ROWS 20132661420 BYTES PARALLEL 4-WAY FULL SCAN OVER PHOENIX_DEV:TEST_TABLE1  | 20132661420     | 18453404       | 1550646317452  |
> |     SERVER FILTER BY FIRST KEY ONLY                                                                               | 20132661420     | 18453404       | 1550646317452  |
> |     SERVER AGGREGATE INTO SINGLE ROW                                                                              | 20132661420     | 18453404       | 1550646317452  |
> +-------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> select count(*) from PHOENIX_DEV.test_table1; 
> +-----------+
> | COUNT(1) |
> +-----------+
> | 20000000 |
> +-----------+
> {code}
> {color:#d04437}cost:5.267 seconds{color}
> *ROW_TIMESTAMP:*
> {code:java}
> explain select count(*) from PHOENIX_DEV.test_table2;
> +--------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> |                                                              PLAN                                                              | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO_TS   |
> +--------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | CLIENT 58-CHUNK 14416723 ROWS 15728641792 BYTES PARALLEL 4-WAY FULL SCAN OVER PHOENIX_DEV:TEST_TABLE2  | 15728641792     | 14416723       | 1550648443256  |
> |     ROW TIMESTAMP FILTER [0, 9223372036854775807)                                                                              | 15728641792     | 14416723       | 1550648443256  |
> |     SERVER FILTER BY FIRST KEY ONLY                                                                                            | 15728641792     | 14416723       | 1550648443256  |
> |     SERVER AGGREGATE INTO SINGLE ROW                                                                                           | 15728641792     | 14416723       | 1550648443256  |
> +--------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> select count(*) from PHOENIX_DEV.test_table2; 
> +-----------+
> | COUNT(1) |
> +-----------+
> | 20000000 |
> +-----------+
> {code}
> {color:#d04437}cost:20.896 seconds{color}
> h2. Test SQL2:
> *Regular*
> {code:java}
> explain select count(*),col1,CAST(floor(event_time, 'MINUTE', 1) AS timestamp) AS log_time from PHOENIX_DEV.TEST1 where event_time >= to_timestamp('2018-10-21 00:00:00') AND event_time <= to_timestamp('2018-11-21 00:00:00') and col2 = 'ddd' AND col1 = 'fff' group by col1,log_time order by log_time desc limit 5000;
> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | CLIENT 48-CHUNK 12686716 ROWS 13841204812 BYTES PARALLEL 48-WAY REVERSE RANGE SCAN OVER PHOENIX_DEV:TEST1 [0,'2018-10-20 16:00:00.000'] - [3,'2018-11-20 16:00:00.000'] | 13841204812 | 12686716 | 1550646317452 |
> | SERVER FILTER BY (COL2 = 'ddd' AND COL1 = 'fff') | 13841204812 | 12686716 | 1550646317452 |
> | SERVER AGGREGATE INTO DISTINCT ROWS BY [TO_TIMESTAMP(FLOOR(TO_DATE(EVENT_TIME))), COL1] | 13841204812 | 12686716 | 1550646317452 |
> | CLIENT MERGE SORT | 13841204812 | 12686716 | 1550646317452 |
> | CLIENT 5000 ROW LIMIT | 13841204812 | 12686716 | 1550646317452 |
> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> select count(*),col1,CAST(floor(event_time, 'MINUTE', 1) AS timestamp) AS log_time from PHOENIX_DEV.TEST1 where event_time >= to_timestamp('2018-10-21 00:00:00') AND event_time <= to_timestamp('2018-11-21 00:00:00') and col2 = 'ddd' AND col1 = 'fff' group by col1,log_time order by log_time desc limit 5000;
> {code}
> {color:#d04437}cost:5.714 seconds{color}
> *ROW_TIMESTAMP:*
> {code:java}
> explain select count(*),col1,CAST(floor(event_time, 'MINUTE', 1) AS timestamp) AS log_time from PHOENIX_DEV.TEST2 where event_time >= to_timestamp('2018-10-21 00:00:00') AND event_time <= to_timestamp('2018-11-21 00:00:00') and col2 = 'ddd' AND col1 = 'fff' group by col1,log_time order by log_time desc limit 5000;
> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | PLAN | EST_BYTES_READ | EST_ROWS_READ | EST_INFO_TS |
> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | CLIENT 48-CHUNK 12686716 ROWS 13841204812 BYTES PARALLEL 48-WAY REVERSE RANGE SCAN OVER PHOENIX_DEV:TEST2 [0,'2018-10-20 16:00:00.000'] - [3,'2018-11-20 16:00:00.000'] | 13841204812 | 12686716 | 1550648443256 |
> | ROW TIMESTAMP FILTER [1540051200000, 1542729600001) | 13841204812 | 12686716 | 1550648443256 |
> | SERVER FILTER BY (col2 = 'ddd' AND col1 = 'fff') | 13841204812 | 12686716 | 1550648443256 |
> | SERVER AGGREGATE INTO DISTINCT ROWS BY [TO_TIMESTAMP(FLOOR(TO_DATE(EVENT_TIME))), col1] | 13841204812 | 12686716 | 1550648443256 |
> | CLIENT MERGE SORT | 13841204812 | 12686716 | 1550648443256 |
> | CLIENT 5000 ROW LIMIT | 13841204812 | 12686716 | 1550648443256 |
> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> select count(*),col1,CAST(floor(event_time, 'MINUTE', 1) AS timestamp) AS log_time from PHOENIX_DEV.TEST2 where event_time >= to_timestamp('2018-10-21 00:00:00') AND event_time <= to_timestamp('2018-11-21 00:00:00') and col2 = 'ddd' AND col1 = 'fff' group by col1,log_time order by log_time desc limit 5000;
> {code}
> {color:#d04437}cost:19.391 seconds{color}
> {color:#ff0000}According to the row timestamp documentation: [http://phoenix.apache.org/rowtimestamp.html]{color}
> {color:#cccccc}With the help of this time range information, HBase on the server side is able to entirely skip those store files which do not fall in the time range. This greatly improves performance especially when querying the tail-end of the data.{color}
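> The {{ROW TIMESTAMP FILTER}} lines in the plans above are that time range: [0, 9223372036854775807) for SQL1 (no predicate, so no store file can be skipped) and [1540051200000, 1542729600001) for SQL2. At the raw HBase client level the mechanism is roughly the following; a sketch only, with the millisecond bounds copied from the SQL2 plan (Phoenix sets this internally):
> {code:java}
> import org.apache.hadoop.hbase.client.Scan;
> 
> public class TimeRangeScan {
>     public static void main(String[] args) throws Exception {
>         // Millisecond bounds from the SQL2 plan: [1540051200000, 1542729600001).
>         Scan scan = new Scan();
>         scan.setTimeRange(1540051200000L, 1542729600001L);
>         // Region servers consult each HFile's time-range tracker and skip files
>         // whose cell timestamps lie entirely outside this range.
>         System.out.println(scan);
>     }
> }
> {code}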
> h1. *Why is the query slower after adding row timestamp?*


