You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@phoenix.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2018/10/10 18:55:00 UTC
[jira] [Commented] (PHOENIX-4336) SELECT + ORDER BY performance much worse than self-join

    [ https://issues.apache.org/jira/browse/PHOENIX-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645413#comment-16645413 ] 

Lars Hofhansl commented on PHOENIX-4336:
----------------------------------------

Just came here looking for something else.
HBase now the ROW_INDEX_V1 encoder. FAST_DIFF is not good for random GETs (done when we retrieve the remainder of the row), I'd recommend changing the encoding to ROW_INDEX_V1 and using block compression (zSTD is a great compromise between compression and speed)

> SELECT + ORDER BY performance much worse than self-join
> -------------------------------------------------------
>
>                 Key: PHOENIX-4336
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4336
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Marcin Januszkiewicz
>            Priority: Major
>
> We have a wide table with 100M records created with the following DDL:
> {code:sql}
> CREATE TABLE traces (
>   rowkey VARCHAR PRIMARY KEY,
>   time VARCHAR,
>   number VARCHAR,
>   +40 more columns)
> CREATE LOCAL INDEX ix_0 ON traces (UPPER(number)) INCLUDE (time, + some other columns used for filtering)
> {code}
> We want to select into a large (~30M records) subset of this data with the query:
> {code:sql}
> SELECT *all columns*
>   FROM traces
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101;
> {code}
> {noformat}
> +-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> |                                                                PLAN                                                                 | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO_TS   |
> +-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> | CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY RANGE SCAN OVER               TRACES       [1,'PO '] - [1,'PO!']  | 73610048115     | 39850892       | 1509102519122  |
> |     SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]                                                                        | 73610048115     | 39850892       | 1509102519122  |
> | CLIENT MERGE SORT                                                                                                                   | 73610048115     | 39850892       | 1509102519122  |
> | CLIENT LIMIT 101                                                                                                                    | 73610048115     | 39850892       | 1509102519122  |
> +-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> {noformat}
> This times out after 15 minutes and puts a huge load on our cluster.
> We have an alternate way of selecting this data:
> {code:sql}
> SELECT t.rowkey, *all columns*
> FROM TRACES t
> JOIN (
>   SELECT rowkey
>   FROM TRACES
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101
> ) ix
> ON t.ROWKEY = ix.ROWKEY
> order by t.ROWKEY;
> {code}
> {noformat}
> +---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
> |                                                                    PLAN                                                                     | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO |
> +---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
> | CLIENT 1500-CHUNK 97154640 ROWS 462422573830 BYTES PARALLEL 5-WAY FULL SCAN OVER               TRACES                                       | 73610048115     | 39850892       | 150910251 |
> | CLIENT MERGE SORT                                                                                                                           | 73610048115     | 39850892       | 150910251 |
> |     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)                                                                                                | 73610048115     | 39850892       | 150910251 |
> |         CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY RANGE SCAN OVER               TRACES       [1,'PO '] - [1,'PO!']  | 73610048115     | 39850892       | 150910251 |
> |             SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]                                                                        | 73610048115     | 39850892       | 150910251 |
> |         CLIENT MERGE SORT                                                                                                                   | 73610048115     | 39850892       | 150910251 |
> |         CLIENT LIMIT 101                                                                                                                    | 73610048115     | 39850892       | 150910251 |
> |     DYNAMIC SERVER FILTER BY T.ROWKEY IN (IX.ROWKEY)                                                                                        | 73610048115     | 39850892       | 150910251 |
> +---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
> {noformat}
> Which completes in just under a minute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)