You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@phoenix.apache.org by "Marcin Januszkiewicz (JIRA)" <ji...@apache.org> on 2017/10/31 07:58:00 UTC
[jira] [Created] (PHOENIX-4336) SELECT + ORDER BY performance much worse than self-join

Marcin Januszkiewicz created PHOENIX-4336:
---------------------------------------------

             Summary: SELECT + ORDER BY performance much worse than self-join
                 Key: PHOENIX-4336
                 URL: https://issues.apache.org/jira/browse/PHOENIX-4336
             Project: Phoenix
          Issue Type: Bug
            Reporter: Marcin Januszkiewicz


We have a wide table with 100M records created with the following DDL:

{code:sql}
CREATE TABLE traces (
  rowkey VARCHAR PRIMARY KEY,
  time VARCHAR,
  number VARCHAR,
  +40 more columns)

CREATE LOCAL INDEX ix_0 ON traces (UPPER(number)) INCLUDE (time, + some other columns used for filtering)
{code}

We want to select into a large (~30M records) subset of this data with the query:

{code:sql}
SELECT *all columns*
  FROM traces
  WHERE (UPPER(number) LIKE 'PO %')
  ORDER BY time DESC, ROWKEY
  LIMIT 101;
{code}


{noformat}
+-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
|                                                                PLAN                                                                 | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO_TS   |
+-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
| CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY RANGE SCAN OVER               TRACES       [1,'PO '] - [1,'PO!']  | 73610048115     | 39850892       | 1509102519122  |
|     SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]                                                                        | 73610048115     | 39850892       | 1509102519122  |
| CLIENT MERGE SORT                                                                                                                   | 73610048115     | 39850892       | 1509102519122  |
| CLIENT LIMIT 101                                                                                                                    | 73610048115     | 39850892       | 1509102519122  |
+-------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
{noformat}

This times out after 15 minutes and puts a huge load on our cluster.
We have an alternate way of selecting this data:

{code:sql}
SELECT t.rowkey, *all columns*
FROM TRACES t
JOIN (
  SELECT rowkey
  FROM TRACES
  WHERE (UPPER(number) LIKE 'PO %')
  ORDER BY time DESC, ROWKEY
  LIMIT 101
) ix
ON t.ROWKEY = ix.ROWKEY
order by t.ROWKEY;
{code}

{noformat}
+---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
|                                                                    PLAN                                                                     | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO |
+---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
| CLIENT 1500-CHUNK 97154640 ROWS 462422573830 BYTES PARALLEL 5-WAY FULL SCAN OVER               TRACES                                       | 73610048115     | 39850892       | 150910251 |
| CLIENT MERGE SORT                                                                                                                           | 73610048115     | 39850892       | 150910251 |
|     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE)                                                                                                | 73610048115     | 39850892       | 150910251 |
|         CLIENT 234-CHUNK 39850892 ROWS 73610048115 BYTES PARALLEL 234-WAY RANGE SCAN OVER               TRACES       [1,'PO '] - [1,'PO!']  | 73610048115     | 39850892       | 150910251 |
|             SERVER TOP 101 ROWS SORTED BY [cf."time" DESC, "ROWKEY"]                                                                        | 73610048115     | 39850892       | 150910251 |
|         CLIENT MERGE SORT                                                                                                                   | 73610048115     | 39850892       | 150910251 |
|         CLIENT LIMIT 101                                                                                                                    | 73610048115     | 39850892       | 150910251 |
|     DYNAMIC SERVER FILTER BY T.ROWKEY IN (IX.ROWKEY)                                                                                        | 73610048115     | 39850892       | 150910251 |
+---------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------+-----------+
{noformat}

Which completes in just under a minute.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)