You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Aditya Allamraju (Jira)" <ji...@apache.org> on 2020/02/04 00:43:00 UTC

[jira] [Created] (DRILL-7566) Performance of Common Table Expression query-15

Aditya Allamraju created DRILL-7566:
---------------------------------------

             Summary: Performance of Common Table Expression query-15
                 Key: DRILL-7566
                 URL: https://issues.apache.org/jira/browse/DRILL-7566
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning &amp; Optimization
    Affects Versions: 1.16.0
            Reporter: Aditya Allamraju


Observed that the WITH clause is materializing the computation it did as many times as it is being referred in the main query. The purpose of CTE is defeated here.

For instance, Query-15 of TPC:
{code:java}
. . . . . . . . . . . . . . )> WITH revenue0(supplier_no , total_revenue) AS (
. . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
. . . . . . . . . . . . . . )>     FROM lineitem
. . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
. . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
. . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
. . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
. . . . . . . . . . semicolon> FROM supplier, revenue0
. . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
. . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
. . . . . . . . . . semicolon> ORDER BY s_suppkey;
+-----------+--------------------+-------------------------------+-----------------+---------------+
| s_suppkey |       s_name       |           s_address           |     s_phone     | total_revenue |
+-----------+--------------------+-------------------------------+-----------------+---------------+
| 493       | Supplier#000000493 | 7tdI3AtlDll57sj5K48WLX j5RDbc | 21-252-702-2543 | 1779637.1723  |
+-----------+--------------------+-------------------------------+-----------------+---------------+
1 row selected (9.093 seconds)
apache drill (hive.tpch_text)>

{code}
I just performed the above on a small subset. You can see the behavior from the explain plan and query profile. But usually, CTE based queries are run on huge tables like in users case(each table running into few TB's!).

Explain plan for above query:
{code:java}


apache drill (hive.tpch_text)> explain plan for 
. . . . . . . . . . semicolon> WITH revenue0(supplier_no , total_revenue) AS (
. . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
. . . . . . . . . . . . . . )>     FROM lineitem
. . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
. . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
. . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
. . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
. . . . . . . . . . semicolon> FROM supplier, revenue0
. . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
. . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
. . . . . . . . . . semicolon> ORDER BY s_suppkey;
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|                                       text                                       |                                       json                                       |
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| 00-00    Screen
00-01      Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$4])
00-02        SelectionVectorRemover
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$5])
00-05              Project(s_suppkey=[$3], s_name=[$4], s_address=[$5], s_phone=[$6], l_suppkey=[$0], EXPR$1=[$1], EXPR$0=[$2])
00-06                HashJoin(condition=[=($3, $0)], joinType=[inner], semi-join: =[false])
00-08                  HashJoin(condition=[=($1, $2)], joinType=[inner], semi-join: =[false])
00-10                    HashAgg(group=[{0}], EXPR$1=[SUM($1)])
00-12                      Project(l_suppkey=[$0], EXPR$1=[$1])
00-14                        HashToRandomExchange(dist0=[[$0]])
01-01                          UnorderedMuxExchange
03-01                            Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
03-02                              HashAgg(group=[{0}], EXPR$1=[SUM($1)])
03-03                                Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
03-04                                  SelectionVectorRemover
03-05                                    Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
03-06                                      Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
00-09                    StreamAgg(group=[{}], EXPR$0=[MAX($0)])
00-11                      Project(EXPR$1=[$1])
00-13                        HashAgg(group=[{0}], EXPR$1=[SUM($1)])
00-15                          Project(l_suppkey=[$0], EXPR$1=[$1])
00-16                            HashToRandomExchange(dist0=[[$0]])
02-01                              UnorderedMuxExchange
04-01                                Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
04-02                                  HashAgg(group=[{0}], EXPR$1=[SUM($1)])
04-03                                    Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
04-04                                      SelectionVectorRemover
04-05                                        Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
04-06                                          Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
00-07                  Scan(table=[[hive, tpch_text, supplier]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:supplier), columns=[`s_suppkey`, `s_name`, `s_address`, `s_phone`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/supplier], confProperties={}]])
{code}
 

Observe Step 03-06 and 04-06. One of them could be avoided.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)