You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Vova Vysotskyi (Jira)" <ji...@apache.org> on 2020/02/04 09:54:00 UTC

[jira] [Commented] (DRILL-7566) Performance of Common Table Expression query-15

    [ https://issues.apache.org/jira/browse/DRILL-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029700#comment-17029700 ] 

Vova Vysotskyi commented on DRILL-7566:
---------------------------------------

Drill handles CTE in the same way as regular views - it is treated a part of the query and is substituted in the place where it is specified.

The behavior you expect may be reached using temporary tables instead of CTE if data is unchanged.

Or, since it is used on top of Hive tables, Drill supports querying Hive materialized views ([~ihuzenko], please confirm and point out to Drill version where it was added), so you can create it in Hive and use in Drill queries.

By the way, Postgres some time ago changed default behavior to the same as Drill's one: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=608b167f9f9c4553c35bb1ec0eab9ddae643989b

> Performance of Common Table Expression query-15
> -----------------------------------------------
>
>                 Key: DRILL-7566
>                 URL: https://issues.apache.org/jira/browse/DRILL-7566
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning &amp; Optimization
>    Affects Versions: 1.16.0
>            Reporter: Aditya Allamraju
>            Priority: Major
>
> Observed that the WITH clause is materializing the computation it did as many times as it is being referred in the main query. The purpose of CTE is defeated here.
> For instance, Query-15 of TPC:
> {code:java}
> . . . . . . . . . . . . . . )> WITH revenue0(supplier_no , total_revenue) AS (
> . . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
> . . . . . . . . . . . . . . )>     FROM lineitem
> . . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
> . . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
> . . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
> . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
> . . . . . . . . . . semicolon> FROM supplier, revenue0
> . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
> . . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
> . . . . . . . . . . semicolon> ORDER BY s_suppkey;
> +-----------+--------------------+-------------------------------+-----------------+---------------+
> | s_suppkey |       s_name       |           s_address           |     s_phone     | total_revenue |
> +-----------+--------------------+-------------------------------+-----------------+---------------+
> | 493       | Supplier#000000493 | 7tdI3AtlDll57sj5K48WLX j5RDbc | 21-252-702-2543 | 1779637.1723  |
> +-----------+--------------------+-------------------------------+-----------------+---------------+
> 1 row selected (9.093 seconds)
> apache drill (hive.tpch_text)>
> {code}
> I just performed the above on a small subset. You can see the behavior from the explain plan and query profile. But usually, CTE based queries are run on huge tables like in users case(each table running into few TB's!).
> Explain plan for above query:
> {code:java}
> apache drill (hive.tpch_text)> explain plan for 
> . . . . . . . . . . semicolon> WITH revenue0(supplier_no , total_revenue) AS (
> . . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
> . . . . . . . . . . . . . . )>     FROM lineitem
> . . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
> . . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
> . . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
> . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
> . . . . . . . . . . semicolon> FROM supplier, revenue0
> . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
> . . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
> . . . . . . . . . . semicolon> ORDER BY s_suppkey;
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> |                                       text                                       |                                       json                                       |
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> | 00-00    Screen
> 00-01      Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$4])
> 00-02        SelectionVectorRemover
> 00-03          Sort(sort0=[$0], dir0=[ASC])
> 00-04            Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$5])
> 00-05              Project(s_suppkey=[$3], s_name=[$4], s_address=[$5], s_phone=[$6], l_suppkey=[$0], EXPR$1=[$1], EXPR$0=[$2])
> 00-06                HashJoin(condition=[=($3, $0)], joinType=[inner], semi-join: =[false])
> 00-08                  HashJoin(condition=[=($1, $2)], joinType=[inner], semi-join: =[false])
> 00-10                    HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 00-12                      Project(l_suppkey=[$0], EXPR$1=[$1])
> 00-14                        HashToRandomExchange(dist0=[[$0]])
> 01-01                          UnorderedMuxExchange
> 03-01                            Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
> 03-02                              HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 03-03                                Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
> 03-04                                  SelectionVectorRemover
> 03-05                                    Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
> 03-06                                      Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
> 00-09                    StreamAgg(group=[{}], EXPR$0=[MAX($0)])
> 00-11                      Project(EXPR$1=[$1])
> 00-13                        HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 00-15                          Project(l_suppkey=[$0], EXPR$1=[$1])
> 00-16                            HashToRandomExchange(dist0=[[$0]])
> 02-01                              UnorderedMuxExchange
> 04-01                                Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
> 04-02                                  HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 04-03                                    Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
> 04-04                                      SelectionVectorRemover
> 04-05                                        Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
> 04-06                                          Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
> 00-07                  Scan(table=[[hive, tpch_text, supplier]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:supplier), columns=[`s_suppkey`, `s_name`, `s_address`, `s_phone`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/supplier], confProperties={}]])
> {code}
>  
> Observe Step 03-06 and 04-06. One of them could be avoided.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)