Posted to dev@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2017/09/07 14:21:00 UTC

[jira] [Created] (DRILL-5773) Project pushdown into a subquery with select *

Jinfeng Ni created DRILL-5773:
---------------------------------

             Summary: Project pushdown into a subquery with select *
                 Key: DRILL-5773
                 URL: https://issues.apache.org/jira/browse/DRILL-5773
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Jinfeng Ni


If a subquery, table expression, or view uses `select *` and the outer query requests only a subset of columns/fields, Drill currently does not push the project down into the subquery. As a result, the scan operator returns every column/field in the table. This can significantly hurt query performance, especially when the number of columns/fields is large.

For instance,
{code}
SELECT n_regionkey, count(*) AS cnt 
FROM (SELECT * FROM cp.`tpch/nation.parquet`) AS n 
GROUP BY n_regionkey;
{code} 

Here is the plan:
{code}
00-00    Screen
00-01      Project(n_regionkey=[$0], cnt=[$1])
00-02        Project(n_regionkey=[$0], cnt=[$1])
00-03          HashAgg(group=[{0}], cnt=[COUNT()])
00-04            Project(n_regionkey=[ITEM($0, 'n_regionkey')])
00-05              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]], selectionRoot=classpath:/tpch/nation.parquet, numFiles=1, usedMetadataFile=false, columns=[`*`]]])
{code}

Notice that in the Scan operator, `columns = *`, indicating that it will read every column.

From a performance perspective, Drill should push the project into a subquery that uses select *.
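
To illustrate the intended effect, here is a hand-written rewrite of the query above in which the subquery names only the column the outer query needs. This is a sketch of what the pushdown would be logically equivalent to, not output produced by Drill, and it has not been run as part of this report.
{code}
-- Manual equivalent of the desired pushdown: the subquery projects
-- only n_regionkey instead of using select *.
SELECT n_regionkey, count(*) AS cnt
FROM (SELECT n_regionkey FROM cp.`tpch/nation.parquet`) AS n
GROUP BY n_regionkey;
{code}

With this form, the expectation is that the Scan operator would list only `n_regionkey` in its columns rather than `*`. The improvement requested here is for the planner to reach that result on its own when the subquery is written with select *.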
 


