Posted to issues@drill.apache.org by "Boaz Ben-Zvi (JIRA)" <ji...@apache.org> on 2019/01/29 04:06:00 UTC

[jira] [Updated] (DRILL-7012) Make SelectionVectorRemover project only the needed columns

     [ https://issues.apache.org/jira/browse/DRILL-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boaz Ben-Zvi updated DRILL-7012:
--------------------------------
    Description: 
   A SelectionVectorRemover is often used after a filter, to copy only the rows that passed the filter into a newly allocated batch. In some cases the columns used by the filter are not needed downstream; currently these columns are needlessly allocated and copied, and only later removed by a Project.

  _+Suggested improvement+_: The planner can pass the information about these columns to the SelectionVectorRemover, which would then skip this needless allocation and copy. The planner would also eliminate that Project from the plan.

   Here is an example query:
{code:sql}
select max(l_quantity) from cp.`tpch/lineitem.parquet` L where L.l_orderkey > 58999 and L.l_shipmode = 'TRUCK' group by l_linenumber ;
{code}
And the resulting plan (trimmed for readability), where "l_orderkey" and "l_shipmode" are removed by the Project:
{noformat}
00-00 Screen : rowType = RecordType(ANY EXPR$0): 
 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): 
 00-02 Project(EXPR$0=[$1]) : rowType = RecordType(ANY EXPR$0): 
 00-03 HashAgg(group=[{0}], EXPR$0=[MAX($1)]) : rowType = RecordType(ANY l_linenumber, ANY EXPR$0): 
 00-04 *Project*(l_linenumber=[$2], l_quantity=[$3]) : rowType = RecordType(ANY l_linenumber, ANY l_quantity): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(ANY *l_orderkey*, ANY *l_shipmode*, ANY l_linenumber, ANY l_quantity): 
 00-06 *Filter*(condition=[AND(>($0, 58999), =($1, 'TRUCK'))]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity): 
 00-07 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], selectionRoot=classpath:/tpch/lineitem.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`l_orderkey`, `l_shipmode`, `l_linenumber`, `l_quantity`]]]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity):
{noformat}
The implementation will not be simple, as the relevant code (e.g., GenericSV2Copier) has no knowledge of which specific columns it is copying.
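
As a rough illustration of the suggestion above (this is not Drill code): the sketch below copies the selected rows of a batch while skipping every column that is not needed downstream. The class PruningCopierSketch, the method copySelected, the map-based batch and the sample values are all hypothetical stand-ins; the real copiers (e.g., GenericSV2Copier) operate on value vectors and, as noted, currently receive no column information from the planner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: a batch is a map of column name to column values (hypothetical, not Drill's classes). */
final class PruningCopierSketch {

    /**
     * Copies the rows listed in `selection` (a stand-in for an SV2) into a new batch,
     * but only for the columns named in `needed` (assumed to be supplied by the planner).
     */
    static Map<String, List<Object>> copySelected(
            Map<String, List<Object>> incoming,
            int[] selection,
            Set<String> needed) {
        Map<String, List<Object>> outgoing = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> column : incoming.entrySet()) {
            if (!needed.contains(column.getKey())) {
                continue; // filter-only column: neither allocated nor copied
            }
            List<Object> source = column.getValue();
            List<Object> target = new ArrayList<>(selection.length);
            for (int rowIndex : selection) {
                target.add(source.get(rowIndex));
            }
            outgoing.put(column.getKey(), target);
        }
        return outgoing;
    }

    public static void main(String[] args) {
        // Made-up sample rows; rows 1 and 3 satisfy l_orderkey > 58999 AND l_shipmode = 'TRUCK'.
        Map<String, List<Object>> batch = new LinkedHashMap<>();
        batch.put("l_orderkey", Arrays.<Object>asList(58000, 59001, 59500, 60000));
        batch.put("l_shipmode", Arrays.<Object>asList("TRUCK", "TRUCK", "MAIL", "TRUCK"));
        batch.put("l_linenumber", Arrays.<Object>asList(1, 2, 3, 4));
        batch.put("l_quantity", Arrays.<Object>asList(10.0, 20.0, 30.0, 40.0));

        int[] selection = {1, 3};
        Set<String> needed = new HashSet<>(Arrays.asList("l_linenumber", "l_quantity"));

        // prints {l_linenumber=[2, 4], l_quantity=[20.0, 40.0]}
        System.out.println(copySelected(batch, selection, needed));
    }
}
```

Here the selection {1, 3} plays the role of the SV2 produced by the Filter, and only l_linenumber and l_quantity are materialized, so in this toy setting the Project at 00-04 would have nothing left to remove.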

  was:
   A SelectionVectorRemover is often used after a filter, to copy only the rows that passed the filter into a newly allocated batch. In some cases the columns used by the filter are not needed downstream; currently these columns are needlessly allocated and copied, and only later removed by a Project.

  _+Suggested improvement+_: The planner can pass the information about these columns to the SelectionVectorRemover, which would then skip this needless allocation and copy. The planner would also eliminate that Project from the plan.

   Here is an example query:
{code:java}
select max(l_quantity) from cp.`tpch/lineitem.parquet` L where L.l_orderkey > 58999 and L.l_shipmode = 'TRUCK' group by l_linenumber ;
{code}
And the resulting plan (trimmed for readability), where "l_orderkey" and "l_shipmode" are removed by the Project:
{noformat}
00-00 Screen : rowType = RecordType(ANY EXPR$0): 
 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(ANY EXPR$0): 
 00-02 Project(EXPR$0=[$1]) : rowType = RecordType(ANY EXPR$0): 
 00-03 HashAgg(group=[{0}], EXPR$0=[MAX($1)]) : rowType = RecordType(ANY l_linenumber, ANY EXPR$0): 
 00-04 *Project*(l_linenumber=[$2], l_quantity=[$3]) : rowType = RecordType(ANY l_linenumber, ANY l_quantity): 
 00-05 *SelectionVectorRemover* : rowType = RecordType(ANY *l_orderkey*, ANY *l_shipmode*, ANY l_linenumber, ANY l_quantity): 
 00-06 *Filter*(condition=[AND(>($0, 58999), =($1, 'TRUCK'))]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity): 
 00-07 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]], selectionRoot=classpath:/tpch/lineitem.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`l_orderkey`, `l_shipmode`, `l_linenumber`, `l_quantity`]]]) : rowType = RecordType(ANY l_orderkey, ANY l_shipmode, ANY l_linenumber, ANY l_quantity):
{noformat}
The implementation will not be simple, as the relevant code (e.g., GenericSV2Copier) has no knowledge of which specific columns it is copying.


> Make SelectionVectorRemover project only the needed columns
> -----------------------------------------------------------
>
>                 Key: DRILL-7012
>                 URL: https://issues.apache.org/jira/browse/DRILL-7012
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators, Query Planning & Optimization
>    Affects Versions: 1.15.0
>            Reporter: Boaz Ben-Zvi
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)