Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/08/01 19:40:00 UTC

[jira] [Commented] (DRILL-6101) Optimize Implicit Columns Processing

    [ https://issues.apache.org/jira/browse/DRILL-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565874#comment-16565874 ] 

ASF GitHub Bot commented on DRILL-6101:
---------------------------------------

sachouche opened a new pull request #1414: DRILL-6101: Optimized implicit columns handling within scanner
URL: https://github.com/apache/drill/pull/1414
 
 
   Problem Description -
   
   File-based implicit columns are projected only if explicitly requested within the query
   Note that Partition Columns are not included in this discussion (it refers only to FILENAME, FILEPATH, FQN, and SUFFIX)
   The scanner operator is called with three sets of columns to handle: Table Columns, Partition Columns, and Implicit Columns
   When a SELECT_STAR is used, the operator doesn't receive the original query selection (only '**' is received)
   This behavior forces the Scanner operator to project all file-based Implicit Columns, only for them to be filtered out later by the Project Operator
   Performance tests indicate this behavior introduces a 30% degradation within the scanning phase for some TPCH queries (the degradation is larger for tables with long paths)
   Fix -
   
   The code uses a utility to determine whether a selection is a STAR_QUERY; this utility expects a list of columns and attempts to detect the presence of the STAR selection keyword
   Modified the code to include all selection columns (including the ones in the WHERE clause)
   This allows the execution layer to invoke the Scan operator with the correct implicit columns (the ones explicitly listed within the query), which addresses this performance issue
   Note that readers are not impacted by the newly added metadata, as the reader code doesn't use the columns list when a STAR_QUERY is involved
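
   The idea behind the fix can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class and method names (ImplicitColumnSketch, isStarQuery, implicitColumnsToProject) are hypothetical, and it assumes the scanner receives the selection as a plain list of column names in which '**' marks a star selection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; none of these names come from the Drill code base.
class ImplicitColumnSketch {
    // File-based implicit columns named in the description above.
    static final List<String> FILE_IMPLICIT_COLUMNS =
        Arrays.asList("FILENAME", "FILEPATH", "FQN", "SUFFIX");

    // Assumption: a star selection shows up in the list as "**".
    static boolean isStarQuery(List<String> columns) {
        return columns.contains("**");
    }

    // Return only the implicit columns explicitly listed in the selection,
    // instead of projecting all of them whenever a star is present.
    static List<String> implicitColumnsToProject(List<String> columns) {
        List<String> result = new ArrayList<>();
        for (String column : columns) {
            String upper = column.toUpperCase(Locale.ROOT);
            if (FILE_IMPLICIT_COLUMNS.contains(upper) && !result.contains(upper)) {
                result.add(upper);
            }
        }
        return result;
    }
}
```

   With only '**' in the list, this sketch projects no file-based implicit columns; with '**' plus an explicitly listed implicit column, it projects just that one, while the list is still recognized as a star query.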



> Optimize Implicit Columns Processing
> ------------------------------------
>
>                 Key: DRILL-6101
>                 URL: https://issues.apache.org/jira/browse/DRILL-6101
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.12.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Critical
>
> Problem Description -
>  * Apache Drill allows users to specify columns even for SELECT STAR queries
>  * From my discussion with [~paul-rogers], Apache Calcite has a limitation where the extra columns are not provided
>  * The workaround has been to always include all implicit columns for SELECT STAR queries
>  * Unfortunately, the current implementation is very inefficient as implicit column values get duplicated; this leads to substantial performance degradation when the number of rows is large
> Suggested Optimization -
>  * The NullableVarChar vector should be enhanced to efficiently store duplicate values
>  * This will not only address the current Calcite limitations (for SELECT STAR queries) but also optimize all queries with implicit columns
>  
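
One way to picture the suggested optimization is a run-length layout, where a value repeated over many consecutive rows is stored once. The sketch below is only an illustration under that assumption: RunLengthColumnSketch is a hypothetical class, not Drill's NullableVarChar vector, and the issue does not say which encoding would actually be used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of a column that stores each run of equal values once.
class RunLengthColumnSketch {
    private final List<String> runValues = new ArrayList<>();
    // Exclusive end row index of each run, in increasing order.
    private final List<Integer> runEnds = new ArrayList<>();
    private int rowCount = 0;

    // Append one row; extend the last run if the value (possibly null) repeats.
    void append(String value) {
        int last = runValues.size() - 1;
        if (last >= 0 && Objects.equals(runValues.get(last), value)) {
            runEnds.set(last, rowCount + 1);
        } else {
            runValues.add(value);
            runEnds.add(rowCount + 1);
        }
        rowCount++;
    }

    // Look up the value for a row with a binary search over the run ends.
    String get(int row) {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        int lo = 0;
        int hi = runEnds.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (runEnds.get(mid) > row) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return runValues.get(lo);
    }

    int rowCount() {
        return rowCount;
    }

    int storedValueCount() {
        return runValues.size();
    }
}
```

For an implicit column such as a file path, every row read from the same file carries the same value, so a layout like this would hold one stored value per file rather than one per row.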


