Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/02/17 16:20:18 UTC

[jira] [Commented] (DRILL-4387) Improve execution side when it handles skipAll query

    [ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150631#comment-15150631 ] 

ASF GitHub Bot commented on DRILL-4387:
---------------------------------------

GitHub user jinfengni opened a pull request:

    https://github.com/apache/drill/pull/379

    DRILL-4387: GroupScan or ScanBatchCreator should not use star column …

    …in case of skipAll query.
    
    The skipAll query should be handled in RecordReader.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinfengni/incubator-drill DRILL-4387

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #379
    
----
commit 5c1edc42dcad6c3b5943424b9a8373cf6ff51753
Author: Jinfeng Ni <jn...@apache.org>
Date:   2016-02-12T22:18:59Z

    DRILL-4387: GroupScan or ScanBatchCreator should not use star column in case of skipAll query.
    
    The skipAll query should be handled in RecordReader.

----


> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>
>                 Key: DRILL-4387
>                 URL: https://issues.apache.org/jira/browse/DRILL-4387
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
>
>
> DRILL-4279 changed how the planner side and the RecordReader on the execution side handle skipAll queries. However, other places in the codebase still do not handle skipAll queries efficiently. In particular, GroupScan or ScanBatchCreator will replace a NULL or empty column list with the star column. This forces the execution side (RecordReader) to fetch all columns from the data source, which adds significant performance overhead to the SCAN operator.
> To improve Drill's performance, we should change those places as well, as follow-up work to DRILL-4279.
> One simple example of this problem is:
> {code}
>    SELECT DISTINCT substring(dir1, 5) FROM dfs.`/Path/To/ParquetTable`;
> {code}
> The query does not require any regular column from the Parquet file. However, ParquetRowGroupScan and ParquetScanBatchCreator will set the column list to the star column. If the table has dozens or hundreds of columns, this makes the SCAN operator much more expensive than necessary.
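
The difference between the two behaviors described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the code in the pull request: the class SkipAllSketch and its methods are hypothetical names, and the sketch assumes that an empty column list is what marks a skipAll query while a null list means no projection information was supplied.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not Drill's actual classes.
class SkipAllSketch {
    static final String STAR = "*";

    // Behavior described in the issue: a null OR empty column list is widened
    // to the star column, so the reader ends up fetching every column.
    static List<String> widenToStar(List<String> columns) {
        if (columns == null || columns.isEmpty()) {
            return Collections.singletonList(STAR);
        }
        return columns;
    }

    // Behavior the issue asks for (assumed shape): only a missing (null) list
    // falls back to the star column. An empty list is passed through unchanged
    // so that the reader can recognize the skipAll case itself.
    static List<String> preserveSkipAll(List<String> columns) {
        if (columns == null) {
            return Collections.singletonList(STAR);
        }
        return columns;
    }

    // Assumption: a non-null, empty column list denotes a skipAll query.
    static boolean isSkipAll(List<String> columns) {
        return columns != null && columns.isEmpty();
    }

    public static void main(String[] args) {
        List<String> noColumns = Collections.emptyList();
        System.out.println("widened:   " + widenToStar(noColumns));
        System.out.println("preserved: " + preserveSkipAll(noColumns));
        System.out.println("skipAll after widening:   " + isSkipAll(widenToStar(noColumns)));
        System.out.println("skipAll after preserving: " + isSkipAll(preserveSkipAll(noColumns)));
    }
}
```

With the first method the skipAll information is lost before the reader ever sees it, which is why the issue proposes leaving the decision to the RecordReader.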



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)