Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/09/27 00:49:20 UTC

[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

    [ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524649#comment-15524649 ] 

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

GitHub user ppadma opened a pull request:

    https://github.com/apache/drill/pull/597

    DRILL-4905: Push down the LIMIT to the parquet reader scan.

    For a LIMIT N query where N is less than the current default record batch size (256K rows when all columns are fixed-length, 32K otherwise), we still end up reading all 256K/32K rows from disk if the row group has that many rows. This degrades performance, especially when there are a large number of columns.
    This fix addresses the problem by changing the record batch size the Parquet record reader uses, so that we don't read more than is needed.
    It also adds a system option (store.parquet.record_batch_size) for setting the record batch size.
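
For illustration only, the batch-size capping described above might be sketched as follows. This is not the code from the pull request: the class name, method name, and parameters are hypothetical, "K" is assumed to mean 1024, and a non-positive limit is assumed to mean that no LIMIT was pushed down.

```java
// Hypothetical sketch of capping the reader's batch size by a pushed-down LIMIT.
// None of these names come from the Drill code base.
public final class BatchSizeSketch {
    // Defaults quoted in the description: 256K rows when all columns are
    // fixed-length, 32K otherwise (assuming K = 1024).
    static final long FIXED_WIDTH_DEFAULT = 256L * 1024;
    static final long VAR_WIDTH_DEFAULT = 32L * 1024;

    // Returns how many rows the reader would ask for in one batch.
    static long effectiveBatchSize(long limit, long rowGroupRows, boolean allFixedWidth) {
        long defaultBatch = allFixedWidth ? FIXED_WIDTH_DEFAULT : VAR_WIDTH_DEFAULT;
        // Never ask for more rows than the row group actually holds.
        long cap = Math.min(defaultBatch, rowGroupRows);
        // Assumption: limit <= 0 means "no LIMIT pushed down".
        return limit > 0 ? Math.min(limit, cap) : cap;
    }

    public static void main(String[] args) {
        // LIMIT 10 against a large row group: read 10 rows, not 256K.
        System.out.println(effectiveBatchSize(10, 1_000_000, true));
        // No LIMIT, variable-width columns: fall back to the 32K default.
        System.out.println(effectiveBatchSize(0, 1_000_000, false));
    }
}
```

With a cap like this, a LIMIT 10 query would request 10 rows per batch instead of the full default, which is the saving the description is after when the table has many columns.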

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ppadma/drill DRILL-4905

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #597
    
----
commit cd665ebdba11f8685ba446f5ec535c81ddd6edc7
Author: Padma Penumarthy <pp...@ppenumarthy-e653-mpr13.local>
Date:   2016-09-26T17:51:07Z

    DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the numbers of records read

----


> Push down the LIMIT to the parquet reader scan to limit the numbers of records read
> -----------------------------------------------------------------------------------
>
>                 Key: DRILL-4905
>                 URL: https://issues.apache.org/jira/browse/DRILL-4905
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>             Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to the Parquet reader.
> For queries like
> select * from <table> limit N; 
> where N < size of the Parquet row group, we currently read 32K/64K rows or the entire row group. This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)