Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/12/04 02:30:00 UTC
[jira] [Commented] (DRILL-5846) Improve Parquet Reader Performance for Flat Data types

    [ https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276227#comment-16276227 ] 

ASF GitHub Bot commented on DRILL-5846:
---------------------------------------

GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1060

    DRILL-5846: Improve parquet performance for Flat Data Types

    Performance improvements for the Parquet Scanner (Flat Data Types). There are two flags to control this performance enhancement (disabled by default):
    Option I -
    Config Name: store.parquet.flat.reader.bulk
    Config Type  : boolean
    Description   : Enables bulk processing to minimize memory checks and improve JVM HotSpot optimizations
    
    Option II -
    Config Name: scan.optimized.implicit.columns 
    Config Type  : boolean
    Description   : Memory optimization when storing duplicate values (implicit columns have duplicate values within a batch). Code profiling indicated this step represented one third of Parquet processing (when implicit columns are processed).
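
    Since both flags are disabled by default, they have to be switched on before the new code paths are exercised. The following is a minimal sketch, assuming the two options are exposed as ordinary boolean session options that can be set with Drill's ALTER SESSION SET statement; the class and method names are hypothetical helpers and are not part of this pull request. It only builds the statements to submit through a Drill client.

```java
public class ParquetOptionStatements {

    // Hypothetical helper: formats an ALTER SESSION SET statement for a boolean option.
    // Drill option names contain dots, so the name is quoted with backticks.
    static String alterSession(String option, boolean value) {
        return "ALTER SESSION SET `" + option + "` = " + value;
    }

    public static void main(String[] args) {
        // Option I: bulk processing in the flat Parquet reader
        System.out.println(alterSession("store.parquet.flat.reader.bulk", true));
        // Option II: optimized storage of implicit (duplicate-valued) columns
        System.out.println(alterSession("scan.optimized.implicit.columns", true));
    }
}
```

    Passing false to the same helper yields the statement that restores the default behaviour, which may be useful when comparing query timings with and without each flag.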


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill drill-parquet-improv

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1060.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1060
    
----
commit 0a3a8b053be85c570ff24237d4737f37668383bd
Author: Salim Achouche <sa...@gmail.com>
Date:   2017-12-04T01:53:08Z

    DRILL-5846: Improve parquet performance for Flat Data Types

----


> Improve Parquet Reader Performance for Flat Data types 
> -------------------------------------------------------
>
>                 Key: DRILL-5846
>                 URL: https://issues.apache.org/jira/browse/DRILL-5846
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>              Labels: performance
>             Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to further improve the Parquet Reader performance, as several users reported that Parquet parsing represents the lion's share of the overall query execution. It tracks Flat Data types only, as Nested DTs might involve functional and processing enhancements (e.g., a nested column can be seen as a Document; a user might want to perform operations scoped at the document level, that is, with no need to span all rows). Another JIRA will be created to handle the nested columns use-case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)