Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/03/13 22:47:41 UTC

[jira] [Commented] (DRILL-5351) Excessive bounds checking in the Parquet reader

    [ https://issues.apache.org/jira/browse/DRILL-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923111#comment-15923111 ] 

ASF GitHub Bot commented on DRILL-5351:
---------------------------------------

GitHub user parthchandra opened a pull request:

    https://github.com/apache/drill/pull/781

    DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader
    
    Two changes in var len vectors:
    1) Instead of checking on every setSafe call whether a realloc is needed, let the write fail and catch the exception. The exception, though expensive, occurs very rarely.
    2) Call fillEmpties only if there are empty values to fill.
    Both changes remove per-call overhead from setSafe, saving a substantial amount of CPU.
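The "write first, catch the overflow" approach described in change 1) can be sketched as follows. This is a simplified illustration of the pattern only, not Drill's actual vector code; the class and its internals are hypothetical:

```java
import java.util.Arrays;

// Sketch of the optimistic-write pattern: skip the capacity pre-check on
// every call, attempt the write directly, and reallocate only on the rare
// occasion that the write actually overflows the buffer.
public class OptimisticVector {
    private byte[] buf = new byte[8];
    private int writeIndex = 0;

    public void setSafe(byte[] value) {
        while (true) {
            try {
                // Common path: no bounds pre-check, just write.
                System.arraycopy(value, 0, buf, writeIndex, value.length);
                writeIndex += value.length;
                return;
            } catch (IndexOutOfBoundsException e) {
                // Rare path: grow the buffer and retry. arraycopy validates
                // bounds before copying, so the buffer is left unmodified
                // when the exception is thrown.
                buf = Arrays.copyOf(buf, Math.max(buf.length * 2,
                                                  writeIndex + value.length));
            }
        }
    }

    public int size() { return writeIndex; }
}
```

The exception path costs far more than a single comparison, but with doubling it is hit only O(log n) times over n writes, so eliminating the per-call check wins overall.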

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/parthchandra/drill DRILL-5351

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #781
    
----
commit 57869496526a43351575d0f4879d2ac28fe973d4
Author: Parth Chandra <pc...@maprtech.com>
Date:   2017-02-11T01:40:25Z

    DRILL-5351: Minimize bounds checking in var len vectors for Parquet reader

----


> Excessive bounds checking in the Parquet reader 
> ------------------------------------------------
>
>                 Key: DRILL-5351
>                 URL: https://issues.apache.org/jira/browse/DRILL-5351
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Parth Chandra
>
> In profiling the Parquet reader, the variable length decoding appears to be a major bottleneck making the reader CPU bound rather than disk bound.
> A YourKit profile indicates the following methods are severe bottlenecks:
>   VarLenBinaryReader.determineSizeSerial(long)
>   NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
>   DrillBuf.chk(int, int)
>   NullableVarBinaryVector$Mutator.fillEmpties()
> The problem is that each of these methods does some form of bounds checking, and, of course, the actual write to the ByteBuf is bounds checked as well.
> DrillBuf.chk can be disabled by a configuration setting. Disabling it does improve performance of TPCH queries, and all regression, unit, and TPCH-SF100 tests pass with it disabled.
> I would recommend we allow users to turn this check off for performance-critical queries.
> Removing the bounds checking at every level is going to be a fair amount of work. In the meantime, it appears that a few simple changes to variable length vectors improves query performance by about 10% across the board. 
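The fillEmpties optimization mentioned in both the PR description and the profile above can be sketched as follows. This is a hypothetical, simplified offsets structure for illustration, not Drill's actual mutator code:

```java
import java.util.Arrays;

// Sketch of the second optimization: track the highest index written and
// back-fill skipped slots only when a write actually leaves a gap, instead
// of invoking the fill logic unconditionally on every setSafe call.
public class SparseOffsets {
    private final int[] offsets = new int[1024];
    private int lastSet = -1;  // highest index written so far

    public void set(int index, int value) {
        if (index > lastSet + 1) {
            // Rare path: a gap exists, so propagate the previous offset
            // forward (empty variable-length entries have zero width).
            int fill = (lastSet < 0) ? 0 : offsets[lastSet];
            Arrays.fill(offsets, lastSet + 1, index, fill);
        }
        offsets[index] = value;
        lastSet = index;
    }

    public int get(int index) { return offsets[index]; }
}
```

When rows are written densely, the gap test is a single comparison that fails, so the fill logic never runs; the cost is paid only for genuinely sparse data.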



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)