Posted to issues@drill.apache.org by "Zelaine Fong (JIRA)" <ji...@apache.org> on 2017/03/16 16:35:41 UTC
[jira] [Assigned] (DRILL-5351) Excessive bounds checking in the Parquet reader
[ https://issues.apache.org/jira/browse/DRILL-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zelaine Fong reassigned DRILL-5351:
-----------------------------------
Assignee: Parth Chandra
> Excessive bounds checking in the Parquet reader
> ------------------------------------------------
>
> Key: DRILL-5351
> URL: https://issues.apache.org/jira/browse/DRILL-5351
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Parth Chandra
> Assignee: Parth Chandra
>
> In profiling the Parquet reader, the variable length decoding appears to be a major bottleneck making the reader CPU bound rather than disk bound.
> A YourKit profile indicates that the following methods are severe bottlenecks:
> VarLenBinaryReader.determineSizeSerial(long)
> NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
> DrillBuf.chk(int, int)
> NullableVarBinaryVector$Mutator.fillEmpties()
> The problem is that each of these methods does some form of bounds checking, and the actual write to the ByteBuf is, of course, bounds checked as well.
> DrillBuf.chk can be disabled by a configuration setting. Disabling this does improve performance of TPCH queries. In addition, all regression, unit, and TPCH-SF100 tests pass.
> I would recommend we allow users to turn this check off for performance-critical queries.
> Removing the bounds checking at every level is going to be a fair amount of work. In the meantime, it appears that a few simple changes to the variable-length vectors improve query performance by about 10% across the board.
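To illustrate the layering the issue describes, here is a minimal, hypothetical Java sketch (not Drill's actual code; class and method names are invented for illustration). It shows the per-element pattern, where a caller-side check sits on top of the check the underlying buffer already performs, alongside a batch-style alternative that validates the whole range once before writing:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of redundant vs. amortized bounds checking.
// ByteBuffer stands in for DrillBuf; the names below are not Drill APIs.
public class BoundsCheckSketch {
    private final ByteBuffer buf;
    private final boolean checkBounds;  // analogous to a config toggle

    BoundsCheckSketch(int capacity, boolean checkBounds) {
        this.buf = ByteBuffer.allocate(capacity);
        this.checkBounds = checkBounds;
    }

    // Per-element style: an explicit check before every write, layered
    // on top of the check ByteBuffer.put(int, byte) performs internally.
    void setSafe(int index, byte value) {
        if (checkBounds && (index < 0 || index >= buf.capacity())) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        buf.put(index, value);  // the buffer re-checks bounds here anyway
    }

    // Batch style: validate the whole range once up front, so the cost
    // of the explicit check is amortized over every element written.
    void setRange(int start, byte[] values) {
        if (checkBounds
                && (start < 0 || start + values.length > buf.capacity())) {
            throw new IndexOutOfBoundsException("range at " + start);
        }
        for (int i = 0; i < values.length; i++) {
            buf.put(start + i, values[i]);
        }
    }

    byte get(int index) {
        return buf.get(index);
    }

    public static void main(String[] args) {
        BoundsCheckSketch v = new BoundsCheckSketch(8, true);
        v.setSafe(0, (byte) 42);
        v.setRange(1, new byte[]{1, 2, 3});
        System.out.println(v.get(0) + "," + v.get(3));  // prints 42,3
    }
}
```

In a hot loop decoding millions of variable-length values, the per-element pattern pays the caller's check plus the buffer's internal check on every write, which is consistent with the CPU-bound profile reported above; the batch variant keeps safety while checking once per range.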
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)