You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2016/04/29 04:46:42 UTC

[Impala-CR](cdh5-trunk) Optimized ReadValueBatch() for Parquet scalar column readers.

Alex Behm has uploaded a new patch set (#2).

Change subject: Optimized ReadValueBatch() for Parquet scalar column readers.
......................................................................

Optimized ReadValueBatch() for Parquet scalar column readers.

This change builds on top of the recent move to column-wise
materialization of scalar values in the Parquet scanner.

The goal of this patch is to improve the scan efficiency, and
show the future direction for all column readers.

Major TODO:
The current patch has minor code duplication/redundancy,
and the new ReadValueBatch() departs from (but improves)
the existing scannre control flow. To improve code reuse and
readability we should overhaul all scanners to be more uniform.

Summary of changes:
- refactor ReadValueBatch() to simplify control flow
- introduce caching of def/rep levels for faster level
  decoding, and for a tigher value materialization loop
- new templated function for value materialization that
  takes the value encoding as a template argument

Mini benchmark vs. cdh5-trunk
I ran the following queries on a single impalad before and after my
change using a synthetic 'huge_lineitem' table.
I modified hdfs-scan-node.cc to set the number of rows of any row
batch to 0 to focus the measurement on the scan time.

Query options:
set num_scanner_threads=1;
set disable_codegen=true;
set num_nodes=1;

select * from huge_lineitem;
Before: 22.39s
Afer:   13.62s

select * from huge_lineitem where l_linenumber < 0;
Before: 25.11s
After:  17.73s

select * from huge_lineitem where l_linenumber % 2 = 0;
Before: 26.32s
After:  16.68s

select l_linenumber from huge_lineitem;
Before: 1.74s
After:  0.92s

Testing:
I ran a private core build and all tests passed.

Change-Id: I21fa9b050a45f2dd45cc0091ea5b008d3c0a3f30
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/rle-encoding.h
3 files changed, 291 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2843/2
-- 
To view, visit http://gerrit.cloudera.org:8080/2843
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21fa9b050a45f2dd45cc0091ea5b008d3c0a3f30
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>