Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2019/03/14 16:58:24 UTC

[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Hello Michael Ho, Lars Volker, Pooja Nilangekar, Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12065

to look at the new patch set (#6).

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................

IMPALA-5843: Use page index in Parquet files to skip pages

This commit implements page filtering based on the Parquet page index.

The HdfsParquetScanner reads and evaluates the page index. First, we
determine the row ranges we are interested in; then, based on those
row ranges, we determine the candidate pages for each column that we
are reading.
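
As a rough illustration of these two steps, here is a minimal Python
sketch (not the C++ code in this change). It assumes a single range
predicate on one integer column, per-page min/max values as stored in
the Parquet column index, and per-page first row indexes as stored in
the offset index. The function names and the inclusive row-range
representation are hypothetical.

```python
from typing import List, Tuple

# Inclusive [first, last] row indexes within a row group (hypothetical representation).
RowRange = Tuple[int, int]


def page_last_row(page_first_rows: List[int], num_rows: int, i: int) -> int:
    """Last row stored in page i, derived from the next page's first row."""
    next_first = page_first_rows[i + 1] if i + 1 < len(page_first_rows) else num_rows
    return next_first - 1


def row_ranges_from_stats(page_first_rows: List[int],
                          page_min_max: List[Tuple[int, int]],
                          num_rows: int, lo: int, hi: int) -> List[RowRange]:
    """Row ranges of pages whose [min, max] may contain a value in [lo, hi]."""
    ranges: List[RowRange] = []
    for i, (pmin, pmax) in enumerate(page_min_max):
        if pmax < lo or pmin > hi:
            continue  # the statistics prove that no row in this page matches
        first = page_first_rows[i]
        last = page_last_row(page_first_rows, num_rows, i)
        if ranges and ranges[-1][1] + 1 == first:
            ranges[-1] = (ranges[-1][0], last)  # extend the adjacent range
        else:
            ranges.append((first, last))
    return ranges


def candidate_pages(page_first_rows: List[int], num_rows: int,
                    ranges: List[RowRange]) -> List[int]:
    """Indexes of the pages that overlap at least one row range."""
    result = []
    for i, first in enumerate(page_first_rows):
        last = page_last_row(page_first_rows, num_rows, i)
        if any(first <= r_last and r_first <= last for r_first, r_last in ranges):
            result.append(i)
    return result


# Column A carries the predicate "a BETWEEN 12 AND 15"; only its second page qualifies.
ranges = row_ranges_from_stats([0, 100, 200], [(1, 10), (11, 20), (21, 30)], 300, 12, 15)
# Column B has different page boundaries, so two of its pages overlap rows 100..199.
pages_b = candidate_pages([0, 150, 250], 300, ranges)
```

Here `ranges` is `[(100, 199)]` and `pages_b` is `[0, 1]`: the row
ranges are computed once and then mapped onto each column's own page
boundaries.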

We still issue one ScanRange per column chunk, but we specify
sub-ranges that cover the candidate pages, i.e. we don't read
the whole column chunk, only parts of it.
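
The sub-range computation can be sketched as follows. This is an
illustrative Python sketch, not the ScanRange API used in this change.
`PageLocation` mirrors the offset and size fields of the Parquet offset
index entries; `sub_ranges` is a hypothetical helper, and merging
contiguous pages into one sub-range is an assumption of the sketch.

```python
from typing import List, NamedTuple, Tuple


class PageLocation(NamedTuple):
    offset: int                # file offset of the page
    compressed_page_size: int  # size of the page in bytes, as recorded in the offset index


def sub_ranges(pages: List[PageLocation],
               candidates: List[int]) -> List[Tuple[int, int]]:
    """(offset, length) byte ranges covering the candidate pages.

    `candidates` must be sorted page indexes. Pages that are contiguous
    in the file are merged into a single sub-range.
    """
    result: List[Tuple[int, int]] = []
    for i in candidates:
        page = pages[i]
        if result and result[-1][0] + result[-1][1] == page.offset:
            off, length = result[-1]
            result[-1] = (off, length + page.compressed_page_size)
        else:
            result.append((page.offset, page.compressed_page_size))
    return result


pages = [PageLocation(4, 100), PageLocation(104, 50),
         PageLocation(154, 80), PageLocation(234, 20)]
# Pages 0 and 1 are adjacent in the file, page 2 is skipped, page 3 is read on its own.
selected = sub_ranges(pages, [0, 1, 3])
```

With these inputs `selected` is `[(4, 150), (234, 20)]`, so the 80
bytes of page 2 are never read.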

Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B.
This means we need row-skipping logic when we read the data pages.
This logic is implemented in BaseScalarColumnReader and
ScalarColumnReader. Collection column readers know nothing about
page filtering.
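
To show why row skipping is needed, the following Python sketch plans,
for each candidate page of one column, how many rows to skip and how
many to read. It is a simplified model, not the logic in
BaseScalarColumnReader: `plan_page_reads` is a hypothetical name, row
ranges are assumed to be sorted, non-overlapping and inclusive, and
only flat (non-nested) columns are considered.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # inclusive [first, last] row indexes (hypothetical representation)


def plan_page_reads(page_first_rows: List[int], num_rows: int,
                    candidates: List[int], ranges: List[RowRange]
                    ) -> List[Tuple[int, List[Tuple[int, int]]]]:
    """For each candidate page, return (page_index, [(skip_count, read_count), ...]).

    Rows left over at the end of a page after the last step are dropped.
    """
    plan = []
    for i in candidates:
        first = page_first_rows[i]
        next_first = page_first_rows[i + 1] if i + 1 < len(page_first_rows) else num_rows
        last = next_first - 1
        pos = first  # next row the reader would decode in this page
        steps = []
        for r_first, r_last in ranges:
            start, end = max(first, r_first), min(last, r_last)
            if start > end:
                continue  # this range does not touch the page
            steps.append((start - pos, end - start + 1))
            pos = end + 1
        plan.append((i, steps))
    return plan


# Column B: pages start at rows 0, 150 and 250; rows 100..199 are wanted.
plan = plan_page_reads([0, 150, 250], 300, [0, 1], [(100, 199)])
```

Here `plan` is `[(0, [(100, 50)]), (1, [(0, 50)])]`: the reader skips
the first 100 rows of page 0 and reads the remaining 50, then reads the
first 50 rows of page 1 and ignores the rest.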

Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.

Testing:
 * added some unit tests for the row range and
   page selection logic
 * generated various Parquet files with Parquet-MR
 * enabled page index writing and wrote selective queries against
   tables written by Impala. Existing tests are likely to use page
   filtering transparently.

Performance:
 * measured locally, observed 3x to 10x speedup for selective queries
 * TODO:
   * run standard benchmarks
   * measure performance for remote reads

Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
---
M be/src/common/global-flags.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
A be/src/exec/parquet/parquet-common-test.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-level-decoder.h
A be/src/exec/parquet/parquet-page-index.cc
A be/src/exec/parquet/parquet-page-index.h
M be/src/exprs/literal.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/dict-encoding.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/data/README
A testdata/data/alltypes_tiny_pages.parquet
A testdata/data/decimals_1_10.parquet
A testdata/data/double_nested_decimals.parquet
A testdata/data/nested_decimals.parquet
A testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-large.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
M tests/common/test_result_verifier.py
M tests/query_test/test_parquet_stats.py
34 files changed, 2,663 insertions(+), 81 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/12065/6
-- 
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>