Posted to reviews@impala.apache.org by "guojingfeng (Code Review)" <ge...@cloudera.org> on 2020/11/12 10:27:12 UTC

[Impala-ASF-CR] IMPALA-10310: Fix couldn't skip rows in parquet file on NextRowGroup

guojingfeng has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16697 )

Change subject: IMPALA-10310: Fix couldn't skip rows in parquet file on NextRowGroup
......................................................................

IMPALA-10310: Fix couldn't skip rows in parquet file on NextRowGroup

In practice we recommend aligning the HDFS block size with the Parquet
row group size. However, some compute engines such as Spark use a
default Parquet row group size of 128MB, and if the ETL user doesn't
change that default, Spark can generate row groups that are smaller
than the HDFS block size. As a result, a single HDFS block may contain
multiple Parquet row groups.

In the planner stage, the length of an Impala-generated scan range may
be larger than the row group size, so a single scan range can contain
multiple row groups. In the current Parquet scanner, moving to the next
row group requires resetting some internal state in the Parquet column
readers, e.g. num_buffered_values_, the column chunk metadata, and the
internal state of the column chunk readers. However, the
current_row_range_ offset is currently not reset, which causes the error
"Couldn't skip rows in file hdfs://xxx" reported in IMPALA-10310.

This patch simply resets current_row_range_ to 0 when moving to the
next row group in the Parquet column readers, which fixes IMPALA-10310.
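
For illustration only, here is a minimal sketch of why a stale range
index breaks row skipping. It is not the Impala code: SketchColumnReader,
RowRange, ResetForRowGroup and SkipToRow are hypothetical names, and the
reset_range_index flag exists only to contrast the behavior before and
after the fix. The only name taken from the patch is current_row_range_.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a candidate row range selected
// by the page index. Row indexes are relative to the row group.
struct RowRange {
  int64_t first;
  int64_t last;
};

// Hypothetical, simplified column reader; not Impala's actual class.
class SketchColumnReader {
 public:
  // Called when the scanner moves to a new row group. The flag models
  // the behavior without the fix (false) and with the fix (true).
  void ResetForRowGroup(std::vector<RowRange> ranges, bool reset_range_index) {
    ranges_ = std::move(ranges);
    if (reset_range_index) current_row_range_ = 0;
  }

  // Advances to the candidate range containing 'row'. Returns false if
  // no range at or after the current index contains it.
  bool SkipToRow(int64_t row) {
    assert(row >= 0);
    while (current_row_range_ < ranges_.size() &&
           ranges_[current_row_range_].last < row) {
      ++current_row_range_;
    }
    return current_row_range_ < ranges_.size() &&
           ranges_[current_row_range_].first <= row;
  }

 private:
  std::vector<RowRange> ranges_;
  // Index into ranges_; stale if carried over from the previous row group.
  std::size_t current_row_range_ = 0;
};
```

In this sketch, a reader that ended the first row group on its second
range and is then handed a row group with a single range fails to skip
unless the index is reset, which mirrors the symptom described above.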

Testing:
* Added an e2e test for a parquet file with multiple blocks per file
  and multiple pages per block.
* Ran all core tests offline.
* Manually tested all cases encountered in my production environment.

Change-Id: I964695cd53f5d5fdb6485a85cd82e7a72ca6092c
---
M be/src/exec/parquet/parquet-column-readers.cc
M testdata/data/README
A testdata/data/customer_multiblock_page_index.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M tests/query_test/test_parquet_stats.py
5 files changed, 34 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/16697/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16697
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I964695cd53f5d5fdb6485a85cd82e7a72ca6092c
Gerrit-Change-Number: 16697
Gerrit-PatchSet: 3
Gerrit-Owner: guojingfeng <gu...@tencent.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: guojingfeng <gu...@tencent.com>