You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2021/09/26 11:37:46 UTC

[Impala-ASF-CR] IMPALA-10873: Push down EQUALS, IS NULL and IN-list predicate to ORC reader

Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17815

to look at the new patch set (#3).

Change subject: IMPALA-10873: Push down EQUALS, IS NULL and IN-list predicate to ORC reader
......................................................................

IMPALA-10873: Push down EQUALS, IS NULL and IN-list predicate to ORC reader

This patch pushs down more kinds of predicates into the ORC reader,
including EQUALS, IN-list, and IS-NULL predicates to have more
improvements:
 - EQUALS and IN-list predicates can be evaluated inside the ORC reader
   with bloom filters in the ORC files.
 - Comparing to scanning parquet that converting an IN-list predicate
   into two binary predicates (i.e. LE and GE), the ORC reader can
   leverage IN-list predicates to skip ORC RowGroups. E.g. a RowGroup
   with int column 'x' in range [1, 100] will be skipped if we push down
   predicate "x in (0, 101)".
 - IS-NULL predicates (including IS-NOT-NULL) can also be used in the
   ORC reader to skip RowGroups.

Implementation:
FE will collect these kinds of predicates into 'min_max_conjuncts' of
THdfsScanNode. To better reflect the meaning, 'min_max_conjuncts' is
renamed to 'stats_conjuncts'. Same for other related variable names.
Parquet scanner will only pick binary min-max conjuncts (i.e. LT, GT,
LE, and GE) to keep the existing behavior. ORC scanner will build
SearchArgument based on these conjuncts.

Tests
 * Add test in orc-stats.test

Change-Id: Iaa89f080fe2e87d94fc8ea7f1be83e087fa34225
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/orc-stats.test
9 files changed, 555 insertions(+), 153 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/17815/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17815
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaa89f080fe2e87d94fc8ea7f1be83e087fa34225
Gerrit-Change-Number: 17815
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>