You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2021/10/21 16:26:53 UTC
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17960
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
IMPALA-10777: Enable min/max filtering for Iceberg partitions
This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.
Performance
I used the following query to measure perf on a scale 10 TPC-DS
dataset:
select i_item_id,sum(ss_ext_sales_price) total_sales
from
store_sales,
date_dim,
customer_address,
item
where i_item_id in (select
i_item_id
from item
where i_color in ('orchid','chiffon','lace'))
and ss_item_sk = i_item_sk
and ss_sold_date_sk = d_date_sk
and d_year = 2000
and d_moy = 1
and ss_addr_sk = ca_address_sk
and ca_gmt_offset = -8
The above query took the following times to execute:
Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s
Testing:
* added e2e test
* planner test could not be added because Iceberg tables behave
differently during planner tests (due to some hacks that needs
refactoring)
Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
6 files changed, 44 insertions(+), 5 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/1
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 5: Code-Review+2
(1 comment)
Looks great!
+1 for myself. Make it +2 to carry Tamas' +1.
http://gerrit.cloudera.org:8080/#/c/17960/5/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1460
PS5, Line 1460: const RuntimeFilter* filter = GetFilter(filter_idx);
: MinMaxFilter* minmax_filter = GetMinMaxFilter(filter);
This is very nice.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 13:22:42 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Tamas Mate (Code Review)" <ge...@cloudera.org>.
Tamas Mate has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 4: Code-Review+1
Thanks for the update, LGTM!
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 10:46:53 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 5:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7590/ DRY_RUN=false
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 21:11:11 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 5:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7587/ DRY_RUN=false
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 13:10:12 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tamas Mate, Qifan Chen, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/17960
to look at the new patch set (#3).
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
IMPALA-10777: Enable min/max filtering for Iceberg partitions
This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.
Performance
I used the following query to measure perf on a scale 10 TPC-DS
dataset:
select i_item_id,sum(ss_ext_sales_price) total_sales
from
store_sales,
date_dim,
customer_address,
item
where i_item_id in (select
i_item_id
from item
where i_color in ('orchid','chiffon','lace'))
and ss_item_sk = i_item_sk
and ss_sold_date_sk = d_date_sk
and d_year = 2000
and d_moy = 1
and ss_addr_sk = ca_address_sk
and ca_gmt_offset = -8
The above query took the following times to execute:
Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s
Testing:
* added e2e test
* planner test could not be added because Iceberg tables behave
differently during planner tests (due to some hacks that needs
refactoring)
Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/runtime-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
8 files changed, 80 insertions(+), 10 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/3
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 4:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS3, Line 678: // We skip row group filtering if the
> I'm using the RuntimeFilter because minmax filter doesn't have the filter d
Done
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238:
> Currently we use this field for partition and non-partition columns as well
Yeah, that is a good name.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 16:25:57 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 4:
(3 comments)
Thanks for the comments!
http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS3, Line 678: // We skip row group filtering if the
> nit. This call probably can wait after minmax_filter is obtained at line 68
I'm using the RuntimeFilter because minmax filter doesn't have the filter descriptor.
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238:
> +1. Sounds like a good idea.
Currently we use this field for partition and non-partition columns as well. How about isColumnInDataFile?
http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:
http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@429
PS3, Line 429: SET RUNTIME_FILTER_WAIT_TIME_MS=$RUNTIME_FILTER_WAIT_TIME_MS;
> Missing
I added it, though it's not really needed since we don't check the results here, only the plan.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 13:22:03 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 3:
(2 comments)
Looks great!
http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS3, Line 678: if (!IsDataInDataFile(idx)) continue;
nit. This call probably can wait after minmax_filter is obtained at line 680, at which point we can directly call
minmax_filter->IsDataInDataFile(GetScanNodeId()).
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238: isDataInDataFile
> nit: this was a bit ambiguous for me and had to read the comment of the isD
+1. Sounds like a good idea.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 19:58:42 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 5: Verified+1
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Nov 2021 03:26:26 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 3:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/9653/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 15:59:38 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 1:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/9636/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 16:48:13 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 1:
(2 comments)
Looks great!
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
I wonder if FE is in the better position to make the decision.
http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java
File fe/src/main/java/org/apache/impala/catalog/FeTable.java:
http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java@116
PS1, Line 116: isPartitionTransformColumn
nit. May rename it as isComputedPartitioColumn() as the computed column concept is more general and applicable to other table format (in the future).
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 19:06:25 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 2:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> In this patch we set is_bound_by_parttion_columns for the computed partitio
I got it. Thanks for the explanation.
So in this case, maybe we can add a new field to TRuntimeFilterTargetDesc: is_data_in_data_file.
At line 678 in this file, we do the test as follows.
if (IsBoundByPartitionColumn(idx) && !IsDataInDataFile(idx)) {
continue;
}
93
94 // Specification of a runtime filter target.
95 struct TRuntimeFilterTargetDesc {
96 // Target node id
97 1: Types.TPlanNodeId node_id
98
99 // Expr on which the filter is applied
100 2: required Exprs.TExpr target_expr
101
102 // Indicates if 'target_expr' is bound only by partition columns
103 3: required bool is_bound_by_partition_columns
104
105 // Slot ids on which 'target_expr' is bound on
106 4: required list<Types.TSlotId> target_expr_slotids
107
108 // Indicates if this target is on the same fragment as the join that
109 // produced the runtime filter
110 5: required bool is_local_target
111
112 // If the target node is a Kudu scan node, the name, in the case it appears in Kudu, and
113 // type of the targeted column.
114 6: optional string kudu_col_name
115 7: optional Types.TColumnType kudu_col_type;
116
117 // The low and high value as seen in the column stats of the targeted column.
118 8: optional Data.TColumnValue low_value
119 9: optional Data.TColumnValue high_value
120
121 // Indicates if the low and high value in column stats are present
122 10: optional bool is_min_max_value_present
123 }
http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS2, Line 678: && IsSimplePartitionedTable()
> For simple partitioned tables we don't want to evaluate the filters at the
make sense. Done
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 16:45:12 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 2:
(2 comments)
Thanks for the comments!
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> I wonder if FE is in the better position to make the decision.
Yeah, good question how could we make it a bit more elegant. Should I add a field to THdfsTable, e.g. 'has_computed_partitions'?
Maybe as long as we only have computed partitions + Parquet in Iceberg tables we can keep it as it is. Or do you have a better suggestion?
http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java
File fe/src/main/java/org/apache/impala/catalog/FeTable.java:
http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java@116
PS1, Line 116:
> nit. May rename it as isComputedPartitioColumn() as the computed column con
Done
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 09:53:34 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Tamas Mate (Code Review)" <ge...@cloudera.org>.
Tamas Mate has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 3:
(2 comments)
Hi Zoltan,
Added a readability nit and a test comment, apart from these LGTM.
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:
http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238: isDataInDataFile
nit: this was a bit ambiguous for me and had to read the comment of the isDataInDataFile method to understand it. What do you think about using something like: isPartitionColumnValuesInDataFile, isPartitionValuesInDataFile or isPartColValInDataFile
http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:
http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@429
PS3, Line 429: select * from functional_parquet.iceberg_partitioned i1,
Missing
> SET RUNTIME_FILTER_WAIT_TIME_MS=$RUNTIME_FILTER_WAIT_TIME_MS;
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 16:02:12 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tamas Mate, Qifan Chen, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/17960
to look at the new patch set (#2).
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
IMPALA-10777: Enable min/max filtering for Iceberg partitions
This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.
Performance
I used the following query to measure perf on a scale 10 TPC-DS
dataset:
select i_item_id,sum(ss_ext_sales_price) total_sales
from
store_sales,
date_dim,
customer_address,
item
where i_item_id in (select
i_item_id
from item
where i_color in ('orchid','chiffon','lace'))
and ss_item_sk = i_item_sk
and ss_sold_date_sk = d_date_sk
and d_year = 2000
and d_moy = 1
and ss_addr_sk = ca_address_sk
and ca_gmt_offset = -8
The above query took the following times to execute:
Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s
Testing:
* added e2e test
* planner test could not be added because Iceberg tables behave
differently during planner tests (due to some hacks that needs
refactoring)
Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
6 files changed, 45 insertions(+), 5 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/2
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 2:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/9642/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 10:14:30 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 5: Verified-1
Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7587/
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 19:13:57 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
IMPALA-10777: Enable min/max filtering for Iceberg partitions
This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.
Performance
I used the following query to measure perf on a scale 10 TPC-DS
dataset:
select i_item_id,sum(ss_ext_sales_price) total_sales
from
store_sales,
date_dim,
customer_address,
item
where i_item_id in (select
i_item_id
from item
where i_color in ('orchid','chiffon','lace'))
and ss_item_sk = i_item_sk
and ss_sold_date_sk = d_date_sk
and d_year = 2000
and d_moy = 1
and ss_addr_sk = ca_address_sk
and ca_gmt_offset = -8
The above query took the following times to execute:
Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s
Testing:
* added e2e test
* planner test could not be added because Iceberg tables behave
differently during planner tests (due to some hacks that needs
refactoring)
Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Reviewed-on: http://gerrit.cloudera.org:8080/17960
Reviewed-by: Qifan Chen <qc...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/runtime-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
9 files changed, 89 insertions(+), 22 deletions(-)
Approvals:
Qifan Chen: Looks good to me, approved
Impala Public Jenkins: Verified
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 2:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> Yeah, good question how could we make it a bit more elegant. Should I add a
Ideally, is_computed_column can be a new boolean field to TColumn, together with an expression on how the computation is done, such EXTRACT(DAY FROM timestamp_col).
For this patch, looks like we may not need that at all if we can set is_bound_by_partition_columns properly in FE even for the computed column.
120 bool IsBoundByPartitionColumn(int plan_id) const {
121 int target_ndx = filter_desc().planid_to_target_ndx.at(plan_id);
122 return filter_desc().targets[target_ndx].is_bound_by_partition_columns;
123 }
http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS2, Line 678: && IsSimplePartitionedTable()
Since a computed partition column is a partition column, I wonder if there is a need to make a test here?
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 13:41:36 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 1:
(1 comment)
Thanks for the comments.
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> I got it. Thanks for the explanation.
Added is_data_in_file to TRuntimeFilterTargetDesc.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 15:39:49 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tamas Mate, Qifan Chen, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/17960
to look at the new patch set (#4).
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
IMPALA-10777: Enable min/max filtering for Iceberg partitions
This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.
Performance
I used the following query to measure perf on a scale 10 TPC-DS
dataset:
select i_item_id,sum(ss_ext_sales_price) total_sales
from
store_sales,
date_dim,
customer_address,
item
where i_item_id in (select
i_item_id
from item
where i_color in ('orchid','chiffon','lace'))
and ss_item_sk = i_item_sk
and ss_sold_date_sk = d_date_sk
and d_year = 2000
and d_moy = 1
and ss_addr_sk = ca_address_sk
and ca_gmt_offset = -8
The above query took the following times to execute:
Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s
Testing:
* added e2e test
* planner test could not be added because Iceberg tables behave
differently during planner tests (due to some hacks that needs
refactoring)
Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/runtime-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
9 files changed, 89 insertions(+), 22 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/4
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 4:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/9677/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 13:45:08 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )
Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................
Patch Set 2:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> Ideally, is_computed_column can be a new boolean field to TColumn, together
In this patch we set is_bound_by_parttion_columns for the computed partition columns.
http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS2, Line 678: && IsSimplePartitionedTable()
> Since a computed partition column is a partition column, I wonder if there
For simple partitioned tables we don't want to evaluate the filters at the row group level as they are not stored in the parquet files. Hence we 'continue' here.
But for computed partitions we want to evaluate the filters since the columns are stored in the data files.
--
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 15:25:59 +0000
Gerrit-HasComments: Yes