You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2021/10/21 16:26:53 UTC

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17960


Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................

IMPALA-10777: Enable min/max filtering for Iceberg partitions

This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.

Performance

I used the following query to measure perf on a scale 10 TPC-DS
dataset:

 select i_item_id,sum(ss_ext_sales_price) total_sales
 from
         store_sales,
         date_dim,
          customer_address,
          item
 where i_item_id in (select
      i_item_id
 from item
 where i_color in ('orchid','chiffon','lace'))
  and     ss_item_sk              = i_item_sk
  and     ss_sold_date_sk         = d_date_sk
  and     d_year                  = 2000
  and     d_moy                   = 1
  and     ss_addr_sk              = ca_address_sk
  and     ca_gmt_offset           = -8

The above query took the following times to execute:

Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s

Testing:
 * added e2e test
 * planner test could not be added because Iceberg tables behave
   differently during planner tests (due to some hacks that needs
   refactoring)

Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
6 files changed, 44 insertions(+), 5 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 5: Code-Review+2

(1 comment)

Looks great!

+1 for myself. Make it +2 to carry Tamas' +1.

http://gerrit.cloudera.org:8080/#/c/17960/5/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1460
PS5, Line 1460: const RuntimeFilter* filter = GetFilter(filter_idx);
              :     MinMaxFilter* minmax_filter = GetMinMaxFilter(filter);
This is very nice.



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 13:22:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Tamas Mate (Code Review)" <ge...@cloudera.org>.
Tamas Mate has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 4: Code-Review+1

Thanks for the update, LGTM!


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 10:46:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7590/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 21:11:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7587/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 13:10:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tamas Mate, Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17960

to look at the new patch set (#3).

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................

IMPALA-10777: Enable min/max filtering for Iceberg partitions

This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.

Performance

I used the following query to measure perf on a scale 10 TPC-DS
dataset:

 select i_item_id,sum(ss_ext_sales_price) total_sales
 from
         store_sales,
         date_dim,
          customer_address,
          item
 where i_item_id in (select
      i_item_id
 from item
 where i_color in ('orchid','chiffon','lace'))
  and     ss_item_sk              = i_item_sk
  and     ss_sold_date_sk         = d_date_sk
  and     d_year                  = 2000
  and     d_moy                   = 1
  and     ss_addr_sk              = ca_address_sk
  and     ca_gmt_offset           = -8

The above query took the following times to execute:

Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s

Testing:
 * added e2e test
 * planner test could not be added because Iceberg tables behave
   differently during planner tests (due to some hacks that needs
   refactoring)

Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/runtime-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
8 files changed, 80 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS3, Line 678: // We skip row group filtering if the
> I'm using the RuntimeFilter because minmax filter doesn't have the filter d
Done


http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238: 
> Currently we use this field for partition and non-partition columns as well
Yeah, that is a good name.



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 16:25:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 4:

(3 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS3, Line 678: // We skip row group filtering if the
> nit. This call probably can wait after minmax_filter is obtained at line 68
I'm using the RuntimeFilter because minmax filter doesn't have the filter descriptor.


http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238: 
> +1. Sounds like a good idea.
Currently we use this field for partition and non-partition columns as well. How about isColumnInDataFile?


http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@429
PS3, Line 429: SET RUNTIME_FILTER_WAIT_TIME_MS=$RUNTIME_FILTER_WAIT_TIME_MS;
> Missing 
I added it, though it's not really needed since we don't check the results here, only the plan.



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 13:22:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 3:

(2 comments)

Looks great!

http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS3, Line 678: if (!IsDataInDataFile(idx)) continue;
nit. This call probably can wait after minmax_filter is obtained at line 680, at which point we can directly call

minmax_filter->IsDataInDataFile(GetScanNodeId()).


http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238: isDataInDataFile
> nit: this was a bit ambiguous for me and had to read the comment of the isD
+1. Sounds like a good idea.



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 19:58:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 5: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Nov 2021 03:26:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9653/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 15:59:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9636/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 16:48:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 1:

(2 comments)

Looks great!

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
I wonder if FE is in the better position to make the decision.


http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java
File fe/src/main/java/org/apache/impala/catalog/FeTable.java:

http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java@116
PS1, Line 116: isPartitionTransformColumn
nit. May rename it as isComputedPartitioColumn() as the computed column concept is more general and applicable to other table format (in the future).



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 19:06:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> In this patch we set is_bound_by_parttion_columns for the computed partitio
I got it. Thanks for the explanation. 

So in this case, maybe we can add a new field to TRuntimeFilterTargetDesc: is_data_in_data_file. 

At line 678 in this file, we do the test as follows.

  if (IsBoundByPartitionColumn(idx) && !IsDataInDataFile(idx)) {
      continue;
    }

 93                                                                                            
 94 // Specification of a runtime filter target.                                               
 95 struct TRuntimeFilterTargetDesc {                                                          
 96   // Target node id                                                                        
 97   1: Types.TPlanNodeId node_id                                                             
 98                                                                                            
 99   // Expr on which the filter is applied                                                   
100   2: required Exprs.TExpr target_expr                                                      
101                                                                                            
102   // Indicates if 'target_expr' is bound only by partition columns                         
103   3: required bool is_bound_by_partition_columns                                           
104                                                                                            
105   // Slot ids on which 'target_expr' is bound on                                           
106   4: required list<Types.TSlotId> target_expr_slotids                                      
107                                                                                            
108   // Indicates if this target is on the same fragment as the join that                     
109   // produced the runtime filter                                                           
110   5: required bool is_local_target                                                         
111                                                                                            
112   // If the target node is a Kudu scan node, the name, in the case it appears in Kudu, and 
113   // type of the targeted column.                                                          
114   6: optional string kudu_col_name                                                         
115   7: optional Types.TColumnType kudu_col_type;                                             
116                                                                                            
117   // The low and high value as seen in the column stats of the targeted column.            
118   8: optional Data.TColumnValue low_value                                                  
119   9: optional Data.TColumnValue high_value                                                 
120                                                                                            
121   // Indicates if the low and high value in column stats are present                       
122   10: optional bool is_min_max_value_present                                               
123 }


http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS2, Line 678: && IsSimplePartitionedTable()
> For simple partitioned tables we don't want to evaluate the filters at the 
make sense. Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 16:45:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 2:

(2 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> I wonder if FE is in the better position to make the decision.
Yeah, good question how could we make it a bit more elegant. Should I add a field to THdfsTable, e.g. 'has_computed_partitions'?

Maybe as long as we only have computed partitions + Parquet in Iceberg tables we can keep it as it is. Or do you have a better suggestion?


http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java
File fe/src/main/java/org/apache/impala/catalog/FeTable.java:

http://gerrit.cloudera.org:8080/#/c/17960/1/fe/src/main/java/org/apache/impala/catalog/FeTable.java@116
PS1, Line 116: 
> nit. May rename it as isComputedPartitioColumn() as the computed column con
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 09:53:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Tamas Mate (Code Review)" <ge...@cloudera.org>.
Tamas Mate has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 3:

(2 comments)

Hi Zoltan,
Added a readability nit and a test comment, apart from these LGTM.

http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/17960/3/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@238
PS3, Line 238: isDataInDataFile
nit: this was a bit ambiguous for me and had to read the comment of the isDataInDataFile method to understand it. What do you think about using something like: isPartitionColumnValuesInDataFile, isPartitionValuesInDataFile or isPartColValInDataFile


http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17960/3/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@429
PS3, Line 429: select * from functional_parquet.iceberg_partitioned i1,
Missing 
 > SET RUNTIME_FILTER_WAIT_TIME_MS=$RUNTIME_FILTER_WAIT_TIME_MS;



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 16:02:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tamas Mate, Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17960

to look at the new patch set (#2).

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................

IMPALA-10777: Enable min/max filtering for Iceberg partitions

This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.

Performance

I used the following query to measure perf on a scale 10 TPC-DS
dataset:

 select i_item_id,sum(ss_ext_sales_price) total_sales
 from
         store_sales,
         date_dim,
          customer_address,
          item
 where i_item_id in (select
      i_item_id
 from item
 where i_color in ('orchid','chiffon','lace'))
  and     ss_item_sk              = i_item_sk
  and     ss_sold_date_sk         = d_date_sk
  and     d_year                  = 2000
  and     d_moy                   = 1
  and     ss_addr_sk              = ca_address_sk
  and     ca_gmt_offset           = -8

The above query took the following times to execute:

Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s

Testing:
 * added e2e test
 * planner test could not be added because Iceberg tables behave
   differently during planner tests (due to some hacks that needs
   refactoring)

Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
6 files changed, 45 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9642/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 10:14:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7587/


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 19:13:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................

IMPALA-10777: Enable min/max filtering for Iceberg partitions

This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.

Performance

I used the following query to measure perf on a scale 10 TPC-DS
dataset:

 select i_item_id,sum(ss_ext_sales_price) total_sales
 from
         store_sales,
         date_dim,
          customer_address,
          item
 where i_item_id in (select
      i_item_id
 from item
 where i_color in ('orchid','chiffon','lace'))
  and     ss_item_sk              = i_item_sk
  and     ss_sold_date_sk         = d_date_sk
  and     d_year                  = 2000
  and     d_moy                   = 1
  and     ss_addr_sk              = ca_address_sk
  and     ca_gmt_offset           = -8

The above query took the following times to execute:

Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s

Testing:
 * added e2e test
 * planner test could not be added because Iceberg tables behave
   differently during planner tests (due to some hacks that needs
   refactoring)

Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Reviewed-on: http://gerrit.cloudera.org:8080/17960
Reviewed-by: Qifan Chen <qc...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/runtime-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
9 files changed, 89 insertions(+), 22 deletions(-)

Approvals:
  Qifan Chen: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> Yeah, good question how could we make it a bit more elegant. Should I add a
Ideally, is_computed_column can be a new boolean field to TColumn, together with an expression on how the computation is done, such EXTRACT(DAY FROM timestamp_col).

For this patch, looks like we may not need that at all if we can set is_bound_by_partition_columns properly in FE even for the computed column. 

120   bool IsBoundByPartitionColumn(int plan_id) const {                                       
121     int target_ndx = filter_desc().planid_to_target_ndx.at(plan_id);                       
122     return filter_desc().targets[target_ndx].is_bound_by_partition_columns;                
123   }


http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS2, Line 678: && IsSimplePartitionedTable()
Since a computed partition column is a partition column, I wonder if there is a need to make a test here?



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 13:41:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 1:

(1 comment)

Thanks for the comments.

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> I got it. Thanks for the explanation. 
Added is_data_in_file to TRuntimeFilterTargetDesc.



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 15:39:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tamas Mate, Qifan Chen, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17960

to look at the new patch set (#4).

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................

IMPALA-10777: Enable min/max filtering for Iceberg partitions

This patch enables min/max filters for Iceberg columns that
participate in table partitioning. The min/max filters are
evaluated at the Parquet row group level. This means that it
is still slower than dynamic partition pruning (which doesn't
even need to open the files), but much faster than no pruning at all.

Performance

I used the following query to measure perf on a scale 10 TPC-DS
dataset:

 select i_item_id,sum(ss_ext_sales_price) total_sales
 from
         store_sales,
         date_dim,
          customer_address,
          item
 where i_item_id in (select
      i_item_id
 from item
 where i_color in ('orchid','chiffon','lace'))
  and     ss_item_sk              = i_item_sk
  and     ss_sold_date_sk         = d_date_sk
  and     d_year                  = 2000
  and     d_moy                   = 1
  and     ss_addr_sk              = ca_address_sk
  and     ca_gmt_offset           = -8

The above query took the following times to execute:

Regular Parquet table: 1.16s
Iceberg table without min/max filters: 4.39s
Iceberg table with min/max filters: 1.77s

Testing:
 * added e2e test
 * planner test could not be added because Iceberg tables behave
   differently during planner tests (due to some hacks that needs
   refactoring)

Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
---
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/runtime-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/FeTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
9 files changed, 89 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17960/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9677/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 13:45:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10777: Enable min/max filtering for Iceberg partitions

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17960 )

Change subject: IMPALA-10777: Enable min/max filtering for Iceberg partitions
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1323
PS1, Line 1323: if (scan_node_->hdfs_table()->IsIcebergTable()) return false;
> Ideally, is_computed_column can be a new boolean field to TColumn, together
In this patch we set is_bound_by_parttion_columns for the computed partition columns.


http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17960/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@678
PS2, Line 678: && IsSimplePartitionedTable()
> Since a computed partition column is a partition column, I wonder if there 
For simple partitioned tables we don't want to evaluate the filters at the row group level as they are not stored in the parquet files. Hence we 'continue' here.

But for computed partitions we want to evaluate the filters since the columns are stored in the data files.



-- 
To view, visit http://gerrit.cloudera.org:8080/17960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I51b53188c6da7eeebfeae385e1de31ace0980cac
Gerrit-Change-Number: 17960
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 15:25:59 +0000
Gerrit-HasComments: Yes