You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Gabor Kaszab (Code Review)" <ge...@cloudera.org> on 2023/02/24 10:55:58 UTC

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/19534


Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only
filter on partition columns then Iceberg can filter the files based
on this and using these filter in Impala wouldn't filter any more
rows from the output (taking into account that no partition evolution)
was performed on the table.

An additional benefit of not down no-op predicates is that we can
potentially materialize less slots. For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 290 insertions(+), 65 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/1
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#5).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on partition columns then Iceberg can
filter the files and using these filters in Impala wouldn't filter any
more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 311 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/5
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
> We don't have the query rewrite yet, but probably we still get the Parquet 
You're right, the Parquet scanner goes into the "count * optimisation path" and the query is answered from stats if there are no predicates pushed down for a count * query. I added a test to cover this.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64:     super(id, tblRef.getDesc(), conjuncts,
> Though if we have the following case:
Ahh, good point. No worries, I'll re-retire 'nonIdentityConjuncts_' then :)


http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@503
PS9, Line 503: anyMatch
> Nit: instead of negating anyMatch, you could use noneMatch.
Done
Update: Anyway, I restored this function to the original version so this is not relevant anymore.



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 14 Apr 2023 07:26:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 16: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Apr 2023 15:22:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@16
PS1, Line 16: t (ass
> Nit: filters.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17
PS1, Line 17: ).
> Wouldn't "assuming" be better?
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17
PS1, Line 17: 
> Nit: closing parenthesis should come at the end of the sentence.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20
PS1, Line 20: scanner is that we can potentially materialize les
> Am I right that here we mean pushing down predicates to the scanner, as opp
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20
PS1, Line 20: tentiall
> not pushing down
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@28
PS1, Line 28: 
> We could include a Testing section describing what tests have been added/mo
Done


http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG@20
PS3, Line 20: iall
> nit: pushing down?
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1390
PS3, Line 1390: to
> 'is' is not needed here?
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2027
PS3, Line 2027: indentPrefix
> 'explainLevel' may not be the best name as it could be confused with the ve
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: 
> The comment on 'nonIdentityConjuncts_' says that it is a subset of 'conjunc
'nonIdentityConjuncts_' is in fact a subset of 'conjuncts_' that doesn't include conjuncts on identity partitioned columns.
'skippedConjuncts_' indeed contains identity conjuncts but not necessarily all of them just the ones that won't filter any further rows.

OTOH As I see 'nonIdentityConjuncts_' is used for cardinality estimates. However, I wonder if from this point on can we drop use 'conjuncts_' - 'skippedConjuncts_' for this purpose.
@Zoltan, do you have a thought on this?


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@115
PS3, Line 115: has 
> Nit: has
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@241
PS3, Line 241: 
> I don't think we need this as this is the DELETE SCAN node which wouldn't e
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@388
PS3, Line 388: : %s", getIceTable().
> nit: could be 'Collections.emptyList()'
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql@3265
PS3, Line 3265: ---- DEPENDENT_LOAD
              : # We can use 'date_string_col' as it is once IMPALA-11954 is done.
              : INSERT INTO {db_name}{db_suffix}.iceberg_partition_evolution
> Since we are writing the table via Impala we probably don't need to set the
Indeed, done


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@49
PS3, Line 49: could no
> Could not be?
Indeed :)


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@990
PS3, Line 990: event_time='2020-01-01 11:00:00';
> iceberg_partitioned is partitioned by HOUR(event_time), how can we skip pus
I had to check, but apparently we push down predicates of an HOUR() partition to Iceberg. Found this in the logs:
"Push down the predicate: ref(name="event_time") == 1577872800000000 to iceberg"
I'm wondering if this is because the partition value contains date+hour and not just hour. "event_time_hour=2020-01-01-08"

In this particular test (due to some timezone mismatch) Iceberg filters out all the files using this pushed down filter, hence it shows up in "skipped Iceberg predicates".


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test@51
PS3, Line 51: predicat
> Nit: predicate
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Mar 2023 15:49:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 9: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
> That count * optimization to get rid of the SCAN nodes when we eliminate al
We don't have the query rewrite yet, but probably we still get the Parquet count-star optimization, i.e. a count-star slot is created and the scanners only read the footer, they create a single tuple per Parquet row group which holds the row count, i.e. no need to read the Parquet def/rep-levels, etc.

Or what does the EXPLAIN say? Can we add a planner test for count(*)?



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Mar 2023 14:50:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#13).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 328 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/13
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12487/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Feb 2023 12:55:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 3:

(5 comments)

Left some minor comments, but really nice work!

http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG@20
PS3, Line 20: down
nit: pushing down?


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@241
PS3, Line 241: getSkippedConjuncts()
I don't think we need this as this is the DELETE SCAN node which wouldn't evaluate any conjuncts anyway.


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@388
PS3, Line 388: new ArrayList<Expr>()
nit: could be 'Collections.emptyList()'


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql@3265
PS3, Line 3265: TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
              :               'iceberg.catalog_location'='/test-warehouse/iceberg_test/hadoop_catalog',
              :               'iceberg.table_identifier'='functional_parquet.iceberg_partition_evolution');
Since we are writing the table via Impala we probably don't need to set these.


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@990
PS3, Line 990: event_time='2020-01-01 11:00:00';
iceberg_partitioned is partitioned by HOUR(event_time), how can we skip pushing it down?



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Mar 2023 17:06:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12798/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 14 Apr 2023 07:46:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12827/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Apr 2023 09:22:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#14).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 331 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/14
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 1:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@16
PS1, Line 16: filter
Nit: filters.


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17
PS1, Line 17: taking into account
Wouldn't "assuming" be better?


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17
PS1, Line 17: )
Nit: closing parenthesis should come at the end of the sentence.


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20
PS1, Line 20: An additional benefit of not down no-op predicates
Am I right that here we mean pushing down predicates to the scanner, as opposed to pushing it down to Iceberg? We could make this clearer, perhaps also in the title.


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20
PS1, Line 20: not down
not pushing down


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@28
PS1, Line 28: 
We could include a Testing section describing what tests have been added/modified.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2014
PS1, Line 2014: getDerivedExplainString
You could add a comment describing how this method should be used.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64:   private List<Expr> skippedConjuncts;
What is the relationship between 'conjuncts_', 'nonIdentityConjuncts_' and 'skippedConjuncts_'? Is 'conjuncts_' the union of the other two? In this case 'skippedConjuncts_' may not need to be stored (or taken in the constructor) separately.

Or, if the 'skippedConjuncts_' have already been removed from 'conjuncts_', the comment on L56 has become a bit misleading.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: skippedConjuncts
Should end with an underscore: 'skippedConjuncts_'.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@114
PS1, Line 114: canSkipPushingDownIcebergPredicates
Should end with underscore.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@114
PS1, Line 114: canSkipPushingDownIcebergPredicates
Could make it clearer that here we mean pushing predicates down to the scanners, not to Iceberg.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@375
PS1, Line 375:       conjuncts_.removeAll(impalaIcebergPredicates_);
Wouldn't it be possible that some (but not all) conjuncts can be skipped? Can this approach be further refined (in a follow-up commit)?


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@379
PS1, Line 379:   private List<Expr> getSkippedConjuncts() {
Nit: if we forgot to call filterConjuncts() but called filterFileDescriptors(), it's possible that 'conjuncts_' would still contain the elements that getSkippedConjuncts() returns.
If we had a separate member for 'skippedConjuncts_' that would be returned by getSkippedConjuncts() and would be populated in filterConjuncts(), this anomaly would not be possible.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1279
PS1, Line 1279: exercising
Nit: excercising.


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@1
PS1, Line 1: in
Nit: no need for 'in'.


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@2
PS1, Line 2: are
Nit: out.


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@52
PS1, Line 52: action
Did you mean 'event_time'?


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test@51
PS1, Line 51: pre-
Do we need "pre"?


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test@79
PS1, Line 79: group
Nit: could be corrected to "two row groups".


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test@176
PS1, Line 176: w
Nit: capital W.


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test@147
PS1, Line 147: results
Nit: results in ...



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Feb 2023 17:03:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#11).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 341 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/11
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#6).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on partition columns then Iceberg can
filter the files and using these filters in Impala wouldn't filter any
more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 311 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/6
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12704/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Mar 2023 07:29:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12435/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Feb 2023 11:15:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 3:

I've just realized that I forgot to address the comments on the commit message. Let me wait for further comments and will get back to them together.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Mar 2023 12:41:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 8: Code-Review+1

(4 comments)

LGTM!

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@14
PS8, Line 14: partition columns
nit: IDENTITY-partition columns?


http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
And in this example we not just materialize less slots, but we also get count-star optimization.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: 
> 'nonIdentityConjuncts_' is in fact a subset of 'conjuncts_' that doesn't in
Yeah, I think 'nonIdentityConjuncts_' can retire with this change, and it's more precise to use 'conjuncts_', as it is already filtered by 'IcebergScanPlanner.filterConjuncts()'.


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@990
PS3, Line 990: event_time='2020-01-01 11:00:00';
> I had to check, but apparently we push down predicates of an HOUR() partiti
I see, thanks for checking.



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Mar 2023 13:08:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9242/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Apr 2023 10:10:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19534/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2026
PS2, Line 2026:   protected String getDerivedExplainString(String explainLevel, TExplainLevel detailLevel) {
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/19534/2/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/2/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@226
PS2, Line 226:   protected String getDerivedExplainString(String explainLevel, TExplainLevel detailLevel) {
line too long (92 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Feb 2023 12:35:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#7).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on partition columns then Iceberg can
filter the files and using these filters in Impala wouldn't filter any
more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 309 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/7
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Reviewed-on: http://gerrit.cloudera.org:8080/19534
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 331 insertions(+), 75 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 9: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175
PS8, Line 175: h(ge
> Unrelated to this patch and most probably would require fixing tests as wel
Ok, I probably looked at it when I compared different versions with a rebase between them.


http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@503
PS9, Line 503: anyMatch
Nit: instead of negating anyMatch, you could use noneMatch.



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Mar 2023 09:18:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12815/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 Apr 2023 15:40:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 15:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9238/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Apr 2023 12:43:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12488/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Feb 2023 13:01:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#3).

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only
filter on partition columns then Iceberg can filter the files based
on this and using these filter in Impala wouldn't filter any more
rows from the output (taking into account that no partition evolution)
was performed on the table.

An additional benefit of not down no-op predicates is that we can
potentially materialize less slots. For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 307 insertions(+), 68 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/3
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#9).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 313 insertions(+), 113 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/9
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64:   private List<Expr> skippedConjuncts;
> I dropped it then.
Though if we have the following case:

 part_id_col = 42 AND non_part_col < 10

(let's assume table is IDENTITY-partitioned by part_id_col)

Then Iceberg is able to filter out partitions via part_id_col = 42, but we cannot skip evaluating part_id_col = 42 in the scanners because there is still a residual expression (non_part_col < 10).

And now, if 'part_id_col = 42' will still be used to calculate the cardinality estimates, and it will probably cause very bad estimations.

So maybe we can only retire 'nonIdentityConjuncts_' when have a better logic to determine which predicate is part of the Iceberg residual and which isn't.

Sorry for the inconvenience.



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Mar 2023 09:54:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 15: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9238/


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Apr 2023 18:03:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@63
PS12, Line 63: #
nit: did you want to add some comments here?


http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@64
PS12, Line 64: SELECT COUNT(*) FROM functional_parquet.iceberg_partitioned WHERE action = 'click';
Are we sure we get Parquet count(*) optimization here?

Because I see the following plan in such a case:

  Query: explain SELECT COUNT(*) FROM functional_parquet.iceberg_partitioned where true
 +----------------------------------------------------------------------------------+
 | Explain String                                                                   |
 +----------------------------------------------------------------------------------+
 | Max Per-Host Resource Reservation: Memory=8.00KB Threads=2                       |
 | Per-Host Resource Estimates: Memory=10MB                                         |
 | Codegen disabled by planner                                                      |
 |                                                                                  |
 | PLAN-ROOT SINK                                                                   |
 | |                                                                                |
 | 01:AGGREGATE [FINALIZE]                                                          |
 | |  output: sum_init_zero(functional_parquet.iceberg_partitioned.stats: num_rows) |
 | |  row-size=8B cardinality=1                                                     |
 | |                                                                                |
 | 00:SCAN HDFS [functional_parquet.iceberg_partitioned]                            |
 |    HDFS partitions=1/1 files=20 size=22.90KB                                     |
 |    row-size=8B cardinality=20                                                    |
 +----------------------------------------------------------------------------------+

Please note the sum_init_zero aggregate function, plus the row-size of SCAN HDFS is not 0.

I get a similar plan to yours when I run a select count(*) query on a plain text table that doesn't have this optimization, e.g.: explain SELECT COUNT(*) FROM functional.alltypes;



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 14 Apr 2023 13:30:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 13:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@63
PS12, Line 63: #
> nit: did you want to add some comments here?
Done


http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@64
PS12, Line 64: # optimization kicks in for Parquet scanners.
> Are we sure we get Parquet count(*) optimization here?
You're right. When I ran this query from shell it worked as expected and I got a similar profile you sent. However, when running in PlannerTests this went wrong and I got the profile you see in the tests.

It turned out that this is because initially HdfsScanNOde.fileFormats_ contains "PARQUET" and "ICEBERG" and then this is updated to "PARQUET" only. However, the PlannerTests don't do this update step and won't do the count(*) optimization as it's only for single-FF tables.

I made a small adjustment to not store "ICEBERG" in HdfsScanNode.fileFormats_. Apparently, this solved the issue. Hope won't cause any regression but a full teat suite will show.



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 Apr 2023 15:27:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 15: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Apr 2023 12:43:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12701/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Mar 2023 16:09:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 3:

(7 comments)

Thanks Gabor. Could you also address the comments on the commit message?

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1390
PS3, Line 1390: is
'is' is not needed here?


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2027
PS3, Line 2027: explainLevel
'explainLevel' may not be the best name as it could be confused with the verbosity level of the explain query ('detailLevel' here) - we can use "set EXPLAIN_LEVEL=..." in the shell.
'indentPrefix' would be better.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: 
> 'skippedConjuncts_' is already removed from 'conjuncts_' in the Iceberg pla
The comment on 'nonIdentityConjuncts_' says that it is a subset of 'conjuncts_' that does not include the conjuncts pushed down to Iceberg - isn't the latter set the same as 'skippedConjuncts_'?

That's why I thought that the union of 'nonIdentityConjuncts_' and 'skippedConjuncts_' would be the original value of 'conjuncts_'. But then after removing 'skippedConjuncts_' from 'conjuncts_', 'nonIdentityConjuncts_' is not a (proper) subset of 'conjuncts_' but the same as 'conjuncts_'. Is this not correct?


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@115
PS3, Line 115: have
Nit: has


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1279
PS1, Line 1279: exercising
> I googled it, and for me it seems that it's 'exercising' :)
You're right, sorry :)


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@49
PS3, Line 49: could be
Could not be?


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test@51
PS3, Line 51: predicae
Nit: predicate



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 03 Mar 2023 13:28:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................


Patch Set 3:

(15 comments)

Thanks for the review, Daniel! I hope I addresses all of your comments.

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2014
PS1, Line 2014: ort(formats);
> You could add a comment describing how this method should be used.
Sure. I also noticed that if the derived class return a multi-line string then the indentation would be off as I only shift the first line at L1970 when I append to 'detailPrefix'.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: 
> What is the relationship between 'conjuncts_', 'nonIdentityConjuncts_' and 
'skippedConjuncts_' is already removed from 'conjuncts_' in the Iceberg planner. I think the comment on L56 is still valid after introducing this new member.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: 
> Should end with an underscore: 'skippedConjuncts_'.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@114
PS1, Line 114: ther we have to push down 'impalaIc
> Should end with underscore.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@114
PS1, Line 114: ther we have to push down 'impalaIc
> Could make it clearer that here we mean pushing predicates down to the scan
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@375
PS1, Line 375:           "Failed to load data files for Iceberg table: %s", getIceTable().getFullName()),
> Wouldn't it be possible that some (but not all) conjuncts can be skipped? C
Good obervation. It's possible that a subset of 'icebergPredicates_' could be omitted from pushing down in Impala. This is planned in a future patch as it's not 100% trivial how to match between an Iceberg predicate (returned by the residual() function call) and Impala's predicates. At first also it's worth running a benchmark to see if we gain noticeable performance with this.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@379
PS1, Line 379:   }
> Nit: if we forgot to call filterConjuncts() but called filterFileDescriptor
Well, in that case our Planner tests would start failing because some predicates would be present both in the 'predicates' and 'skipped Iceberg predicates' section of the explain output.
With this approach we either push down all the 'Iceberg' predicates to the scanners or we don't push any of them so introducing another member that would be identical to 'impalaIcebergPredicates_' seems overkill to me. In the second go where we might push down a subset of these predicates we might want to introduce what you suggest.
But I wouldn't worry about forgetting to call filterConjuncts() because the PlannerTests would catch it.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@1279
PS1, Line 1279: exercising
> Nit: excercising.
I googled it, and for me it seems that it's 'exercising' :)


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@1
PS1, Line 1: bo
> Nit: no need for 'in'.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@2
PS1, Line 2:  th
> Nit: out.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@52
PS1, Line 52: 
> Did you mean 'event_time'?
Well, this test case is not what I intended. I re-wrote it a bit so that we can have 2 filters, both of them on partition columns where one is identity and the other is truncate partitioned.


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test@51
PS1, Line 51: pred
> Do we need "pre"?
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test@79
PS1, Line 79: group
> Nit: could be corrected to "two row groups".
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test@176
PS1, Line 176: W
> Nit: capital W.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test:

http://gerrit.cloudera.org:8080/#/c/19534/1/testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test@147
PS1, Line 147: results
> Nit: results in ...
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Feb 2023 12:40:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#2).

Change subject: IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only
filter on partition columns then Iceberg can filter the files based
on this and using these filter in Impala wouldn't filter any more
rows from the output (taking into account that no partition evolution)
was performed on the table.

An additional benefit of not down no-op predicates is that we can
potentially materialize less slots. For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 305 insertions(+), 68 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/2
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12700/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Mar 2023 15:57:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 8: Code-Review+1

(2 comments)

Looks good to me.

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175
PS8, Line 175: file
We could write "file(s)".


http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@371
PS8, Line 371: file
See L175.



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Mar 2023 13:01:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 14: Code-Review+2

LGTM!


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Apr 2023 12:18:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#12).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 341 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/12
-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 16: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Apr 2023 10:10:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@14
PS8, Line 14: IDENTITY-partitio
> nit: IDENTITY-partition columns?
Done


http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
> And in this example we not just materialize less slots, but we also get cou
That count * optimization to get rid of the SCAN nodes when we eliminate all the predicates by this change is coming in "Part2"


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64:     super(id, tblRef.getDesc(), conjuncts,
> Yeah, I think 'nonIdentityConjuncts_' can retire with this change, and it's
I dropped it then.


http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175
PS8, Line 175: h(ge
> We could write "file(s)".
Unrelated to this patch and most probably would require fixing tests as well. We can do this in a separate ticket.


http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@371
PS8, Line 371: ses;
> See L175.
see my above comment.



-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Mar 2023 14:36:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/12716/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Mar 2023 14:15:13 +0000
Gerrit-HasComments: No