You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Riza Suminto (Code Review)" <ge...@cloudera.org> on 2022/03/16 22:50:11 UTC

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Riza Suminto has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18327


Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(count) star optimization is disabled for the ACID table
because the scanner might need to read and validate the
'currentTransaction' column. This patch also drops 'parquet' from names
related to the count star optimization.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
12 files changed, 626 insertions(+), 47 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 1
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8014/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 04:25:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10341/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 06:02:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10298/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 1
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Mar 2022 23:09:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@279
PS2, Line 279:     if (vector.get_value('table_format').file_format != 'text' or
             :           vector.get_value('table_format').compression_codec != 'none'):
> This test constraint, along with parquet and kudu ones, is a bit misleading
Looking again, the core exploration of this test only have single 'text/none' format dimension (create_uncompressed_text_dimension in line 183). Therefore, this misleading constraint is needed to execute the test just two times (codegen vs non-codegen), even in exhaustive exploration.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Mar 2022 18:12:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/18327/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/2//COMMIT_MSG@18
PS2, Line 18: 'currentTransaction' column. This patch also drops 'parquet' from names
nit. in table's special schema.


http://gerrit.cloudera.org:8080/#/c/18327/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/2/be/src/exec/hdfs-orc-scanner.cc@797
PS2, Line 797: int
uint64_t?


http://gerrit.cloudera.org:8080/#/c/18327/2/be/src/exec/hdfs-orc-scanner.cc@804
PS2, Line 804:  TupleRow* dst_row = row_batch->GetRow(row_batch->AddRow());
I wonder if we can compute one tuple only for all the stripes available from reader_ by adding all row count stats together. In this way, we can avoid allocating multiple tuples. 

If I read the new code correctly, the current logic produces one tuple for each stripe.


http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@355
PS2, Line 355: isAcidTable
nit. missing suffix _.


http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@400
PS2, Line 400: Parquet
nit. remove?


http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@406
PS2, Line 406: (!hasOrc(fileFormats) || isAcidTable)
nit. This implies the acid table property is not checked against parquet table.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Mar 2022 13:51:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#9).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
25 files changed, 916 insertions(+), 218 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/9
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10389/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 11
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 03 Apr 2022 23:35:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 14:

Patch set 14 modify test_file_parser.py to allow multiple format in RUNTIME_PROFILE section and reduce duplicate lines.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 14
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 00:13:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10390/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 04:36:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 10: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18327/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test:

http://gerrit.cloudera.org:8080/#/c/18327/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test@91
PS10, Line 91: aggregation(SUM, NumFileMetadataRead): 6
here and at other test files:
Can you also keep aggregation(SUM, NumRowGroups): 0
That would clearly check that the optimization works (otherwise I don't get the benefit of checking NumFileMetadataRead, it is just the number of files written by the insert)



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 10
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 02 Apr 2022 08:25:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18327/1/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/1/tests/query_test/test_aggregation.py@280
PS1, Line 280:  
flake8: E129 visually indented line with same indent as next logical line



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 1
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Mar 2022 22:51:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 3:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG@16
PS1, Line 16: CID table
> for full ACID tables
Done


http://gerrit.cloudera.org:8080/#/c/18327/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/2//COMMIT_MSG@18
PS2, Line 18: 'currentTransaction' column in table's special schema.
> nit. in table's special schema.
Done


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@772
PS1, Line 772: Status HdfsOrcScanner::GetNextWithCountStarOptimization(RowBatch* row_batch) {
> This method is more than 100 lines now. Could you extract the count-star co
Done


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@786
PS1, Line 786:   TupleRow* dst_row = row_batch->GetRow(row_batch->AddRow());
             :   dst_row->SetTuple(0, dst_tuple);
> Each scanner thread processes a split which could be part of the file. Only
Patch set 3 serve count(star) just from footer scan range.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@798
PS1, Line 798:   // selected field from the file (in
> Should I just drop the timer, or move it up to cover ResizeAndAllocateTuple
I decide to drop it for optimized count(star) code path.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@818
PS1, Line 818:       DCHECK_LT(stripe_rows_read_, file_rows);
             :       int64_t rows_remaining = file_rows - 
> This is a special case that only the scanner of the footer split will proce
Yes, this stay the same in patch set 3. We still serve it just from footer scan range.


http://gerrit.cloudera.org:8080/#/c/18327/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/2/be/src/exec/hdfs-orc-scanner.cc@797
PS2, Line 797: che
> uint64_t?
Done


http://gerrit.cloudera.org:8080/#/c/18327/2/be/src/exec/hdfs-orc-scanner.cc@804
PS2, Line 804: / This is an optimized count(*) case.
> I wonder if we can compute one tuple only for all the stripes available fro
Patch set 3 serve count(star) just from footer scan range (1 row per file instead of per stripe).


http://gerrit.cloudera.org:8080/#/c/18327/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/18327/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@334
PS1, Line 334: isFullAcidT
> I think that it should contain "full" to make the semantics clearer.
Done


http://gerrit.cloudera.org:8080/#/c/18327/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@405
PS1, Line 405:     if (fileFormats.size() != 1) return false;
> Not sure if this worth the effort, but mixed ORC / Parquet tables could use
Technically, we should be able to. But in practice, we will not have such situation, right?
It will require different SerDe to handle both file format simlutaneously, and I have not see Impala deal with mixed file format before.
Please correct me if I'm wrong. Otherwise, I prefer to keep the old behavior.


http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@355
PS2, Line 355: isFullAcidT
> nit. missing suffix _.
Done


http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@400
PS2, Line 400: count(*
> nit. remove?
Done


http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@406
PS2, Line 406: se;
> nit. This implies the acid table property is not checked against parquet ta
Currently full acid table only available with ORC. But it looks like Parquet can support full acid schema too in the future:
https://issues.apache.org/jira/browse/HIVE-8123
I updated this logic in patch set 3.


http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@4
PS2, Line 4: functional_orc_def.uncomp_src_alltypes
> This table is a managed table. I'm not sure why its table properties don't 
This table follow schema from functional.alltypes, but without "transactional=true" property.
https://github.com/apache/impala/blob/c10e951/testdata/datasets/functional/functional_schema_template.sql#L2559
Thus, it is a managed ORC, but not full acid.

I think it is okay to use this table. This will align with parquet-stats-agg.test that use table derived from functional.alltypes as well and exercise similar set of test cases.


http://gerrit.cloudera.org:8080/#/c/18327/1/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/1/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@100
PS1, Line 100: =====
> For Parquet we also have a multi block test:
Yes, this is exercised in TestOrc::test_misaligned_orc_stripes for ORC.
Added count(star) test there as well.


http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@279
PS2, Line 279:     if (vector.get_value('table_format').file_format != 'text' or
             :           vector.get_value('table_format').compression_codec != 'none'):
> Shouldn't we only run this on ORC? It seems the other tests for parquet and
This test constraint, along with parquet and kudu ones, is a bit misleading. It constraint to 'text/none' table format only, but actually exercise the orc/parquet/kudu testcase.
We should probably fix the constraint for these tests to better reflect what they're testing against.


http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@283
PS2, Line 283: 
> It seems we don't need 'unique_database' since all tables we used are in fu
Dropped the 'unique_database' for test_orc_count_star_optimization and test_kudu_count_star_optimization.
test_parquet_count_star_optimization still need 'unique_database' since it has custom test case that creates and insert new table:
https://github.com/apache/impala/blob/408b0aa/testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test#L119-L125



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Mar 2022 17:56:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18327/7/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/7/be/src/exec/hdfs-orc-scanner.cc@807
PS7, Line 807: "select 
> nit: Could you change this to "select 1"?
Done


http://gerrit.cloudera.org:8080/#/c/18327/7/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/18327/7/tests/custom_cluster/test_query_retries.py@81
PS7, Line 81:  w
> nit: double spaces
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 01:22:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18327/9/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/9/be/src/exec/hdfs-columnar-scanner.cc@319
PS9, Line 319:   *dst_slot = num_rows;
> I think that we can remove this, there is no real need to be backward compa
Replaced it with NumFileMetadataRead.


http://gerrit.cloudera.org:8080/#/c/18327/9/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/9/tests/query_test/test_aggregation.py@260
PS9, Line 260: test_ndv(self):
> I agree with moving count_star_optimization tests to a different class.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 10
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Apr 2022 21:22:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 12:

> Patch Set 12: Verified-1
> 
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8014/

This failed some tests added by IMPALA-5861, which specifically address previous issue about inaccurate "RowsRead" counter over zero slot scan. Thinking to replace them with "RowsReturned".


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 15:57:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#15).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M testdata/workloads/functional-query/queries/QueryTest/mixed-format.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters_mt_dop.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
31 files changed, 1,046 insertions(+), 256 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/15
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 15
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18327/14/tests/util/test_file_parser.py
File tests/util/test_file_parser.py:

http://gerrit.cloudera.org:8080/#/c/18327/14/tests/util/test_file_parser.py@264
PS14, Line 264: ,
flake8: W602 deprecated form of raising exception



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 14
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 00:11:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 3:

(1 comment)

One more thing to check. 

Sorry I did not mention it yesterday.

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@82
PS3, Line 82: cardinality=4.39K
We may need to compute the cardinality differently for count star optimization since the actual value should be # of files. Please refer to HfdsScanNode.computeCardinalities(). 

With the modified cardinality, a serial plan should be sufficient even for a very large table.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Mar 2022 18:32:37 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10384/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 10
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Apr 2022 21:39:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#11).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
25 files changed, 980 insertions(+), 242 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/11
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 11
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10395/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 13
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 22:44:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 16: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 16
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 13:27:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#6).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation by serving the
result just from the file's footer stats for both Parquet and ORC.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
22 files changed, 761 insertions(+), 95 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/6
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#5).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation by serving the
result just from the file's footer stats for both Parquet and ORC.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
21 files changed, 743 insertions(+), 106 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/5
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 5:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.h@435
PS3, Line 435: 
> nit. Handle count(*) queries by reading the row count
Done


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc@780
PS3, Line 780:   *dst_slot = num_rows;
> Can you unify this with the Parquet implementation? If I didn't miss someth
Done


http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc@776
PS4, Line 776:   DCHECK_LE(stripe_rows_read_, num_rows);
> Could you comment that only the scanner of the footer split will run in thi
Added the comment in GetNextInternal. Reading it again later and realized it is still a little bit confusing.
Might need to reorder the branches.


http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc@806
PS4, Line 806: can_node_->IsZeroSlotTableScan()) {
> I think count(*) won't go here now. If there are any conjuncts for the coun
We can still get here for count(*) against full acid table. Added example in the comment.


http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc@807
PS4, Line 807:       // This is an unoptimized count(*) case.
> Could you comment that only the scanner of the footer split will run in thi
Added the comment in GetNextInternal.


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc@846
PS3, Line 846: !scan_node->ReadsFileMetadataOnly() || footer_split == split) {
> Having a function like "readsOnlyMetadata()" could be more descriptive.
Done


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc@847
PS3, Line 847:         ScanRangeMetadata* split_metadata =
> Shouldn't we also check row_batches_need_validation_ like in HdfsOrcScanner
We don't need to. The reason is this logic is a more like a prediction rather than a guarantee.
row_batches_need_validation_ can not be concluded until we read the file. Therefore, it is possible that footer scanner fallback to full scan behavior even when we thought we'll read file metadata only here.
Added related logging in hdfs-orc-scanner.cc.


http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@82
PS3, Line 82: cardinality=24
> We may need to compute the cardinality differently for count star optimizat
Modified the cardinality count in HdfsScanNode.java.
However, the PlannerTest does not seem to validate the cardinality part of the planner. Changed them anyway in patch set 5.


http://gerrit.cloudera.org:8080/#/c/18327/4/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/4/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@5
PS4, Line 5: from functional_orc_def.uncomp_src_alltypes
> Could you add a test to cover the old optimization (ie. zero slot table sca
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 26 Mar 2022 18:43:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc@776
PS4, Line 776:   int capacity = 1;
> Added the comment in GetNextInternal. Reading it again later and realized i
Fixed the comments and reorder the branches


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc@847
PS3, Line 847:           || footer_split == split) {
> I'm having a second thought here. I think row_batches_need_validation_ can 
Added RequireRowValidation check in ReadsFileMetadataOnly.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 04:43:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 3: Code-Review+1

(4 comments)

Looks very good!

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.h@435
PS3, Line 435: Handles count(*) queries, writing only 
nit. Handle count(*) queries by reading the row count


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc@806
PS3, Line 806: GetNextWithCountStarOptimization
Very cool!


http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/18327/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@406
PS2, Line 406: se;
> Currently full acid table only available with ORC. But it looks like Parque
Done


http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test:

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test@22
PS3, Line 22: aggregation(SUM, NumRowGroups): 0
nit. Can you please explain why there is the change?



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Mar 2022 22:37:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 7:

Patch set 7 fix clang-tidy errors.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 05:44:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#8).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation by serving the
result just from the file's footer stats for both Parquet and ORC.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
22 files changed, 767 insertions(+), 101 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/8
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 12:

patch set 12 is a rebase


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 04:18:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 9: Code-Review+1

(2 comments)

Thanks for merging more of the ORC and Parquet code!

http://gerrit.cloudera.org:8080/#/c/18327/9/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/9/be/src/exec/hdfs-columnar-scanner.cc@319
PS9, Line 319:   COUNTER_ADD(scan_node_->rows_read_counter(), num_rows);
> We might want to remove this counter increment to avoid confusion.
I think that we can remove this, there is no real need to be backward compatible in profile IMO

To express that we have read the row group / stripe metadata, we could have counters like NumRowGroupsMetadataRead - seeing these new NumRowGroups would express it clearly that reading the metadata was enough for row groups.


http://gerrit.cloudera.org:8080/#/c/18327/9/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/9/tests/query_test/test_aggregation.py@260
PS9, Line 260: test_parquet_count_star_optimization
I agree with moving count_star_optimization tests to a different class.

> Some are also slow, I suspect because they need to create unique_database first, only to be skipped later.

I never thought about this - I can imagine unique_database creation becoming a bottleneck in some cases during test runs, as it needs extra insert+delete in the database behind HMS. Moving a bunch of pytest.skip() to @SkipIf could be a great ramp up task.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Apr 2022 12:40:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#3).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation by serving the
result just from the file's footer stats for both Parquet and ORC.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
15 files changed, 648 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/3
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10302/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Mar 2022 16:42:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#2).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(count) star optimization is disabled for the ACID table
because the scanner might need to read and validate the
'currentTransaction' column. This patch also drops 'parquet' from names
related to the count star optimization.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
12 files changed, 626 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 4:

(1 comment)

Patch set 4 is a rebase

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test:

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test@22
PS3, Line 22: aggregation(SUM, NumRowGroups): 0
> nit. Can you please explain why there is the change?
Previously, we are still iterating row groups and read stripe stats to serve count(star).
Patch set 3 change this by reading footer stats only. Hence, NumRowGroups now becomes 0 because we're iterating none of it.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 25 Mar 2022 03:57:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 7: Code-Review+1

(2 comments)

LGTM.

http://gerrit.cloudera.org:8080/#/c/18327/7/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/7/be/src/exec/hdfs-orc-scanner.cc@807
PS7, Line 807: count(*)
nit: Could you change this to "select 1"?


http://gerrit.cloudera.org:8080/#/c/18327/7/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/18327/7/tests/custom_cluster/test_query_retries.py@81
PS7, Line 81:   
nit: double spaces



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 09:04:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 11:

Hi Riza, could you resolve the merge conflict?


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 11
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 02:54:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 12: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 04:25:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 15: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 15
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 08:50:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8023/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 16
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 08:56:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@772
PS1, Line 772: Status HdfsOrcScanner::GetNextInternal(RowBatch* row_batch) {
This method is more than 100 lines now. Could you extract the count-star code into a separated method?


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@786
PS1, Line 786:       // We try to allocate a smaller row batch here because in most cases the number
             :       // stripes in a file is much lower than the default row batch capacity.
> I tried to do one row per file instead of per stripe, but failed the follow
Each scanner thread processes a split which could be part of the file. Only stripes whose mid point locates inside the split will be processed by the scanner (picked in NextStripe()):
https://github.com/apache/impala/blob/1739edf2d97009062cb339a3c276f01dbd4d33bd/be/src/exec/hdfs-orc-scanner.cc#L903-L910

I think the missing part of returning one row per file is here:
https://github.com/apache/impala/blob/1739edf2d97009062cb339a3c276f01dbd4d33bd/be/src/exec/hdfs-scanner.cc#L841-L846
The same as the IsZeroSlotTableScan() case, we just need the scanner of the footer split.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@818
PS1, Line 818:       uint64_t file_rows = reader_->getNumberOfRows();
             :       if (stripe_rows_read_ == file_rows) {
> We didn't change this logic, but it seems a bit suspicious to me: will thes
This is a special case that only the scanner of the footer split will process the whole file:
https://github.com/apache/impala/blob/1739edf2d97009062cb339a3c276f01dbd4d33bd/be/src/exec/hdfs-scanner.cc#L841-L846


http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@4
PS2, Line 4: functional_orc_def.uncomp_src_alltypes
This table is a managed table. I'm not sure why its table properties don't have the transaction settings. But I think we need another table that is non-transactional, e.g. in tpch_orc_def.


http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@279
PS2, Line 279:     if (vector.get_value('table_format').file_format != 'text' or
             :           vector.get_value('table_format').compression_codec != 'none'):
Shouldn't we only run this on ORC? It seems the other tests for parquet and kudu also run on other formats, which I think is redudant.


http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@283
PS2, Line 283: unique_database
It seems we don't need 'unique_database' since all tables we used are in functional_orc_def.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 19 Mar 2022 12:53:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG@16
PS1, Line 16: ACID table
for full ACID tables


http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG@18
PS1, Line 18: 'currentTransaction' column. 
If there is no delete file in an ACID table, then we should be able to still use the optimization, right? I am ok with not doing it in this change, but a ticket could be created to improve (full) ACID performance. My understanding is that FULL acid is probably the biggest reason why someone would use Impala with ORC.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@786
PS1, Line 786:       // We try to allocate a smaller row batch here because in most cases the number
             :       // stripes in a file is much lower than the default row batch capacity.
Why do we add one row per stripe? We could add them together and always add a single row.
This would mean that this function will be only called once per split, so we can simply set eos_ to true before returning.

(I also don't get why the Parquet scanner creates multiple rows)


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@798
PS1, Line 798:         assemble_rows_timer_.Start();
I think that starting the timer is not too useful here as this logic should take only minimal time.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@818
PS1, Line 818:       uint64_t file_rows = reader_->getNumberOfRows();
             :       if (stripe_rows_read_ == file_rows) {
We didn't change this logic, but it seems a bit suspicious to me: will these work if the ORC file has multiple HDFS block, so multiple scanners will process the same file?

stripe_rows_read_ doesn't seem to be increased when we skip a stripe in NextStripe(), so my guess is that  stripe_rows_read_ will never reach file_rows in multi block ORC files.


http://gerrit.cloudera.org:8080/#/c/18327/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/18327/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@334
PS1, Line 334: isAcidTable
I think that it should contain "full" to make the semantics clearer.
+ member variables should end with "_"


http://gerrit.cloudera.org:8080/#/c/18327/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@405
PS1, Line 405:     if (fileFormats.size() != 1) return false;
Not sure if this worth the effort, but mixed ORC / Parquet tables could use count star optimization, right?


http://gerrit.cloudera.org:8080/#/c/18327/1/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/1/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@100
PS1, Line 100: =====
For Parquet we also have a multi block test:
https://github.com/apache/impala/blob/master/testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test#L107

I don't know whether we have multiblock tables available for ORC, but it would be generally great to test those too with count(star)



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Mar 2022 16:22:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#4).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation by serving the
result just from the file's footer stats for both Parquet and ORC.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
15 files changed, 647 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/4
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 5:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/10339/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 26 Mar 2022 18:48:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#7).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

IMPALA-5036 added optimization for count(star) in Parquet scans that
avoid materializing dummy rows. This change provides similar
optimization for ORC tables. We use the stripes num rows statistics when
computing the count star instead of materializing empty rows. The
aggregate function changed from a count to a special sum function
initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation by serving the
result just from the file's footer stats for both Parquet and ORC.

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
22 files changed, 766 insertions(+), 100 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/7
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@181
PS8, Line 181: 07K
             : ====
> I think it is genuine that count star optimization is not applied for for b
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 31 Mar 2022 21:21:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 9:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@9
PS8, Line 9: This patch provides count(star) optimization for ORC scans, similar to
           : the work done in IMPALA-5036 fo
> nit. This patch provides count(star) optimization for ORC scans, similar to
Done


http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@23
PS8, Line 23: into HdfsColumnarScanner.
> can you add some benchmark results? I think that the difference should be v
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc@807
PS8, Line 807: 
             :   // Process next stripe if current stripe is drained. Each stripe will generate several
             :   // orc batches. We only advance the stripe after processing the last batch.
             :   // 'advance_stripe_' is updated in 'NextStripe', meaning the current stripe we advance
             :   // to can be skip. 'end_of_stripe_' marks whether current stripe is drained. It's only
             :   // set to true in 'AssembleRows'.
             :   while (advance_group_ || end_of_stripe_) {
             :     // The next stripe will use a new dictionary blob so transfer the memory to row_batch.
             :     row_batch->tuple_data_pool()->AcquireData(dictionary_pool_.get(), false);
             :     context_->ReleaseCompletedResources(/* done */ true);
             :     // Commit the rows to flush the row batch from the previous stripe.
             :     RETURN_IF_ERROR(CommitRows(0, row_batch));
             : 
             :     RETURN_IF_ERROR(NextStripe());
             :     DCHECK_LE(group_idx_, reader_->getNumberOfStripes());
             :     if (group_idx_ == reader_->getNumberOfStripes()) {
             :       eos_ = true;
             :       DCHECK(parse_status_.ok());
             :       return Status::OK();
             :     }
             :   }
             : 
             :   // Apply any runtime filters to static tuples containing the partition keys for this
             :   // partition. If any filter fails, we return immediately and stop processing this
             :   /
> It would be also nice to unify this with the Parquet version - my understan
Done. I decide to name it 'rows_read_in_group_'.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@3
PS8, Line 3: # a text table, so the optimization is not applied. The optimization is observed when
> nit. May add a comment:
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@126
PS8, Line 126: 
> nit. is
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@168
PS8, Line 168: 
> nit. it can not be applied to the 1st aggregate function.
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@181
PS8, Line 181: 07K
             : ====
> nit. The optimization does apply to the inner count(*).
I think it is genuine that count star optimization is not applied for for both 'count(*)' in this case.
The cardinality=24 of "00:SCAN HDFS" is coincidence from other optimization and does not come from this patch.
I confirmed this by running this query in build without this patch.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@277
PS8, Line 277: 
> nit. all predicates are on partition columns only.
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@386
PS8, Line 386:    row-size=0B cardinality=13.07K
> nit. in general.
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@9
PS8, Line 9: bigint
> Don't we have a similar counter as for Parquet in the profile?
Done. Added profile counter verification.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 31 Mar 2022 18:53:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18327/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test:

http://gerrit.cloudera.org:8080/#/c/18327/10/testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test@91
PS10, Line 91: aggregation(SUM, NumRowGroups): 0
> here and at other test files:
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 11
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 03 Apr 2022 23:15:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#10).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
25 files changed, 975 insertions(+), 242 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/10
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 10
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 15:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10397/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 15
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 01:42:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 6:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/10340/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 04:59:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 4:

(6 comments)

The patch looks pretty good now!

http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc@776
PS4, Line 776:   int64_t num_rows = static_cast<int64_t>(reader_->getNumberOfRows());
Could you comment that only the scanner of the footer split will run in this case? Also mention we have the special logics in HdfsScanner::IssueFooterRanges().


http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc@806
PS4, Line 806: This is an unoptimized count(*) case.
I think count(*) won't go here now. If there are any conjuncts for the count(*), we will need to materialize some slots thus it's not a zero slot table scan. I think we should use the comment at line 796 (move it here) and change the example to "select 1" over the table.


http://gerrit.cloudera.org:8080/#/c/18327/4/be/src/exec/hdfs-orc-scanner.cc@807
PS4, Line 807:       // Insert 'num_to_commit' template tuples into 'row_batch'.
Could you comment that only the scanner of the footer split will run in this case? Also mention we have the special logics in HdfsScanner::IssueFooterRanges().


http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/2/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@4
PS2, Line 4: functional_orc_def.uncomp_src_alltypes
> This table follow schema from functional.alltypes, but without "transaction
I see. I thought managed tables can only be transactional but that's wrong. Double checked that the file schema is non-transactional. Thanks for the explanation!


http://gerrit.cloudera.org:8080/#/c/18327/4/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/4/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@5
PS4, Line 5: from functional_orc_def.uncomp_src_alltypes
Could you add a test to cover the old optimization (ie. zero slot table scan)? E.g.

  select 1 from functional_orc_def.alltypestiny


http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/2/tests/query_test/test_aggregation.py@279
PS2, Line 279:     if (vector.get_value('table_format').file_format != 'text' or
             :           vector.get_value('table_format').compression_codec != 'none'):
> Looking again, the core exploration of this test only have single 'text/non
Thanks for looking into this! I think it worths a comment here to save time of other developers.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 26 Mar 2022 09:52:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10350/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 01:41:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@23
PS8, Line 23: 
can you add some benchmark results? I think that the difference should be visible in queries like count(*) from lineitem

Parquet scanning may also become faster due to smaller estimates leading to skipping codegen


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc@780
PS3, Line 780:   int64_t num_rows = GetNumberOfRowsInFile();
> Done
Most of the logic still seems to be duplicated - this function seems to be nearly the same as the corresponding logic in Parquet.


http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc@807
PS8, Line 807:     // There are no materialized slots, e.g. "select 1" over the table.  We can serve
             :     // this query from just the file metadata.  We don't need to read the column data.
             :     // Only scanner of the footer split will run in this case. See the logic in
             :     // HdfsScanner::IssueFooterRanges() and HdfsScanNodeBase::ReadsFileMetadataOnly().
             :     // We might also get here for count(*) query against full acid table such as:
             :     // "select count(*) from functional_orc_def.alltypes;"
             :     DCHECK(is_footer_scanner_);
             :     int64_t file_rows = GetNumberOfRowsInFile();
             :     if (stripe_rows_read_ == file_rows) {
             :       eos_ = true;
             :       return Status::OK();
             :     }
             :     assemble_rows_timer_.Start();
             :     DCHECK_LT(stripe_rows_read_, file_rows);
             :     int64_t rows_remaining = file_rows - stripe_rows_read_;
             :     int max_tuples = min<int64_t>(row_batch->capacity(), rows_remaining);
             :     TupleRow* current_row = row_batch->GetRow(row_batch->AddRow());
             :     int num_to_commit = WriteTemplateTuples(current_row, max_tuples);
             :     Status status = CommitRows(num_to_commit, row_batch);
             :     assemble_rows_timer_.Stop();
             :     RETURN_IF_ERROR(status);
             :     stripe_rows_read_ += max_tuples;
             :     COUNTER_ADD(scan_node_->rows_read_counter(), num_to_commit);
             :     return Status::OK();
             :   }
It would be also nice to unify this with the Parquet version - my understanding is that the only difference is using stripe_rows_read_ vs row_group_rows_read_. A shared member could be used like row_group_or_stripe_rows_read_ in HdfsColumnarScanner


http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/3/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@82
PS3, Line 82: cardinality=24
> Modified the cardinality count in HdfsScanNode.java.
Can you mention this as an "addittional change" in the commit message? Currently it suggests that only ORC behavior was changed.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@9
PS8, Line 9: bigint
Don't we have a similar counter as for Parquet in the profile?
---- RUNTIME_PROFILE
aggregation(SUM, NumRowGroups): 0



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 12:15:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10372/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 31 Mar 2022 19:05:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 9:

(2 comments)

There are two place that I'm still unsure about:

http://gerrit.cloudera.org:8080/#/c/18327/9/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/9/be/src/exec/hdfs-columnar-scanner.cc@319
PS9, Line 319:   COUNTER_ADD(scan_node_->rows_read_counter(), num_rows);
We might want to remove this counter increment to avoid confusion.
This counter imply that we're iterating rows when in fact we're not.
Similarly as the one in GetNextWithTemplateTuple().


http://gerrit.cloudera.org:8080/#/c/18327/9/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/18327/9/tests/query_test/test_aggregation.py@260
PS9, Line 260: test_parquet_count_star_optimization
I think I want to take this test along with test_kudu_count_star_optimization, test_orc_count_star_optimization, and test_sampled_ndv into a standalone class.
The reason is, they are really intended to just run once, and currently we enforce that by calling pytest.skip().
If I run TestAggregationQueries with exhaustive exploration, I will see several lines of skipped tests. Some are also slow, I suspect because they need to create unique_database first, only to be skipped later.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Apr 2022 03:20:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10324/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 22 Mar 2022 17:55:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-orc-scanner.cc@780
PS3, Line 780:   int64_t num_rows = static_cast<int64_t>(reader_->getNumberOfRows());
Can you unify this with the Parquet implementation? If I didn't miss something then the only difference is in getting the number of rows - there could be a virtual function like HdfsColumnarScanner::GetNumberOfRowsInFile()


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc@846
PS3, Line 846: !(scan_node->IsZeroSlotTableScan() || scan_node->optimize_count_star())
Having a function like "readsOnlyMetadata()" could be more descriptive.


http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc@847
PS3, Line 847:           || footer_split == split) {
Shouldn't we also check row_batches_need_validation_ like in HdfsOrcScanner::GetNextInternal?



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Mar 2022 15:01:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10333/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 25 Mar 2022 04:15:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10396/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 14
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 00:30:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 2:

(3 comments)

Thanks for the review, Csaba! I'll work on them.
Replying some that I remember now.

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/1//COMMIT_MSG@18
PS1, Line 18: 'currentTransaction' column. 
> If there is no delete file in an ACID table, then we should be able to stil
I honestly haven't dig deep into this. My purpose is just to avoid optimization when row_batches_need_validation_ == true
https://github.com/apache/impala/blob/4aeffe7/be/src/exec/hdfs-orc-scanner.cc#L385-L396

It looks like we need to check that all files of the table is already compacted, and I'm not sure if we can check that in front end.
I'm leaning towards filing separate follow-up JIRA since it require us to craft compacted vs uncompacted ACID table case.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@786
PS1, Line 786:       // We try to allocate a smaller row batch here because in most cases the number
             :       // stripes in a file is much lower than the default row batch capacity.
> Why do we add one row per stripe? We could add them together and always add
I tried to do one row per file instead of per stripe, but failed the following test

impala-py.test --table_formats "orc/def/block" query_test/test_scanners.py::TestTpchScanRangeLengths::test_tpch_scan_ranges

I think that test scenario when one scanner is assigned only parts of the file?  Some scanner share a file but work on different stripes?
Therefore, I fallback to how parquet scanner do this, which is one row per row group.


http://gerrit.cloudera.org:8080/#/c/18327/1/be/src/exec/hdfs-orc-scanner.cc@798
PS1, Line 798:         assemble_rows_timer_.Start();
> I think that starting the timer is not too useful here as this logic should
Should I just drop the timer, or move it up to cover ResizeAndAllocateTupleBuffer as well?



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Mar 2022 16:54:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 8: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 02:06:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/3/be/src/exec/hdfs-scanner.cc@847
PS3, Line 847:         ScanRangeMetadata* split_metadata =
> We don't need to. The reason is this logic is a more like a prediction rath
I'm having a second thought here. I think row_batches_need_validation_ can and should be checked early here. I'll look around some more.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 26 Mar 2022 20:23:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 8: Code-Review+1

(7 comments)

Looks great!

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@9
PS8, Line 9: IMPALA-5036 added optimization for count(star) in Parquet scans that
           : avoid materializing dummy rows.
nit. This patch provides count(star) optimization for ORC scans, similar to the work done in IMPALA-5036 for Parquet scans.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@3
PS8, Line 3: # a text table, so the optimization is not applied.
nit. May add a comment:

The optimization is observed when the cardinality of the ORC scan (24) is the same as the # of files (24).


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@126
PS8, Line 126: should be
nit. is


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@168
PS8, Line 168: there are two aggregate functions
nit. it can not be applied to the 1st aggregate function.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@181
PS8, Line 181: because the inner count(*) is not materialized. The outer
             : # count(*) does not reference a base table.
nit. The optimization does apply to the inner count(*).


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@277
PS8, Line 277: there
nit. all predicates are on partition columns only.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@386
PS8, Line 386: # Optimization is not applied when there is a distinct agg.
nit. in general.



-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 30 Mar 2022 18:19:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8014/


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 08:54:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 16: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 16
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Apr 2022 08:56:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#12).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
25 files changed, 980 insertions(+), 242 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/12
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#13).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M testdata/workloads/functional-query/queries/QueryTest/mixed-format.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters_mt_dop.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
31 files changed, 1,082 insertions(+), 249 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/13
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 13
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 13:

Patch set 13 address test breakage due to 'RowsRead' counter assertion.


-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 13
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Apr 2022 22:30:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18327

to look at the new patch set (#14).

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M testdata/workloads/functional-query/queries/QueryTest/mixed-format.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters_mt_dop.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
31 files changed, 1,045 insertions(+), 255 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/18327/14
-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 14
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Optimize count(star) for ORC scans

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes num
rows statistics when computing the count star instead of materializing
empty rows. The aggregate function changed from a count to a special sum
function initialized to 0.

This count(star) optimization is disabled for the full ACID table
because the scanner might need to read and validate the
'currentTransaction' column in table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. primitive_count_star query target tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
query is a modified primitive_count_star query that targets
tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Reviewed-on: http://gerrit.cloudera.org:8080/18327
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M testdata/workloads/functional-query/queries/QueryTest/mixed-format.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters_mt_dop.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/custom_cluster/test_executor_groups.py
M tests/custom_cluster/test_query_retries.py
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
31 files changed, 1,046 insertions(+), 256 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 17
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>