You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Riza Suminto (Code Review)" <ge...@cloudera.org> on 2023/05/24 17:35:56 UTC

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Riza Suminto has uploaded this change for review. ( http://gerrit.cloudera.org:8080/19927


Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................

IMPALA-11123: Reimplement ORC optimized count star

Commit 7ca20b3c94b1c9c1ddd4ed1e89f0969a0df55330 revert the original
optimized count(star) for ORC scan from commit
f932d78ad0a30e322d59fc39072f710f889d2135 (gerrit review
http://gerrit.cloudera.org:8080/18327). The revert is necessary since
the unification of count star and zero slot functions into
HdfsColumnarScanner and causing significant regression for non-optimized
counts star query in parquet format (over 15% slower
MaterializeTupleTime).

This patch reimplements optimized count(star) for ORC scan code path
while minimizing the code changes needed for parquet scan code path.
After this patch, ORC and parquet code path will have only the following
new things in common:
- THdfsScanNode.count_star_slot_offset renamed to
  THdfsScanNode.star_slot_offset
- New field is_footer_scanner_ added in HdfsColumnarScanner to memorize
  if the scanner's scan range is a footer range or not.
- Static function HdfsScanner::IssueFooterRanges adds one function
  parameter is_read_metadata_only. Parquet scan will pass
  ReadsParquetMetadataOnly function as argument for
  is_read_metadata_only parameter, while ORC scan will pass
  ReadsOrcMetadataOnly function. ReadsParquetMetadataOnly only checks
  for HdfsScanNodeBase::IsZeroSlotTableScan(), thus behavior remains
  unchanged for parquet scan code path. On the other hand,
  ReadsOrcMetadataOnly is more elaborate by additionally checking for
  HdfsScanNodeBase::optimize_count_star() and Hive ACID compaction
  status.

The structure of HdfsParquetScanner::GetNextInternal() remains
unchanged. Its zero scan slot code path is still served through num_rows
metadata from the parquet footer, while the optimized count star code
path still loops over row groups.

HdfsOrcScanner adds optimization_mode_ field, an enum of
OptimizationMode, with three values: NONE, ZERO_SLOT_SCAN, and
OPTIMIZED_COUNT_STAR. It is initialized in HdfsOrcScanner::Open() by
considering multiple factors including the table's Hive ACID compaction
status. HdfsOrcScanner::GetNextInternal() is reorganized to take either
of the three code paths. Unlike HdfsParquetScanner, both the optimized
count star and zero slot scan code path of HdfsOrcScanner can be served
through ORC file metadata.

The following table shows single-node benchmark result of 3 count query
variant on TPC-DS scale 10, both in ORC and parquet format.

+-----------+---------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload  | Query                     | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-----------+---------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | parquet / none / none | 0.18   | 0.17        |   +4.92%   | * 25.40% * | * 17.73% *     | 9     |   +1.32%       | 1.05    | 0.46  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | orc / def / block     | 0.16   | 0.15        |   +4.02%   | * 13.21% * | * 11.03% *     | 9     |   +1.20%       | 1.58    | 0.69  |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | orc / def / block     | 0.37   | 0.36        |   +0.33%   |   7.56%    |   5.94%        | 9     |   -0.21%       | -0.32   | 0.10  |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | parquet / none / none | 0.22   | 0.23        |   -3.63%   | * 17.21% * | * 26.61% *     | 9     |   +1.81%       | 0.53    | -0.35 |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet / none / none | 0.27   | 0.28        |   -2.28%   |   9.00%    | * 10.28% *     | 9     |   +0.13%       | 0.21    | -0.51 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | orc / def / block     | 0.16   | 0.20        | I -19.23%  | * 10.73% * | * 16.89% *     | 9     | I -32.40%      | -1.89   | -3.04 |
+-----------+---------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

We also benchmark this patch on parquet TPC-DS scale 3000 and measure
the MaterializeTupleTime. Table below shows average and sttdev of
MaterializeTupleTime in seconds, before and after patch.

+---------------------------+----------+--------------+----------+--------------+
|          SqlName          | Base Avg | Base Stddev  |  New Avg |  New Stddev  |
+---------------------------+----------+--------------+----------+--------------+
| TPCDS-Q_COUNT_OPTIMIZED   | 0.000002 | 2.618618e-07 | 0.000002 | 3.003614e-07 |
| TPCDS-Q_COUNT_UNOPTIMIZED | 1.242631 | 3.420549e-02 | 1.257202 | 3.562676e-02 |
| TPCDS-Q_COUNT_ZERO_SLOT   | 0.027607 | 1.810048e-03 | 0.027012 | 1.531121e-03 |
+---------------------------+----------+--------------+----------+--------------+

Testing:
- Added 3 count query variant into tpcds workload:
  TPCDS-Q_COUNT_OPTIMIZED, TPCDS-Q_COUNT_UNOPTIMIZED, and
  TPCDS-Q_COUNT_ZERO_SLOT.
- Restore PlannerTest.testOrcStatsAgg
- Restore TestAggregationQueriesRunOnce and
  TestAggregationQueriesRunOnce::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_optimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_unoptimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_zero_slot.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
23 files changed, 931 insertions(+), 153 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/19927/1
-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 1
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
File testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test:

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@18
PS3, Line 18: ---- QUERY
> Where was the counter fixed? I couldn't find related change in cpp
The counter increment is already corrected when I simplify the patch at ps3. I did forget to adjust the test file and fix it here at ps4.


http://gerrit.cloudera.org:8080/#/c/19927/4/tests/util/test_file_parser.py
File tests/util/test_file_parser.py:

http://gerrit.cloudera.org:8080/#/c/19927/4/tests/util/test_file_parser.py@185
PS4, Line 185: 
> nit: Make the prefix 'table_format=' and use len(prefix) to replace the har
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 08:55:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, David Rorke, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19927

to look at the new patch set (#4).

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................

IMPALA-11123: Reimplement ORC optimized count star

Commit 7ca20b3c94b1c9c1ddd4ed1e89f0969a0df55330 revert the original
optimized count(star) for ORC scan from commit
f932d78ad0a30e322d59fc39072f710f889d2135 (gerrit review
http://gerrit.cloudera.org:8080/18327). The revert is necessary since
the unification of count star and zero slot functions into
HdfsColumnarScanner and causing significant regression for non-optimized
counts star query in parquet format (over 15% slower
MaterializeTupleTime).

This patch reimplements optimized count(star) for ORC scan code path
while minimizing the code changes needed for parquet scan code path.
After this patch, ORC and parquet code path will have only the following
new things in common:
- THdfsScanNode.count_star_slot_offset renamed to
  THdfsScanNode.star_slot_offset
- HdfsScanner::IssueFooterRanges will only issue footer ranges if
  IsZeroSlotTableScan() or optimize_count_star() is true (made possible
  for parquet by IMPALA-12631).

The structure of HdfsParquetScanner::GetNextInternal() remains
unchanged. Its zero scan slot code path is still served through num_rows
metadata from the parquet footer, while the optimized count star code
path still loops over row groups metadata (also from parquet footer).

The following table shows single-node benchmark result of 3 count query
variant on TPC-DS scale 10, both in ORC and parquet format, looped 9
times.

+-----------+---------------------------+---------+--------+-------------+------------+
| Workload  | Query                     | Format  | Avg(s) | Base Avg(s) | Delta(Avg) |
+-----------+---------------------------+---------+--------+-------------+------------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | orc     | 0.30   | 0.28        |   +6.50%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | parquet | 0.14   | 0.14        |   +1.56%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | parquet | 0.27   | 0.27        |   +1.42%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | orc     | 0.28   | 0.29        |   -3.03%   |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet | 0.21   | 0.22        |   -4.45%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | orc     | 0.14   | 0.21        | I -35.92%  |
+-----------+---------------------------+---------+--------+-------------+------------+

Testing:
- Restore PlannerTest.testOrcStatsAgg
- Restore TestAggregationQueriesRunOnce and
  TestAggregationQueriesRunOnce::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
17 files changed, 782 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/19927/4
-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Zihao Ye, David Rorke, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19927

to look at the new patch set (#5).

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................

IMPALA-11123: Reimplement ORC optimized count star

Commit 7ca20b3c94b1c9c1ddd4ed1e89f0969a0df55330 revert the original
optimized count(star) for ORC scan from commit
f932d78ad0a30e322d59fc39072f710f889d2135 (gerrit review
http://gerrit.cloudera.org:8080/18327). The revert is necessary since
the unification of count star and zero slot functions into
HdfsColumnarScanner and causing significant regression for non-optimized
counts star query in parquet format (over 15% slower
MaterializeTupleTime).

This patch reimplements optimized count(star) for ORC scan code path
while minimizing the code changes needed for parquet scan code path.
After this patch, ORC and parquet code path will have only the following
new things in common:
- THdfsScanNode.count_star_slot_offset renamed to
  THdfsScanNode.star_slot_offset
- HdfsScanner::IssueFooterRanges will only issue footer ranges if
  IsZeroSlotTableScan() or optimize_count_star() is true (made possible
  for parquet by IMPALA-12631).

The structure of HdfsParquetScanner::GetNextInternal() remains
unchanged. Its zero scan slot code path is still served through num_rows
metadata from the parquet footer, while the optimized count star code
path still loops over row groups metadata (also from parquet footer).

The following table shows single-node benchmark result of 3 count query
variant on TPC-DS scale 10, both in ORC and parquet format, looped 9
times.

+-----------+---------------------------+---------+--------+-------------+------------+
| Workload  | Query                     | Format  | Avg(s) | Base Avg(s) | Delta(Avg) |
+-----------+---------------------------+---------+--------+-------------+------------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | orc     | 0.30   | 0.28        |   +6.50%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | parquet | 0.14   | 0.14        |   +1.56%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | parquet | 0.27   | 0.27        |   +1.42%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | orc     | 0.28   | 0.29        |   -3.03%   |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet | 0.21   | 0.22        |   -4.45%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | orc     | 0.14   | 0.21        | I -35.92%  |
+-----------+---------------------------+---------+--------+-------------+------------+

Testing:
- Restore PlannerTest.testOrcStatsAgg
- Restore TestAggregationQueriesRunOnce and
  TestAggregationQueriesRunOnce::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
17 files changed, 783 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/19927/5
-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc
File be/src/exec/orc/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc@409
PS3, Line 409:   if (scan_node_->optimize_count_star()) {
> In HdfsScanNode.java, I choose to not enable optimized count star for ACID 
I see, adding the DCHECKs looks good to me.


http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
File testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test:

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@18
PS3, Line 18: ---- QUERY
> Fixed this. Both Parquet and ORC should hold the same assertion.
Where was the counter fixed? I couldn't find related change in cpp



-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 08:17:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10296/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 16:05:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 6:

(1 comment)

ps6 add select count star test over full acid table.

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
File testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test:

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@18
PS3, Line 18: ---- QUERY
> My previous statement is wrong. Let me double check again.
ps6 increment rows_read_counter().



-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 10:21:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc
File be/src/exec/orc/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc@409
PS3, Line 409:   if (scan_node_->optimize_count_star()) {
it is possible for row_batches_need_validation_ to be true while scan_node_->optimize_count_star() is also true?  I think that we cannot use the optimization in that case.


http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc@811
PS3, Line 811:   else if (scan_node_->IsZeroSlotTableScan() && !row_batches_need_validation_) {
this formatting is very unusual in Impala - can you move the else if to line 807 after } and move the comment inside the block?


http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
File testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test:

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@18
PS3, Line 18: aggregation(SUM, RowsRead): 0
Is there a reason behind returning number of files in Parquet and 0 in ORC? The ORC scanner could also increment the counter in the count(*) path with COUNTER_ADD(scan_node_->rows_read_counter(), 1);


http://gerrit.cloudera.org:8080/#/c/19927/3/tests/util/test_file_parser.py
File tests/util/test_file_parser.py:

http://gerrit.cloudera.org:8080/#/c/19927/3/tests/util/test_file_parser.py@269
PS3, Line 269:         if subsection_comment is not None and subsection_comment is not "":
this block looks more complex than necessary

can you create function like parse_runtime_profile_table_formats() that does the parsing / allowed format checking and returns the list of parsed_formats?



-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Feb 2024 14:38:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Zihao Ye (Code Review)" <ge...@cloudera.org>.
Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 4: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19927/4/tests/util/test_file_parser.py
File tests/util/test_file_parser.py:

http://gerrit.cloudera.org:8080/#/c/19927/4/tests/util/test_file_parser.py@185
PS4, Line 185: 13
nit: Make the prefix 'table_format=' and use len(prefix) to replace the hardcoded number, which will be more readable.



-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 08:02:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 6: Code-Review+2

thanks for the changes, lgtm!


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 12:52:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, David Rorke, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19927

to look at the new patch set (#3).

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................

IMPALA-11123: Reimplement ORC optimized count star

Commit 7ca20b3c94b1c9c1ddd4ed1e89f0969a0df55330 revert the original
optimized count(star) for ORC scan from commit
f932d78ad0a30e322d59fc39072f710f889d2135 (gerrit review
http://gerrit.cloudera.org:8080/18327). The revert is necessary since
the unification of count star and zero slot functions into
HdfsColumnarScanner and causing significant regression for non-optimized
counts star query in parquet format (over 15% slower
MaterializeTupleTime).

This patch reimplements optimized count(star) for ORC scan code path
while minimizing the code changes needed for parquet scan code path.
After this patch, ORC and parquet code path will have only the following
new things in common:
- THdfsScanNode.count_star_slot_offset renamed to
  THdfsScanNode.star_slot_offset
- HdfsScanner::IssueFooterRanges will only issue footer ranges if
  IsZeroSlotTableScan() or optimize_count_star() is true (made possible
  for parquet by IMPALA-12631).

The structure of HdfsParquetScanner::GetNextInternal() remains
unchanged. Its zero scan slot code path is still served through num_rows
metadata from the parquet footer, while the optimized count star code
path still loops over row groups metadata (also from parquet footer).

The following table shows single-node benchmark result of 3 count query
variant on TPC-DS scale 10, both in ORC and parquet format, looped 9
times.

+-----------+---------------------------+---------+--------+-------------+------------+
| Workload  | Query                     | Format  | Avg(s) | Base Avg(s) | Delta(Avg) |
+-----------+---------------------------+---------+--------+-------------+------------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | orc     | 0.30   | 0.28        |   +6.50%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | parquet | 0.14   | 0.14        |   +1.56%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | parquet | 0.27   | 0.27        |   +1.42%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | orc     | 0.28   | 0.29        |   -3.03%   |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet | 0.21   | 0.22        |   -4.45%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | orc     | 0.14   | 0.21        | I -35.92%  |
+-----------+---------------------------+---------+--------+-------------+------------+

Testing:
- Restore PlannerTest.testOrcStatsAgg
- Restore TestAggregationQueriesRunOnce and
  TestAggregationQueriesRunOnce::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
17 files changed, 768 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/19927/3
-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc
File be/src/exec/orc/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc@409
PS3, Line 409:   if (scan_node_->optimize_count_star()) {
> it is possible for row_batches_need_validation_ to be true while scan_node_
In HdfsScanNode.java, I choose to not enable optimized count star for ACID table.
So row_batches_need_validation_ should be false here.
This is isFullAcidTable function in AcidUtils.java:

  public static boolean isFullAcidTable(Map<String, String> props) {                                                                                                                                                                                                                      
    return isTransactionalTable(props) && !isInsertOnlyTable(props);                                                                                                                                                                                                                      
  }

Should I disable optimized count star for all transactional table, just to be safe? (isTransactionalTable(props) == True)


http://gerrit.cloudera.org:8080/#/c/19927/3/be/src/exec/orc/hdfs-orc-scanner.cc@811
PS3, Line 811:     // batches and check their validity. In that case 'currentTransaction' is the only
> this formatting is very unusual in Impala - can you move the else if to lin
Done


http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
File testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test:

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@18
PS3, Line 18: ---- QUERY
> Is there a reason behind returning number of files in Parquet and 0 in ORC?
Fixed this. Both Parquet and ORC should hold the same assertion.


http://gerrit.cloudera.org:8080/#/c/19927/3/tests/util/test_file_parser.py
File tests/util/test_file_parser.py:

http://gerrit.cloudera.org:8080/#/c/19927/3/tests/util/test_file_parser.py@269
PS3, Line 269:         elif subsection_comment == 'ANY_OF':
> this block looks more complex than necessary
Done. Fixed couple flake8 errors as well.



-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Feb 2024 20:19:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, David Rorke, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19927

to look at the new patch set (#2).

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................

IMPALA-11123: Reimplement ORC optimized count star

Commit 7ca20b3c94b1c9c1ddd4ed1e89f0969a0df55330 revert the original
optimized count(star) for ORC scan from commit
f932d78ad0a30e322d59fc39072f710f889d2135 (gerrit review
http://gerrit.cloudera.org:8080/18327). The revert is necessary since
the unification of count star and zero slot functions into
HdfsColumnarScanner and causing significant regression for non-optimized
counts star query in parquet format (over 15% slower
MaterializeTupleTime).

This patch reimplements optimized count(star) for ORC scan code path
while minimizing the code changes needed for parquet scan code path.
After this patch, ORC and parquet code path will have only the following
new things in common:
- THdfsScanNode.count_star_slot_offset renamed to
  THdfsScanNode.star_slot_offset
- Static function HdfsScanner::IssueFooterRanges adds one function
  parameter is_read_metadata_only. Parquet scan will pass
  ReadsParquetMetadataOnly function as argument for
  is_read_metadata_only parameter, while ORC scan will pass
  ReadsOrcMetadataOnly function. ReadsParquetMetadataOnly only checks
  for HdfsScanNodeBase::IsZeroSlotTableScan(), thus behavior remains
  unchanged for parquet scan code path. On the other hand,
  ReadsOrcMetadataOnly is more elaborate by additionally checking for
  HdfsScanNodeBase::optimize_count_star() and Hive ACID compaction
  status.

The structure of HdfsParquetScanner::GetNextInternal() remains
unchanged. Its zero scan slot code path is still served through num_rows
metadata from the parquet footer, while the optimized count star code
path still loops over row groups metadata (also from parquet footer).

HdfsOrcScanner adds optimization_mode_ field, an enum of
OptimizationMode, with three values: NONE, ZERO_SLOT_SCAN, and
OPTIMIZED_COUNT_STAR. It is initialized in HdfsOrcScanner::Open() by
considering multiple factors including the table's Hive ACID compaction
status. HdfsOrcScanner::GetNextInternal() is reorganized to take either
of the three code paths.

The following table shows single-node benchmark result of 3 count query
variant on TPC-DS scale 10, both in ORC and parquet format.

+-----------+---------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload  | Query                     | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-----------+---------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | parquet / none / none | 0.18   | 0.17        |   +4.92%   | * 25.40% * | * 17.73% *     | 9     |   +1.32%       | 1.05    | 0.46  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | orc / def / block     | 0.16   | 0.15        |   +4.02%   | * 13.21% * | * 11.03% *     | 9     |   +1.20%       | 1.58    | 0.69  |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | orc / def / block     | 0.37   | 0.36        |   +0.33%   |   7.56%    |   5.94%        | 9     |   -0.21%       | -0.32   | 0.10  |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | parquet / none / none | 0.22   | 0.23        |   -3.63%   | * 17.21% * | * 26.61% *     | 9     |   +1.81%       | 0.53    | -0.35 |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet / none / none | 0.27   | 0.28        |   -2.28%   |   9.00%    | * 10.28% *     | 9     |   +0.13%       | 0.21    | -0.51 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | orc / def / block     | 0.16   | 0.20        | I -19.23%  | * 10.73% * | * 16.89% *     | 9     | I -32.40%      | -1.89   | -3.04 |
+-----------+---------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

We also benchmark this patch on parquet TPC-DS scale 3000 and measure
the MaterializeTupleTime. Table below shows average and sttdev of
MaterializeTupleTime in seconds, before and after patch.

+---------------------------+----------+--------------+----------+--------------+
|          SqlName          | Base Avg | Base Stddev  |  New Avg |  New Stddev  |
+---------------------------+----------+--------------+----------+--------------+
| TPCDS-Q_COUNT_OPTIMIZED   | 0.000002 | 2.618618e-07 | 0.000002 | 3.003614e-07 |
| TPCDS-Q_COUNT_UNOPTIMIZED | 1.242631 | 3.420549e-02 | 1.257202 | 3.562676e-02 |
| TPCDS-Q_COUNT_ZERO_SLOT   | 0.027607 | 1.810048e-03 | 0.027012 | 1.531121e-03 |
+---------------------------+----------+--------------+----------+--------------+

Testing:
- Added 3 count query variant into tpcds workload:
  TPCDS-Q_COUNT_OPTIMIZED, TPCDS-Q_COUNT_UNOPTIMIZED, and
  TPCDS-Q_COUNT_ZERO_SLOT.
- Restore PlannerTest.testOrcStatsAgg
- Restore TestAggregationQueriesRunOnce and
  TestAggregationQueriesRunOnce::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
18 files changed, 905 insertions(+), 167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/19927/2
-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Zihao Ye, David Rorke, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19927

to look at the new patch set (#6).

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................

IMPALA-11123: Reimplement ORC optimized count star

Commit 7ca20b3c94b1c9c1ddd4ed1e89f0969a0df55330 revert the original
optimized count(star) for ORC scan from commit
f932d78ad0a30e322d59fc39072f710f889d2135 (gerrit review
http://gerrit.cloudera.org:8080/18327). The revert is necessary since
the unification of count star and zero slot functions into
HdfsColumnarScanner and causing significant regression for non-optimized
counts star query in parquet format (over 15% slower
MaterializeTupleTime).

This patch reimplements optimized count(star) for ORC scan code path
while minimizing the code changes needed for parquet scan code path.
After this patch, ORC and parquet code path will have only the following
new things in common:
- THdfsScanNode.count_star_slot_offset renamed to
  THdfsScanNode.star_slot_offset
- HdfsScanner::IssueFooterRanges will only issue footer ranges if
  IsZeroSlotTableScan() or optimize_count_star() is true (made possible
  for parquet by IMPALA-12631).

The structure of HdfsParquetScanner::GetNextInternal() remains
unchanged. Its zero scan slot code path is still served through num_rows
metadata from the parquet footer, while the optimized count star code
path still loops over row groups metadata (also from parquet footer).

The following table shows single-node benchmark result of 3 count query
variant on TPC-DS scale 10, both in ORC and parquet format, looped 9
times.

+-----------+---------------------------+---------+--------+-------------+------------+
| Workload  | Query                     | Format  | Avg(s) | Base Avg(s) | Delta(Avg) |
+-----------+---------------------------+---------+--------+-------------+------------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | orc     | 0.30   | 0.28        |   +6.50%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | parquet | 0.14   | 0.14        |   +1.56%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | parquet | 0.27   | 0.27        |   +1.42%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | orc     | 0.28   | 0.29        |   -3.03%   |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet | 0.21   | 0.22        |   -4.45%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | orc     | 0.14   | 0.21        | I -35.92%  |
+-----------+---------------------------+---------+--------+-------------+------------+

Testing:
- Restore PlannerTest.testOrcStatsAgg
- Restore TestAggregationQueriesRunOnce and
  TestAggregationQueriesRunOnce::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
17 files changed, 809 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/27/19927/6
-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 16:05:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 2:

ps2 is a rebase.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2024 16:59:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/15248/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2024 18:11:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/13111/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 1
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 May 2023 17:58:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 7: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 20:43:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/15271/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Feb 2024 02:19:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/15247/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2024 17:13:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/15287/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 09:20:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 3:

ps3 simplify the patch.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Feb 2024 01:54:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
File testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test:

http://gerrit.cloudera.org:8080/#/c/19927/3/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test@18
PS3, Line 18: ---- QUERY
> The counter increment is already corrected when I simplify the patch at ps3
My previous statement is wrong. Let me double check again.



-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 09:47:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/15280/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Feb 2024 20:29:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/15289/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>
Gerrit-Comment-Date: Fri, 23 Feb 2024 10:41:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11123: Reimplement ORC optimized count star

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/19927 )

Change subject: IMPALA-11123: Reimplement ORC optimized count star
......................................................................

IMPALA-11123: Reimplement ORC optimized count star

Commit 7ca20b3c94b1c9c1ddd4ed1e89f0969a0df55330 revert the original
optimized count(star) for ORC scan from commit
f932d78ad0a30e322d59fc39072f710f889d2135 (gerrit review
http://gerrit.cloudera.org:8080/18327). The revert is necessary since
the unification of count star and zero slot functions into
HdfsColumnarScanner and causing significant regression for non-optimized
counts star query in parquet format (over 15% slower
MaterializeTupleTime).

This patch reimplements optimized count(star) for ORC scan code path
while minimizing the code changes needed for parquet scan code path.
After this patch, ORC and parquet code path will have only the following
new things in common:
- THdfsScanNode.count_star_slot_offset renamed to
  THdfsScanNode.star_slot_offset
- HdfsScanner::IssueFooterRanges will only issue footer ranges if
  IsZeroSlotTableScan() or optimize_count_star() is true (made possible
  for parquet by IMPALA-12631).

The structure of HdfsParquetScanner::GetNextInternal() remains
unchanged. Its zero scan slot code path is still served through num_rows
metadata from the parquet footer, while the optimized count star code
path still loops over row groups metadata (also from parquet footer).

The following table shows single-node benchmark result of 3 count query
variant on TPC-DS scale 10, both in ORC and parquet format, looped 9
times.

+-----------+---------------------------+---------+--------+-------------+------------+
| Workload  | Query                     | Format  | Avg(s) | Base Avg(s) | Delta(Avg) |
+-----------+---------------------------+---------+--------+-------------+------------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | orc     | 0.30   | 0.28        |   +6.50%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | parquet | 0.14   | 0.14        |   +1.56%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | parquet | 0.27   | 0.27        |   +1.42%   |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | orc     | 0.28   | 0.29        |   -3.03%   |
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | parquet | 0.21   | 0.22        |   -4.45%   |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | orc     | 0.14   | 0.21        | I -35.92%  |
+-----------+---------------------------+---------+--------+-------------+------------+

Testing:
- Restore PlannerTest.testOrcStatsAgg
- Restore TestAggregationQueriesRunOnce and
  TestAggregationQueriesRunOnce::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Reviewed-on: http://gerrit.cloudera.org:8080/19927
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/orc/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
M testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/query_test/test_aggregation.py
M tests/query_test/test_scanners.py
M tests/util/test_file_parser.py
17 files changed, 809 insertions(+), 98 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/19927
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I5971c8f278e1dee44e2a8dd4d2f043d22ebf5d17
Gerrit-Change-Number: 19927
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <ey...@163.com>