You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2022/03/27 23:24:34 UTC
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18359
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
IMPALA-11204: template implementation for OrcStringColumnReader
There are some checks in OrcStringColumnReader::ReadValue() that we can
determine outside the scope of this method. They should be optimized
since this is a critical method that will be executed for each row (and
for each string column). With these checks, the method is too complex to
be inlined by the compiler in OrcBatchedReader::ReadValueBatch().
This patch templates OrcStringColumnReader with two parameters, one for
whether the column is dictionary encoded, the other for the target slot
type (i.e. STRING/CHAR/VARCHAR). Compiler is able to inline
OrcStringColumnReader::ReadValue() after this patch.
The encoding of a column can change in different ORC stripes. So we have
to re-create the column readers for each stripe. Note that we already do
so for orc::RowReader. So this patch changes the life-cycle of
OrcColumnReaders to match the processing of each stripe. They are now
managed by std::unique_ptr. This requires OrcStructReader be defined
earlier than HdfsOrcScanner. So we include orc-column-readers.h in
hdfs-orc-scanner.h and move all code that depends on the scanner
implementation in orc-column-readers.h to the source file.
Ran a single node perf test on TPCH(30) on my dev box using 3 impalad
instances. There are some improvements and no significant regressions:
+----------+--------+-------------+------------+
| Query | Avg(s) | Base Avg(s) | Delta(Avg) |
+----------+--------+-------------+------------+
| TPCH-Q19 | 5.42 | 5.78 | I -6.21% |
| TPCH-Q4 | 3.43 | 3.69 | I -7.25% |
| TPCH-Q6 | 2.25 | 2.45 | I -8.18% |
| TPCH-Q12 | 3.95 | 4.54 | I -13.04% |
+----------+--------+-------------+------------+
File Format: orc/snap/block
Tests:
- Ran CORE tests.
Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 240 insertions(+), 195 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/59/18359/1
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Patch Set 2:
(1 comment)
Thank Csaba and Riza! A new approach based on Csaba's suggestion is uploaded at https://gerrit.cloudera.org/c/18366
http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@922
PS2, Line 922: make_unique
> An alternative to recreating readers could be to override ReadValueBatch in
That's a much simpler approach indeed! My first choice is something similar but got stuck by some template issues.
I have a try with this and the performance improvement is similar. I tend to abandon this. We can revive it if one day we need to change the life cycle of readers.
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 30 Mar 2022 06:46:05 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Patch Set 1:
Build Failed
https://jenkins.impala.io/job/gerrit-code-review-checks/10342/ : Initial code review checks failed. See linked job for details on the failure.
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 23:44:29 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Patch Set 2:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@203
PS2, Line 203: Status HdfsOrcScanner::StartColumnReading(const orc::StripeInformation& stripe) {
Since we have 'stripe_' as a new class member in this patch, can we remove this 'stripe' parameter?
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 15:32:05 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Patch Set 1:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:
http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc@193
PS1, Line 193: Status OrcBatchedReader<Final>::ReadValueBatch(int row_idx, ScratchTupleBatch* scratch_batch,
line too long (93 > 90)
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 23:25:22 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Patch Set 2:
Build Successful
https://jenkins.impala.io/job/gerrit-code-review-checks/10344/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 28 Mar 2022 06:12:07 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/18359
to look at the new patch set (#2).
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
IMPALA-11204: template implementation for OrcStringColumnReader
There are some checks in OrcStringColumnReader::ReadValue() that we can
determine outside the scope of this method. They should be optimized
since this is a critical method that will be executed for each row (and
for each string column). With these checks, the method is too complex to
be inlined by the compiler in OrcBatchedReader::ReadValueBatch().
This patch templates OrcStringColumnReader with two parameters, one for
whether the column is dictionary encoded, the other for the target slot
type (i.e. STRING/CHAR/VARCHAR). Compiler is able to inline
OrcStringColumnReader::ReadValue() after this patch.
The encoding of a column can change in different ORC stripes. So we have
to re-create the column readers for each stripe. Note that we already do
so for orc::RowReader. So this patch changes the life-cycle of
OrcColumnReaders to match the processing of each stripe. They are now
managed by std::unique_ptr. This requires OrcStructReader be defined
earlier than HdfsOrcScanner. So we include orc-column-readers.h in
hdfs-orc-scanner.h and move all code that depends on the scanner
implementation in orc-column-readers.h to the source file.
Ran a single node perf test on TPCH(30) on my dev box using 3 impalad
instances. There are some improvements and no significant regressions:
+----------+--------+-------------+------------+
| Query | Avg(s) | Base Avg(s) | Delta(Avg) |
+----------+--------+-------------+------------+
| TPCH-Q19 | 5.42 | 5.78 | I -6.21% |
| TPCH-Q4 | 3.43 | 3.69 | I -7.25% |
| TPCH-Q6 | 2.25 | 2.45 | I -8.18% |
| TPCH-Q12 | 3.95 | 4.54 | I -13.04% |
+----------+--------+-------------+------------+
File Format: orc/snap/block
Tests:
- Ran CORE tests.
Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 240 insertions(+), 196 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/59/18359/2
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Patch Set 2:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:
http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc@193
PS1, Line 193: ScratchTupleBatch* scratch_batch, MemPool* pool, int scratch_batch_idx) {
> line too long (93 > 90)
Done
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 28 Mar 2022 05:52:33 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Patch Set 2: Code-Review+1
(2 comments)
http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@919
PS2, Line 919: in case
"in case" suggests that we only recreate it when there is an encoding change, but we always recreate it. I think that "to handle column encoding changes" would be clearer.
http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@922
PS2, Line 922: make_unique
An alternative to recreating readers could be to override ReadValueBatch in OrcStringColumnReader and call a templated version of ReadValue from there.
I don't think that this would be a better solution in general, but it could reduce the size of the change.
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 13:50:30 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader
Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has abandoned this change. ( http://gerrit.cloudera.org:8080/18359 )
Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................
Abandoned
Choose another solution. See https://gerrit.cloudera.org/c/18366
--
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>