You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2022/03/27 23:24:34 UTC

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18359


Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................

IMPALA-11204: template implementation for OrcStringColumnReader

There are some checks in OrcStringColumnReader::ReadValue() that we can
determine outside the scope of this method. They should be optimized
since this is a critical method that will be executed for each row (and
for each string column). With these checks, the method is too complex to
be inlined by the compiler in OrcBatchedReader::ReadValueBatch().

This patch templates OrcStringColumnReader with two parameters, one for
whether the column is dictionary encoded, the other for the target slot
type (i.e. STRING/CHAR/VARCHAR). Compiler is able to inline
OrcStringColumnReader::ReadValue() after this patch.

The encoding of a column can change in different ORC stripes. So we have
to re-create the column readers for each stripe. Note that we already do
so for orc::RowReader. So this patch changes the life-cycle of
OrcColumnReaders to match the processing of each stripe. They are now
managed by std::unique_ptr. This requires OrcStructReader be defined
earlier than HdfsOrcScanner. So we include orc-column-readers.h in
hdfs-orc-scanner.h and move all code that depends on the scanner
implementation in orc-column-readers.h to the source file.

Ran a single node perf test on TPCH(30) on my dev box using 3 impalad
instances. There are some improvements and no significant regressions:
+----------+--------+-------------+------------+
| Query    | Avg(s) | Base Avg(s) | Delta(Avg) |
+----------+--------+-------------+------------+
| TPCH-Q19 | 5.42   | 5.78        | I -6.21%   |
| TPCH-Q4  | 3.43   | 3.69        | I -7.25%   |
| TPCH-Q6  | 2.25   | 2.45        | I -8.18%   |
| TPCH-Q12 | 3.95   | 4.54        | I -13.04%  |
+----------+--------+-------------+------------+
File Format: orc/snap/block

Tests:
 - Ran CORE tests.

Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 240 insertions(+), 195 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/59/18359/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Patch Set 2:

(1 comment)

Thank Csaba and Riza! A new approach based on Csaba's suggestion is uploaded at https://gerrit.cloudera.org/c/18366

http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@922
PS2, Line 922: make_unique
> An alternative to recreating readers could be to override ReadValueBatch in
That's a much simpler approach indeed! My first choice is something similar but got stuck by some template issues.

I have a try with this and the performance improvement is similar. I tend to abandon this. We can revive it if one day we need to change the life cycle of readers.



-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 30 Mar 2022 06:46:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/10342/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 23:44:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@203
PS2, Line 203: Status HdfsOrcScanner::StartColumnReading(const orc::StripeInformation& stripe) {
Since we have 'stripe_' as a new class member in this patch, can we remove this 'stripe' parameter?



-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 15:32:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc@193
PS1, Line 193: Status OrcBatchedReader<Final>::ReadValueBatch(int row_idx, ScratchTupleBatch* scratch_batch,
line too long (93 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Sun, 27 Mar 2022 23:25:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10344/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 28 Mar 2022 06:12:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18359

to look at the new patch set (#2).

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................

IMPALA-11204: template implementation for OrcStringColumnReader

There are some checks in OrcStringColumnReader::ReadValue() that we can
determine outside the scope of this method. They should be optimized
since this is a critical method that will be executed for each row (and
for each string column). With these checks, the method is too complex to
be inlined by the compiler in OrcBatchedReader::ReadValueBatch().

This patch templates OrcStringColumnReader with two parameters, one for
whether the column is dictionary encoded, the other for the target slot
type (i.e. STRING/CHAR/VARCHAR). Compiler is able to inline
OrcStringColumnReader::ReadValue() after this patch.

The encoding of a column can change in different ORC stripes. So we have
to re-create the column readers for each stripe. Note that we already do
so for orc::RowReader. So this patch changes the life-cycle of
OrcColumnReaders to match the processing of each stripe. They are now
managed by std::unique_ptr. This requires OrcStructReader be defined
earlier than HdfsOrcScanner. So we include orc-column-readers.h in
hdfs-orc-scanner.h and move all code that depends on the scanner
implementation in orc-column-readers.h to the source file.

Ran a single node perf test on TPCH(30) on my dev box using 3 impalad
instances. There are some improvements and no significant regressions:
+----------+--------+-------------+------------+
| Query    | Avg(s) | Base Avg(s) | Delta(Avg) |
+----------+--------+-------------+------------+
| TPCH-Q19 | 5.42   | 5.78        | I -6.21%   |
| TPCH-Q4  | 3.43   | 3.69        | I -7.25%   |
| TPCH-Q6  | 2.25   | 2.45        | I -8.18%   |
| TPCH-Q12 | 3.95   | 4.54        | I -13.04%  |
+----------+--------+-------------+------------+
File Format: orc/snap/block

Tests:
 - Ran CORE tests.

Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 240 insertions(+), 196 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/59/18359/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/18359/1/be/src/exec/orc-column-readers.cc@193
PS1, Line 193:     ScratchTupleBatch* scratch_batch, MemPool* pool, int scratch_batch_idx) {
> line too long (93 > 90)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 28 Mar 2022 05:52:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Patch Set 2: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@919
PS2, Line 919:  in case 
"in case" suggests that we only recreate it when there is an encoding change, but we always recreate it. I think that "to handle column encoding changes" would be clearer.


http://gerrit.cloudera.org:8080/#/c/18359/2/be/src/exec/hdfs-orc-scanner.cc@922
PS2, Line 922: make_unique
An alternative to recreating readers could be to override ReadValueBatch in OrcStringColumnReader and call a templated version of ReadValue from there.

I don't think that this would be a better solution in general, but it could reduce the size of the change.



-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 29 Mar 2022 13:50:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11204: template implementation for OrcStringColumnReader

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has abandoned this change. ( http://gerrit.cloudera.org:8080/18359 )

Change subject: IMPALA-11204: template implementation for OrcStringColumnReader
......................................................................


Abandoned

Choose another solution. See https://gerrit.cloudera.org/c/18366
-- 
To view, visit http://gerrit.cloudera.org:8080/18359
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I166b8ad3a959e97a3911da968b8e76bc337e5fa4
Gerrit-Change-Number: 18359
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>