You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2022/03/16 13:35:35 UTC

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18325


Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................

IMPALA-11185: Reuse orc row batch in the scanner life-cycle

In HdfsOrcScanner::AssembleRows(), we always re-create a
orc::ColumnVectorBatch. The ideal pattern is reusing the batch and only
destroyed it when the scanner is closed.

This save half of the scanner time in some TPCH queries. See the flame
graph in JIRA description.

Tests:
 - Run CORE test

Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
---
M be/src/exec/hdfs-orc-scanner.cc
1 file changed, 2 insertions(+), 5 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/25/18325/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7942/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 18 Mar 2022 14:29:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Riza Suminto, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18325

to look at the new patch set (#2).

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................

IMPALA-11185: Reuse orc row batch in the scanner life-cycle

In HdfsOrcScanner::AssembleRows(), we always re-create a
orc::ColumnVectorBatch. The ideal pattern is reusing the batch and only
destroying it when the scanner is closed.

This save half of the scanner time in some TPCH queries. See the flame
graph in JIRA description.

Tests:
 - Run CORE test

Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
---
M be/src/exec/hdfs-orc-scanner.cc
1 file changed, 2 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/25/18325/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10295/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Mar 2022 13:56:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Riza Suminto, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18325

to look at the new patch set (#3).

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................

IMPALA-11185: Reuse orc row batch in the scanner life-cycle

In HdfsOrcScanner::AssembleRows(), we always re-create a
orc::ColumnVectorBatch. The ideal pattern is reusing the batch and only
destroying it when the scanner is closed.

This save half of the scanner time in some TPCH queries. See the flame
graph in JIRA description.

Tests:
 - Run CORE test

Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
---
M be/src/exec/hdfs-orc-scanner.cc
1 file changed, 2 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/25/18325/3
-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 3: Code-Review+2

Carry on Csaba's +2.


-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 18 Mar 2022 14:29:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 2: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18325/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18325/2//COMMIT_MSG@13
PS2, Line 13: This save half of the scanner time in some TPCH queries. See the flame
            : graph in JIRA description.
Wow, great improvement!



-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Mar 2022 13:49:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18325/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18325/1//COMMIT_MSG@11
PS1, Line 11: destroyin
> destroying
Done


http://gerrit.cloudera.org:8080/#/c/18325/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18325/1/be/src/exec/hdfs-orc-scanner.cc@444
PS1, Line 444: orc_root_batch_ = tmp_row_reader->createRowBatch(state_->batch_size());
> Do we need to clear orc_root_batch_ in between AssembleRow?
The ORC lib will do this for us and reuse the buffers.
https://github.com/apache/orc/blob/199ab7711d6df98cfdc2ac06436860e05bb9a65a/c++/src/Reader.cc#L1121
Take int column as an example, the column reader materializes data at the begining of the buffer.
https://github.com/apache/orc/blob/5c48e291f3cbb4060243895739977102f82b861a/c++/src/ColumnReader.cc#L293

The ORC tools also reuses the batch:
https://github.com/apache/orc/blob/9d45c92402cc8d62b363bebab09f7936b1792e5f/tools/src/FileScan.cc#L34
https://github.com/apache/orc/blob/9d45c92402cc8d62b363bebab09f7936b1792e5f/tools/src/FileContents.cc#L37



-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Mar 2022 01:49:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 3: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 18 Mar 2022 19:26:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18325/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18325/1//COMMIT_MSG@11
PS1, Line 11: destroyed
destroying


http://gerrit.cloudera.org:8080/#/c/18325/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18325/1/be/src/exec/hdfs-orc-scanner.cc@444
PS1, Line 444: orc_root_batch_ = tmp_row_reader->createRowBatch(state_->batch_size());
Do we need to clear orc_root_batch_ in between AssembleRow?
Previously, we don't need to, because we always create a new one.



-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Mar 2022 15:32:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10307/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 18 Mar 2022 14:47:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................

IMPALA-11185: Reuse orc row batch in the scanner life-cycle

In HdfsOrcScanner::AssembleRows(), we always re-create a
orc::ColumnVectorBatch. The ideal pattern is reusing the batch and only
destroying it when the scanner is closed.

This save half of the scanner time in some TPCH queries. See the flame
graph in JIRA description.

Tests:
 - Run CORE test

Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Reviewed-on: http://gerrit.cloudera.org:8080/18325
Reviewed-by: Quanlong Huang <hu...@gmail.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-orc-scanner.cc
1 file changed, 2 insertions(+), 5 deletions(-)

Approvals:
  Quanlong Huang: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-11185: Reuse orc row batch in the scanner life-cycle

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18325 )

Change subject: IMPALA-11185: Reuse orc row batch in the scanner life-cycle
......................................................................


Patch Set 2: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18325/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18325/1/be/src/exec/hdfs-orc-scanner.cc@444
PS1, Line 444: orc_root_batch_ = tmp_row_reader->createRowBatch(state_->batch_size());
> The ORC lib will do this for us and reuse the buffers.
Thanks for the explanations!



-- 
To view, visit http://gerrit.cloudera.org:8080/18325
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I03887ed94af2ff03d67cd00c79375c734a75af62
Gerrit-Change-Number: 18325
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Mar 2022 02:10:02 +0000
Gerrit-HasComments: Yes