You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2021/08/09 14:34:45 UTC

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Csaba Ringhofer has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15370


Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
11 files changed, 386 insertions(+), 188 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/2
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7447/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 14:30:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7454/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 06 Sep 2021 16:53:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 9: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7464/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 10 Sep 2021 04:50:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#17) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 517 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/17
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 21:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10003/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Jan 2022 20:00:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 29:

> Patch Set 29: Verified-1
> 
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7815/

This time it fails on flaky TestFetchAndSpooling.test_rows_sent_counters.
Previous pre-commit test does not hit this issue.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 29
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 Feb 2022 01:11:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7462/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 20:28:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#13) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
14 files changed, 475 insertions(+), 213 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/13
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#14) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Currently, there are corner cases where planner might underestimate the
number of async IO stream for a table. A case like "select count(*)"
over complex type column might have empty desc._getSlots() in
HdfsScanNode.computeMinColumnMemReservations, but
HdfsOrcScanner::StartColumnReading later see couple streams that are
eligible for async IO. In this situation, HdfsOrcScanner will try to
increase reservation 8KB (min_buffer_size) for each eligible stream.
Once the reservation increment fails, it will read the rest of the
streams synchronously.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+--------------------------+----------------------+---------------------+
| Total time               | ORC_ASYNC_READ=false | ORC_ASYNC_READ=true |
+--------------------------+----------------------+---------------------+
| DISABLE_DATA_CACHE=false |              3511075 |             3484736 |
| DISABLE_DATA_CACHE=true  |              5243337 |             4370095 |
+--------------------------+----------------------+---------------------+

+--------------------------+----------------------+---------------------+
| Geomean                  | ORC_ASYNC_READ=false | ORC_ASYNC_READ=true |
+--------------------------+----------------------+---------------------+
| DISABLE_DATA_CACHE=false |          12786.58042 |         12454.80365 |
| DISABLE_DATA_CACHE=true  |          23081.10888 |         16692.31512 |
+--------------------------+----------------------+---------------------+

Testing:
- Pass core tests.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 495 insertions(+), 216 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/14
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#16) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Currently, there are corner cases where planner might underestimate the
number of async IO stream for a table. A case like "select count(*)"
over complex type column might have empty desc._getSlots() in
HdfsScanNode.computeMinColumnMemReservations, but
HdfsOrcScanner::StartColumnReading later see couple streams that are
eligible for async IO. In this situation, HdfsOrcScanner will try to
increase reservation 8KB (min_buffer_size) for each eligible stream.
Once the reservation increment fails, it will read the rest of the
streams synchronously. This workaround most likely can be removed once
ORC-450 resolved.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 513 insertions(+), 220 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/16
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15370

to look at the new patch set (#5).

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
11 files changed, 395 insertions(+), 181 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/5
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9411/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 14:41:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 2:

Note that this was a quite hacky implementation - the problem is that when the ORC lib reads from the file, it only gives us an offset and length and we do not know which column (or stream) does it try to read. So we build a map of ranges beforehand (HdfsOrcScanner::StartColumnReading), and try to guess which range to advance during every individual read call and fall back to sync-IO if the read is not what we expected (HdfsOrcScanner::ScanRangeInputStream::read)

This seems to work, but changes in ORC lib can easily lead "disabling" async scanning by reading in unexpected patterns. The best would be to move most of the logic to ORC, so that it would return the ranges to us and identify the given range in every read call.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 09 Aug 2021 14:44:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 21:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@165
PS21, Line 165:       return false;
> We might exclude the index streams here because we're not sure about their 
After confirming with Csaba, the real reason is that we expect that indexes will be read in one batch, so async reading does not help much. They also not too large.
Will add comment to clarify this.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 Jan 2022 18:49:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 28:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10106/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 28
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 06 Feb 2022 00:36:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10044/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Jan 2022 06:02:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 26:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.h@393
PS25, Line 393: is within the remaining
> clarification idea: "within the remaining part of the footer stream"?
Done


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@109
PS25, Line 109:   Status status;
> The same logic is also implemented here:
Done


http://gerrit.cloudera.org:8080/#/c/15370/25/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/15370/25/common/thrift/Query.thrift@572
PS25, Line 572:   141: optional bool orc_async_read = true;
> I disagree with turning off async read by default, as this means that it is
Flipped this to true.


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2106
PS25, Line 2106: hould compute total reservation
> Should we check the orc_async_read flag here? I think this causes the estim
Done


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2143
PS25, Line 2143: 
               :    *
               :    * If there are nested co
> nit: Could you update this?
Done. Add more clarifications with examples.


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2161
PS25, Line 2161: ble 
> nit: long
Done


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2181
PS25, Line 2181: 
> nit: long
Done


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2199
PS25, Line 2199: 
               :         } else {
> typo, I mean "not doing it in async manner".
Patch set 26 retain this, but change the comment for better clarity.


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2274
PS25, Line 2274:  scan range. Returns t
> not used?
Done


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2287
PS25, Line 2287:     Preconditions.checkNotNull(column);
               :     long reservationBytes = DEFAULT_COLUMN_SCAN_RANGE_RESERVATION;
               :     ColumnStats stats = column.getStats();
               :     if (stats.hasAvgSize() && maxScanRangeNumRows_ != -1) {
               :       // Estimate the column's uncompressed data size based on row count and average
               :       // size.
               :       // TODO: Size of strings seems to be underestimated, as avg size returns the
               :       //       average length of the strings and does not include the 4 byte length
               :       //       field used in Parquet plain encoding. (IMPALA-8431)
               :       reservationBytes =
> We probably need to review this for ORC. We can create a follow up JIRA if 
Filed IMPALA-11104 for this.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Feb 2022 22:13:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 26:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10088/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Feb 2022 22:21:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 27:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: stitute("HdfsOrc
> Thanks for the explanation!
Agree. Filed IMPALA-11107 for this.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 27
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 03 Feb 2022 19:13:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 21:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@252
PS21, Line 252:   if (offset < current_position_) {
              :     DCHECK(false);
              :     string msg = Substitute(
              :         "ORC read request to already read range. offset: $0 length: $1 pos: $2 $3",
              :         offset, length, current_position_, debug());
              :     return Status(msg);
              :   }
As we recently figure out, there is a valid case where ORC lib decide to seek backward and hit our DCHECK here.
I'm thinking to fallback to readRandom here rather than failing it.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Jan 2022 01:58:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7464/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 22:04:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#10) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
12 files changed, 494 insertions(+), 195 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/10
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2179
PS16, Line 2179:     if (orcAsyncRead) columnByteSizes.add(DEFAULT_COLUMN_SCAN_RANGE_RESERVATION);
I think we can do better on counting min mem reservation here and other places for orcAsyncRead.
DEFAULT_COLUMN_SCAN_RANGE_RESERVATION is 4MB, and by keep appending this much per stream, we potentially going to over provision memory by a lot. Instead, we can split this 4MB allocation to all streams that belong to the column.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Dec 2021 17:56:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 16:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9884/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Dec 2021 20:49:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#15) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Currently, there are corner cases where planner might underestimate the
number of async IO stream for a table. A case like "select count(*)"
over complex type column might have empty desc._getSlots() in
HdfsScanNode.computeMinColumnMemReservations, but
HdfsOrcScanner::StartColumnReading later see couple streams that are
eligible for async IO. In this situation, HdfsOrcScanner will try to
increase reservation 8KB (min_buffer_size) for each eligible stream.
Once the reservation increment fails, it will read the rest of the
streams synchronously. This workaround most likely can be removed once
ORC-450 resolved.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 508 insertions(+), 220 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/15
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 5:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9423/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 06 Sep 2021 13:09:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 20:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc@202
PS20, Line 202:       string msg = Substitute(
              :           "Overlapping ORC column ranges. Last end: $0 Current offset:", last_end,
              :           range.offset_);
Missing $1 in the string template.


http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc@246
PS20, Line 246:     string msg = Substitute(
              :         "ORC read request out of range. offset: $0 length: $1", offset, length);
              :     return Status(msg);
Need to add filename here and some other site to help debugging if such error indeed happens.


http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc@951
PS20, Line 951:         VLOG_QUERY << "Encounter parse error: " << e.what();
              :         parse_status_ = Status(Substitute("Encounter parse error: $0.", e.what()));
Add filename information in this error logging.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Jan 2022 20:18:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#28) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-index.cc
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/request-ranges.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
19 files changed, 720 insertions(+), 279 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/28
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 28
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 28: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7812/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 28
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Feb 2022 08:00:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 29:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7815/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 29
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Feb 2022 17:35:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 28:

(1 comment)

Thank you Quanlong and Csaba!
Patch set 28 is a rebase over recent asf-master branch. Please reapply +1/+2.

http://gerrit.cloudera.org:8080/#/c/15370/27/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/27/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2077
PS27, Line 2077: wer bound is one minimu
> nit: could you keep the original indention?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 28
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 06 Feb 2022 00:15:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 24:

Patch set 24 is a rebase.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jan 2022 02:17:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(5 comments)

Thanks a lot for continuing the work on this!

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h@127
PS19, Line 127: lib.
> Done
My memories are a bit hazy, but using anything but 0 caused some issues, probably in performance. 0 means that the ORC lib should decide the size, and using anything else can lead to extra copying of memory. Should have added a comment about this :(

The value returned will be used here:
https://github.com/apache/orc/blob/f381f2dd78e6b305b8a4e0b07124034de72ef1a1/c%2B%2B/src/StripeStream.cc#L97
and will be passed here:
https://github.com/apache/orc/blob/d1755250c4a9841a332093bf8ef03107776eecca/c%2B%2B/src/Compression.cc#L401
and here:
https://github.com/apache/orc/blob/d1755250c4a9841a332093bf8ef03107776eecca/c%2B%2B/src/Compression.cc#L758

So the effect may actually depend on the type of compression used, namely is it blocked or not.

I would prefer to keep it 0 unless we have the capacity to benchmark its effect with different compressions.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@109
PS25, Line 109:   return split_range->expected_local() && (offset >= split_range->offset())
The same logic is also implemented here:
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/parquet/parquet-page-reader.cc#L89

An idea to merge the two is to move this function to ScanRange as a member function.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
I don't get the logic of this function - do we assume that the footer range (first 100KB of the file AFAIK https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-scanner.h#L386)
will be in memory? But in that case why do we need the streaming logic, shouldn't it be random reads to a buffer?


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1377
PS25, Line 1377:   if (offset > stream_->file_offset()) {
Are we prepared for jumping backward instead of forward? Or this is impossible during footer processing?


http://gerrit.cloudera.org:8080/#/c/15370/25/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/15370/25/common/thrift/Query.thrift@572
PS25, Line 572:   141: optional bool orc_async_read = false;
I disagree with turning off async read by default, as this means that it is totally untested at the moment.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 11:10:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#11) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
13 files changed, 517 insertions(+), 201 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/11
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 15:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc@821
PS13, Line 821:   DCHECK_LE(offset + len, GetFileDesc(metadata->partition_id, file)->file_length)
              :       << "Scan range beyond end of file (offset=" << offset << ", len=" << len << ")";
> This seems to be specific only for ORC. Therefore, I decide to restore this
Done


http://gerrit.cloudera.org:8080/#/c/15370/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2130
PS14, Line 2130: 
> It is probably safe to set hasOrc=false here if ORC_ASYNC_READ=false.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 04 Dec 2021 03:28:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#19) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 523 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/19
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h@127
PS19, Line 127: 
> I see no reason why set natural read size to 0 here. This must be a remnant
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Dec 2021 19:23:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7461/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 12:42:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 7:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@153
PS7, Line 153:   //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << "reserved: " << min_buffer_size * col_range_lengths.size();
line too long (138 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@158
PS7, Line 158: 	  LOG(INFO) << "col_range_lengths: " << col_range_lengths[i];
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@160
PS7, Line 160:  
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@172
PS7, Line 172: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@172
PS7, Line 172: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << 
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@203
PS7, Line 203: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << " bytes to add " << bytes_to_add;
line too long (115 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@203
PS7, Line 203: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << " bytes to add " << bytes_to_add;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-columnar-scanner.cc@209
PS7, Line 209: 	  LOG(INFO) << "column reservation: " << tmp_reservation.second;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@112
PS7, Line 112: 	  //LOG(INFO) << "Read random from orc. offset: " << offset << " length: " << length;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@123
PS7, Line 123: 	  //LOG(INFO) << "Read async orc. offset: " << offset << " length: " << length;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@191
PS7, Line 191:         unique_ptr<orc::StreamInformation> stream = stripe.getStreamInformation(stream_id);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@262
PS7, Line 262: 	DCHECK(false);
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@273
PS7, Line 273: 		  return status;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@274
PS7, Line 274: 	  }
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@275
PS7, Line 275: 	  //LOG(INFO) << "HdfsOrcScanner::ColumnRange::read skipping: " << (offset - position_);
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/7/be/src/exec/hdfs-orc-scanner.cc@288
PS7, Line 288: 	//LOG(INFO) << "HdfsOrcScanner::ColumnRange::read stream finished: ";
tab used for whitespace



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 12:40:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Reviewed-on: http://gerrit.cloudera.org:8080/15370
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-index.cc
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/request-ranges.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
19 files changed, 720 insertions(+), 279 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 32
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 30:

looked a bit into
https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/5231/
but I couldn't find why all these testse failed. Restarting the tests


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 30
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 Feb 2022 14:44:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 30: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7817/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 30
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 Feb 2022 12:17:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#26) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-index.cc
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/request-ranges.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
19 files changed, 723 insertions(+), 281 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/26
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 27:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10090/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 27
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 03 Feb 2022 02:18:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Kurt Deschler (Code Review)" <ge...@cloudera.org>.
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 15:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.h@87
PS15, Line 87:       vector<std::pair<int, int64_t>>& reservation_per_column);
Add typedef/using for template types here and below.


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.cc@151
PS15, Line 151:         return left_len != right_len ? left_len > right_len : left.first < right.first;
Parenthesize conditions for readability


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.h@95
PS15, Line 95:     uint64_t offset_;
Use signed types per style guide?


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.cc@111
PS15, Line 111:   if (columnRange == NULL) {
nullptr


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.cc@309
PS15, Line 309:   for (ColumnRange& it : columnRanges_) {
Can lower_bound be used here?


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/parquet/hdfs-parquet-scanner.cc@2881
PS15, Line 2881:   vector<pair<int, int64_t>> reservation_per_column;
Add typedef/using


http://gerrit.cloudera.org:8080/#/c/15370/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@385
PS15, Line 385:   private boolean hasOrc(Set<HdfsFileFormat> fileFormats) {
declare const



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Dec 2021 15:12:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h@127
PS19, Line 127: return 0;
I see no reason why set natural read size to 0 here. This must be a remnant from when we're debugging the stream position issue. Should restore it back to disk_io_mgr()->max_buffer_size().



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Dec 2021 18:29:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 12:

Patch set 12 has the following changes:
1. Remove several debugging logs.
2. Adjust resource-requirements.test (from PlannerTest.testResourceRequirements).
3. Add workaround to increase memory reservation for certain select count cases.

Following queries failed without workaround from point 3:
select count(*) from complextypes_partitioned.int_array;
select count(*) from complextypes_partitioned.nested_struct.c.d.item inner_array;


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Nov 2021 21:54:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 19:

(1 comment)

Patch set 19 add typedef ColumnRangeLengths.

http://gerrit.cloudera.org:8080/#/c/15370/18/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/18/be/src/exec/hdfs-columnar-scanner.h@89
PS18, Line 89:   /// current columns being scanned. Sets the reservation on each corresponding reader
> Add typedef for ColumnRangeLengths and use consistently in signatures.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Dec 2021 18:34:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 22:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@484
PS21, Line 484: 
> I believe it is unique_ptr here because orc::createReader first parameter a
Ah, yeah, that's a problem. Then we need to make sure 'input_stream_' won't be used if the corresponding 'reader_' is destroyed. I don't see any other usages of 'input_stream_' outside this method. Can we just make it a local variable?


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@114
PS22, Line 114: VLOG_FILE
Could you change this to warning or use VLOG_QUERY? Otherwise it's hard to notice it in practise.


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@128
PS22, Line 128:     string msg = Substitute("Invalid read len on ORC file $0.", filename_);
Can we report the offset and length here?


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@137
PS22, Line 137:   // Set expected_local to false to avoid cache on stale data (IMPALA-6830)
              :   bool expected_local = false;
Not related to this patch, but I think we need to revisit this as IMPALA-6830 is not an issue.


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@238
PS22, Line 238: range.length_
Shouldn't this be "range.offset_ + range.length_"?


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@261
PS22, Line 261:   if (offset < current_position_) {
              :     DCHECK(false);
              :     string msg = Substitute(
              :         "ORC read request to already read range. offset: $0 length: $1 pos: $2 $3",
              :         offset, length, current_position_, debug());
              :     return Status(msg);
              :   }
I think we can change this to DCHECK(offset >= current_position_) now, since we've moved the check to line 113.


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@287
PS22, Line 287: // TODO: extend Orc interface to avoid the copy
Could you create a JIRA for this?



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jan 2022 03:06:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#24) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 551 insertions(+), 229 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/24
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#21) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 543 insertions(+), 229 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/21
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 6: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7455/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 07 Sep 2021 17:00:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7454/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Mon, 06 Sep 2021 12:47:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 22:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@165
PS21, Line 165:     case orc::StreamKind_DICTIONARY_COUNT:
> After confirming with Csaba, the real reason is that we expect that indexes
Done


http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@252
PS21, Line 252: }
              : 
              : Status HdfsOrcScanner::ColumnRange::read(void* buf, uint64_t length, uint64_t offset) {
              :   if (offset + length > offset_ + length_) {
              :     string msg = Substitute("ORC read request out of range. offset: $0 length: $1 $2",
              :         offset, length, debug());
              :    
> As we recently figure out, there is a valid case where ORC lib decide to se
Added one more case in HdfsOrcScanner::ScanRangeInputStream::read.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Jan 2022 03:19:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 21:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@165
PS21, Line 165:       return false;
> Why we exclude these index streams? They will be read when we add SearchArg
We might exclude the index streams here because we're not sure about their read pattern.
For ColumnRange standpoint, we expect that streams eligible for async io is only read forward (never seek backward).
I will double check about this with Csaba.


http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@484
PS21, Line 484:     unique_ptr<orc::InputStream> input_stream(input_stream_);
> Should we use shared_ptr instead?
I believe it is unique_ptr here because orc::createReader first parameter ask for unique_ptr<InputStream> type.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 Jan 2022 18:36:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 19:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9906/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Dec 2021 18:53:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 30: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 30
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 Feb 2022 05:46:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#25) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-index.cc
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
18 files changed, 698 insertions(+), 265 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/25
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h@127
PS19, Line 127: lib.
> My memories are a bit hazy, but using anything but 0 caused some issues, pr
Ah, thanks Csaba! I agree with you after reading the code. We should let ORC lib to decide the appropriate size.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
> I don't get the logic of this function - do we assume that the footer range
Each subclass of HdfsScanner will have their initial stream_ assigned in HdfsScanner::Open()
https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-scanner.cc#L78

This initial stream_ came from IssueInitialScanRange, which in the case of HdfsOrcScanner came from IssueFooterRanges here:
https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-orc-scanner.cc#L55

This implies that the footer range (the last 100KB, not the first) is already loaded in memory, within the initial steam_.
ORC lib will try to read the file footer first to figure out the file layout. So I add logic in HdfsOrcScanner::ScanRangeInputStream::read() to check if the requested [offset,offset+length) IsInFooterRange or not.
If true, serve the read from stream_.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1377
PS25, Line 1377:   if (offset > stream_->file_offset()) {
> Are we prepared for jumping backward instead of forward? Or this is impossi
We found out there is a corner case where ORC lib might want to seek backward a bit (it is a bit complicated case of overlapping range during decompression).
We're looking to improve this in OrcLib by eliminating the "seek backward a bit" piece. But in the meantime, we will always fallback to readRandom whenever the range ORC lib ask is behind the stream's current position.
Similarly for this footer stream, IsInFooterRange will return false if the asked range is behind the current stream_->file_offset(), and it will fallback to readRandom.
https://gerrit.cloudera.org/c/15370/25/be/src/exec/hdfs-orc-scanner.h#397



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 19:00:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(8 comments)

> Patch Set 25:
> 
> (1 comment)
> 
> Patch set 25 add 7 counters and move 3 common counters between HdfsOrcScanner and HdfsParquetScanner into HdfsColumnarScanner.
> 
> This patch set also add async io if requested range falls in the footer stream range (initial scan range issued by IssueInitialRanges).

Thanks for adding the optimization on footer scan range! I think it really boost the performance on small files, e.g. tpcds_orc_def.store_sales has many files smaller than 100KB.

Finish reviewing the reservation part. I can give my +1 after the comments are resolved.

http://gerrit.cloudera.org:8080/#/c/15370/25//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15370/25//COMMIT_MSG@24
PS25, Line 24: relies on the backend to divide them as
             : needed.
I think we can do this in the orc-scanner as well. There are some APIs like orc::Reader::getMemoryUseByTypeId which can be used after the scanner parses the footer:
https://github.com/apache/orc/blob/rel/release-1.7.0/c%2B%2B/include/orc/Reader.hh#L503

Could you create a follow-up JIRA for this?


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2106
PS25, Line 2106: || format == HdfsFileFormat.ORC
Should we check the orc_async_read flag here? I think this causes the estimated reservation changes in resource-requirements.test


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2143
PS25, Line 2143: We may need to adjust this
               :    * logic for nested types in non-shredded columnar formats (e.g. IMPALA-6503 - ORC)
               :    * if/when that is added.
nit: Could you update this?


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2161
PS25, Line 2161: Long
nit: long


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2181
PS25, Line 2181: Long
nit: long


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2199
PS25, Line 2199: the current ORC scanner does not have the select
               :         // count(*) optimization yet like in Parquet.
Isn't this the optimization for count(*)? https://github.com/apache/impala/blob/c127b6b1a72d83778378dc181d9328b4cb7dea9e/be/src/exec/hdfs-orc-scanner.cc#L610-L630


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2274
PS25, Line 2274: , boolean orcAsyncRead
not used?


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2287
PS25, Line 2287:         // Estimate the data size with dictionary compression, assuming all distinct
               :         // values occur in the scan range with the largest number of rows and that each
               :         // value can be represented with approximately log2(ndv) bits. Even if Parquet
               :         // dictionary compression does not kick in, general-purpose compression
               :         // algorithms like Snappy can often find redundancy when there are repeated
               :         // values.
               :         long dictBytes = (long)(stats.getAvgSize() * stats.getNumDistinctValues());
               :         long bitsPerVal = BitUtil.log2Ceiling(stats.getNumDistinctValues());
               :         long encodedDataBytes = bitsPerVal * maxScanRangeNumRows_ / 8;
               :         reservationBytes = Math.min(reservationBytes, dictBytes + encodedDataBytes);
We probably need to review this for ORC. We can create a follow up JIRA if this estimation is still ok (i.e. if it doesn't break anything).



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sun, 30 Jan 2022 08:39:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15370/25//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15370/25//COMMIT_MSG@24
PS25, Line 24: relies on the backend to divide them as
             : needed.
> I think we can do this in the orc-scanner as well. There are some APIs like
Filed IMPALA-11099 for this.


http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2199
PS25, Line 2199: the current ORC scanner does not have the select
               :         // count(*) optimization yet like in Parquet.
> Isn't this the optimization for count(*)? https://github.com/apache/impala/
Select count(*) over nested column still need to read an ORC column, for example:

select count(*) from complextypes_partitioned.int_array

For this kind of query, that code region will not be evaluated since IsZeroSlotTableScan() == false (materialized_slots is empty, but tuple_desc()->tuple_path() is not empty).
Therefore, we still need to allocate memory to read column int_array in this example.

I check for parquet and it's select count optimization is not turned on in this example. returned columnByteSizes is also empty. I suppose it still read the column, but not doing it in sync manner. We can do it too if we want, skip async io if materialized_slots is empty.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 01 Feb 2022 04:40:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 28:

> Patch Set 28: Verified-1
> 
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7812/

Hit OOM in ubuntu-16.04-dockerised-tests at TestParquetBloomFilter.test_fallback_from_dict_if_no_bloom_tbl_props
https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/5225/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 28
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Feb 2022 17:22:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 28: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 28
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Feb 2022 01:14:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9794/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Nov 2021 16:44:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9804/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Nov 2021 17:22:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7462/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 18:50:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7455/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 07 Sep 2021 11:27:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9441/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 13:03:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 23:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10028/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jan 2022 17:56:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#23) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 552 insertions(+), 231 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/23
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#22) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 552 insertions(+), 229 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/22
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
> It is still not 100% clear to me - so until now we were reading the last 10
You are correct. Until now, we were reading the last 100KB, but didn't actually use it to read metadata. Patch set 25 reuse this by serving the range through HdfsOrcScanner::ReadFooterStream() when possible.

You are also correct that ORC lib might move backward very early after reading the postscript to read the rest of footer region. The way ORC lib behave is described here:
https://github.com/apache/orc/blob/89af2cb/c%2B%2B/src/Reader.cc#L1352-L1376

ORC lib guess that footer+postscript should all contained within the last 16KB (DIRECTORY_SIZE_GUESS) of file, which it read in single stream read at line 1353.
If we're lucky that the 16KB tail already contain the footer as well, ORC lib will not issue the second stream read at line 1369 (which so far is true for small ORC files in tpcds_orc_def).
ReadFooterStream will not be used anymore after this 16KB read, since stream_ is deemed fully exhausted.
If we're unlucky that footer is not contained or only partially contained within that 16KB, the next footer read will need to be served by readRandom.

I suppose 16KB tail is a good guess for most ORC files. If not, then we should consider increasing DIRECTORY_SIZE_GUESS in ORC lib. On the other hand, we can also make initial scan range more efficient by lowering the FOOTER_SIZE from 100KB to 16KB for ORC.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 23:31:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 24:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10037/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jan 2022 02:39:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#18) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 521 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/18
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9865/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Dec 2021 22:37:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 30:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7817/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 30
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 Feb 2022 05:46:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 31: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 31
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 Feb 2022 14:44:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 29: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 29
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Feb 2022 17:35:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 21:

(3 comments)

Patch set 21 add some more logging informations.

http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc@202
PS20, Line 202:     if (last_end > range.offset_) {
              :       string msg =
              :           Substitute("Ove
> Missing $1 in the string template.
Done


http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc@246
PS20, Line 246:   if (offset + length > offset_ + length_) {
              :     string msg = Substitute("ORC read request out of range. offset: $0 length: $1 $2",
              :         offset, length,
> Need to add filename here and some other site to help debugging if such err
Done


http://gerrit.cloudera.org:8080/#/c/15370/20/be/src/exec/hdfs-orc-scanner.cc@951
PS20, Line 951:         return parse_status_;
              :       } catch (std::exception& e) {
> Add filename information in this error logging.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Jan 2022 19:40:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 28:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7812/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 28
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 07 Feb 2022 01:15:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 27: Code-Review+2

(1 comment)

Thank Csaba and Riza for working on this! The patch LGTM. Could you please rebase it?

http://gerrit.cloudera.org:8080/#/c/15370/27/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/27/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2077
PS27, Line 2077:                        
nit: could you keep the original indention?



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 27
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 05 Feb 2022 01:41:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 29: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7815/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 29
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 Feb 2022 00:23:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 18:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9903/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 09 Dec 2021 22:40:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 17:

(1 comment)

Patch set 17 remove the reservation increment hack in hdfs-orc-scanner.cc and tighten up per-stream memory reservation in HdfsScanNode.java.

http://gerrit.cloudera.org:8080/#/c/15370/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/16/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2179
PS16, Line 2179:           appendMinColumnMemReservationsForComplexType(
> I think we can do better on counting min mem reservation here and other pla
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Dec 2021 20:44:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 17:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9897/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Dec 2021 20:46:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 27: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: stitute("HdfsOrc
> You are correct. Until now, we were reading the last 100KB, but didn't actu
Thanks for the explanation!

I would prefer to reduce the initial range size to 16KB (it is ok to move this to another patch).

It should be easy to do this by passing a size to  HdfsScanner::IssueFooterRanges instead of using constant: https://github.com/apache/impala/blob/57982efc21746f6994c11b623fc3cdd1dbbac8a2/be/src/exec/hdfs-scanner.cc#L832

We don't just read something and never use it, but also waste the data cache:
https://github.com/apache/impala/blob/master/be/src/runtime/io/data-cache.h#L73



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 27
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 03 Feb 2022 08:22:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.h@393
PS25, Line 393: is within footer stream
clarification idea: "within the remaining part of the footer stream"?


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
> Each subclass of HdfsScanner will have their initial stream_ assigned in Hd
It is still not 100% clear to me - so until now we were reading the last 100KB, but didn't actually use it to read metadata, and now we will reuse it, as long as the ORC lib reads the metadata from the start of the footer to the end?

What bugs me is that the ORC lib should start reading from the Postscript, which is at the very end of the file, and the rest of the metadata will be only read afterwards, so we will move backward very early. 
See https://orc.apache.org/specification/ORCv2/

I may misunderstand something, but my impression is that we could only avoid reading these things again by copying the whole initial range to memory and read from there if the offset + length falls in that range.

In Parquet this is handled by copying the whole initial range to a buffer:
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1655

then read few fixed sized things from the end of the buffer, finally read the rest of the metadata as a single big Thrift struct:
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1655
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1748

(+ I just realized that Parquet also "drops" the rest of the buffer, while it could easily contain other data, e.g. column indexes)


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1377
PS25, Line 1377:   if (offset > stream_->file_offset()) {
> We found out there is a corner case where ORC lib might want to seek backwa
ah, I misunderstood IsInFooterRange()



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 21:03:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 13:

(4 comments)

Hi All,
David has ran several benchmark run. Perf number seems to improve from this async IO prototype.
I will proceed cleaning up the code and add proper commit message. Here are some that I plan to address next.

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h@123
PS13, Line 123:  // ExecEnv::GetInstance()->disk_io_mgr()->max_buffer_size();
Can be removed?


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc@300
PS13, Line 300:     // stream_->ReleaseCompletedResources(true);
              :     stream_->ReleaseCompletedResources(false);
Calling 'ReleaseCompletedResources(true)' seems to be OK here?


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc@821
PS13, Line 821:   // DCHECK_LE(offset + len, GetFileDesc(metadata->partition_id, file)->file_length)
              :   //    << "Scan range beyond end of file (offset=" << offset << ", len=" << len << ")";
Can be removed?


http://gerrit.cloudera.org:8080/#/c/15370/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2134
PS13, Line 2134:     for (SlotDescriptor slot: desc_.getSlots()) {
Just for our note, we found a corner case here for "select count(*)" kind of query over ORC.
Somehow, desc._getSlots() is empty in this corner case, but HdfsOrcScanner::StartColumnReading actually see couple streams that is eligible for async read.

Patch set 12 already adds a workaround within HdfsOrcScanner::StartColumnReading to TryIncreaseReservation 8KB (min_buffer_size) for each eligible stream. If it can't increase, then the rest of the stream will be read synchronously. I will file a follow up JIRA to document this situation.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Nov 2021 23:17:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15370

to look at the new patch set (#3).

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
11 files changed, 395 insertions(+), 181 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/3
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7448/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 14:30:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9425/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 07 Sep 2021 11:49:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 4: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7448/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 16:15:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 9:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@153
PS9, Line 153:   //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << "reserved: " << min_buffer_size * col_range_lengths.size();
line too long (138 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@158
PS9, Line 158: 	  LOG(INFO) << "col_range_lengths: " << col_range_lengths[i];
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@160
PS9, Line 160:  
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@172
PS9, Line 172: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@172
PS9, Line 172: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << 
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@203
PS9, Line 203: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << " bytes to add " << bytes_to_add;
line too long (115 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@203
PS9, Line 203: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << " bytes to add " << bytes_to_add;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-columnar-scanner.cc@209
PS9, Line 209: 	  LOG(INFO) << "column reservation: " << tmp_reservation.second;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@112
PS9, Line 112:     
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@113
PS9, Line 113: 	  LOG(INFO) << "Read random from orc. offset: " << offset << " length: " << length;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@124
PS9, Line 124: 	  LOG(INFO) << "Read async orc. offset: " << offset << " length: " << length;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@147
PS9, Line 147:   
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@203
PS9, Line 203:         unique_ptr<orc::StreamInformation> stream = stripe.getStreamInformation(stream_id);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@278
PS9, Line 278: 	DCHECK(false);
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@289
PS9, Line 289: 		  return status;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@290
PS9, Line 290: 	  }
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@291
PS9, Line 291: 	  //LOG(INFO) << "HdfsOrcScanner::ColumnRange::read skipping: " << (offset - position_);
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-orc-scanner.cc@304
PS9, Line 304: 	//LOG(INFO) << "HdfsOrcScanner::ColumnRange::read stream finished: ";
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/15370/9/be/src/exec/hdfs-scan-node-base.cc@823
PS9, Line 823:   if (offset + len > GetFileDesc(metadata->partition_id, file)->file_length) return nullptr;
line too long (92 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 22:04:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15370

to look at the new patch set (#8).

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
11 files changed, 483 insertions(+), 191 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/8
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15370

to look at the new patch set (#6).

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
11 files changed, 395 insertions(+), 184 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/6
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 8:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@153
PS8, Line 153:   //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << "reserved: " << min_buffer_size * col_range_lengths.size();
line too long (138 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@158
PS8, Line 158: 	  LOG(INFO) << "col_range_lengths: " << col_range_lengths[i];
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@160
PS8, Line 160:  
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@172
PS8, Line 172: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@172
PS8, Line 172: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << 
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@203
PS8, Line 203: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << " bytes to add " << bytes_to_add;
line too long (115 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@203
PS8, Line 203: 	    //LOG(INFO) << "reservation_to_distribute: " << reservation_to_distribute << " bytes to add " << bytes_to_add;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-columnar-scanner.cc@209
PS8, Line 209: 	  LOG(INFO) << "column reservation: " << tmp_reservation.second;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@112
PS8, Line 112:     
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@113
PS8, Line 113: 	  //LOG(INFO) << "Read random from orc. offset: " << offset << " length: " << length;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@124
PS8, Line 124: 	  //LOG(INFO) << "Read async orc. offset: " << offset << " length: " << length;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@145
PS8, Line 145:   
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@198
PS8, Line 198:         unique_ptr<orc::StreamInformation> stream = stripe.getStreamInformation(stream_id);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@269
PS8, Line 269: 	DCHECK(false);
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@280
PS8, Line 280: 		  return status;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@281
PS8, Line 281: 	  }
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@282
PS8, Line 282: 	  //LOG(INFO) << "HdfsOrcScanner::ColumnRange::read skipping: " << (offset - position_);
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/15370/8/be/src/exec/hdfs-orc-scanner.cc@295
PS8, Line 295: 	//LOG(INFO) << "HdfsOrcScanner::ColumnRange::read stream finished: ";
tab used for whitespace



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 18:45:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15370

to look at the new patch set (#9).

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
12 files changed, 496 insertions(+), 194 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/9
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9257/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 09 Aug 2021 14:58:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 13:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-columnar-scanner.cc@239
PS13, Line 239:   columnar_scanner_actual_reservation_counter_->UpdateCounter(context_->total_reservation());
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc@208
PS13, Line 208:     columnar_scanner_actual_reservation_counter_->UpdateCounter(context_->total_reservation());
line too long (95 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 Nov 2021 17:02:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 10:

Patch set 10 is a rebase of patch set 9 over recent master branch.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Nov 2021 16:14:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 20:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9942/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Dec 2021 19:45:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 16:

(7 comments)

Thank you, Kurt!
Patch set 16 address some of your comments.

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.h@87
PS15, Line 87:   /// current columns being scanned. Sets the reservation on each corresponding reader
> Add typedef/using for template types here and below.
Done


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-columnar-scanner.cc@151
PS15, Line 151:         int64_t right_len = col_range_lengths[right.first];
> Parenthesize conditions for readability
Done


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.h@95
PS15, Line 95:     uint64_t offset_;
> Use signed types per style guide?
We use uint64_t in this struct to match with ScanRangeInputStream and orc::InputStream from ORC lib.
I think we should keep this unsigned. Otherwise, we need to keep doing signed to unsigned type casting in HdfsOrcScanner::ColumnRange::read.


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.cc@111
PS15, Line 111:   if (columnRange == nullptr) {
> nullptr
Done


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/hdfs-orc-scanner.cc@309
PS15, Line 309:   : HdfsColumnarScanner(scan_node, state),
> Can lower_bound be used here?
I change it to use std::find_if. I also inline it at hdfs-orc-scanner.h.


http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/15/be/src/exec/parquet/hdfs-parquet-scanner.cc@2881
PS15, Line 2881:   ColumnReservations reservation_per_column;
> Add typedef/using
Done


http://gerrit.cloudera.org:8080/#/c/15370/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/15/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@385
PS15, Line 385:   private boolean hasOrc(Set<HdfsFileFormat> fileFormats) {
> declare const
This is the frontend java code. So there is no const specifier in method declaration.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Dec 2021 20:39:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 14:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-columnar-scanner.cc@239
PS13, Line 239:   columnar_scanner_actual_reservation_counter_->UpdateCounter(
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h@123
PS13, Line 123: le_desc_->file_length;
> Can be removed?
Done


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc@208
PS13, Line 208: 
> line too long (95 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc@300
PS13, Line 300:   memcpy(buf, stream_buf, length); // TODO: extend Orc interface to avoid the copy
              :   current_position_ += length;
> Calling 'ReleaseCompletedResources(true)' seems to be OK here?
Done


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc@821
PS13, Line 821:   DCHECK_LE(offset + len, GetFileDesc(metadata->partition_id, file)->file_length)
              :       << "Scan range beyond end of file (offset=" << offset << ", len=" << len << ")";
> Can be removed?
This seems to be specific only for ORC. Therefore, I decide to restore this DCHECK and add the additional check in hdfs-orc-scanner.cc.


http://gerrit.cloudera.org:8080/#/c/15370/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/14/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2130
PS14, Line 2130:   private List<Long> computeMinColumnMemReservations(boolean hasOrc) {
It is probably safe to set hasOrc=false here if ORC_ASYNC_READ=false.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 03 Dec 2021 00:01:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 23:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@252
PS21, Line 252:         col_range_local, split_range->mtime(), BufferOpts(split_range->cache_options()));
              :     RETURN_IF_ERROR(
              :         context_->AddAndStartStream(scan_range, range.io_reservation, &range.stream_));
              :   }
              :   return Status::OK();
              : }
              : 
> Added one more case in HdfsOrcScanner::ScanRangeInputStream::read.
Done


http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@484
PS21, Line 484: Status HdfsOrcScanner::ProcessFileTail() {
> Ah, yeah, that's a problem. Then we need to make sure 'input_stream_' won't
Done


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@114
PS22, Line 114: uint64_t 
> Could you change this to warning or use VLOG_QUERY? Otherwise it's hard to 
Done


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@128
PS22, Line 128:   if (!status.ok()) throw ResourceError(status);
> Can we report the offset and length here?
Done


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@137
PS22, Line 137:     return Status(msg);
              :   }
> Not related to this patch, but I think we need to revisit this as IMPALA-68
I suppose we can match it with how we expect locality in case of ColumnRange::read?


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@238
PS22, Line 238: artition_id =
> Shouldn't this be "range.offset_ + range.length_"?
Thanks for catching this! Fixed it and move the logic as inline function IsExpectedLocal.


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@261
PS22, Line 261:     string msg = Substitute("ORC read request out of range. offset: $0 length: $1 $2",
              :         offset, length, debug());
              :     return Status(msg);
              :   }
              : 
              :   DCHECK(offset >= current_position_);
              :   S
> I think we can change this to DCHECK(offset >= current_position_) now, sinc
Done


http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@287
PS22, Line 287: 
> Could you create a JIRA for this?
I found ORC-262 already describe this problem. I add this to the comment.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jan 2022 17:41:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#27) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-index.cc
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/request-ranges.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
19 files changed, 723 insertions(+), 281 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/27
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 27
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 31:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7824/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 31
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 Feb 2022 14:44:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9444/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 19:08:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Quanlong Huang, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15370

to look at the new patch set (#7).

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
11 files changed, 476 insertions(+), 191 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/7
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9446/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 22:56:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#20) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin with iterating the column streams. If a column stream is
possible for async IO, it will create ColumnRange, register
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
reading continue through HdfsOrcScanner::ColumnRange::read.

We leverage existing async IO methods from HdfsParquetScanner class for
initial memory allocations. We moved related methods such as
DivideReservationBetweenColumns and ComputeIdealReservation up to
HdfsColumnarScanner class.

Planner calculates the memory reservation differently between async
Parquet and async ORC. In async Parquet, the planner calculates the
column memory reservation and relies on the backend to divide them as
needed. In async ORC, the planner needs to split the column's memory
reservation based on the estimated number of streams for that column
type. For example, a string column with a 4MB memory estimate will need
to split that estimate into four 1MB because it might use dictionary
encoding with four streams (PRESENT, DATA, DICTIONARY_DATA, and LENGTH
stream). This splitting is required because each async IO stream needs
to start with an 8KB (min_buffer_size) initial memory reservation.

To show the improvement from ORC async IO, we contrast the total time
and geomean (in milliseconds) to run full TPC-DS 10 TB, 19 executors,
with varying ORC_ASYNC_IO and DISABLE_DATA_CACHE options as follow:

+----------------------+------------------+------------------+
| Total time           | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |          3511075 |          3484736 |
| DISABLE_DATA_CACHE=1 |          5243337 |          4370095 |
+----------------------+------------------+------------------+

+----------------------+------------------+------------------+
| Geomean              | ORC_ASYNC_READ=0 | ORC_ASYNC_READ=1 |
+----------------------+------------------+------------------+
| DISABLE_DATA_CACHE=0 |      12786.58042 |      12454.80365 |
| DISABLE_DATA_CACHE=1 |      23081.10888 |      16692.31512 |
+----------------------+------------------+------------------+

Testing:
- Pass core tests.
- Pass core e2e tests with ORC_ASYNC_READ=1.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 522 insertions(+), 224 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/20
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9793/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Nov 2021 16:34:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 11:

Patch set 11 fix stream positioning bug and resolve some e2e tests failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Nov 2021 16:24:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 31: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 31
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 Feb 2022 21:20:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9799/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Nov 2021 22:12:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has uploaded a new patch set (#12) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................

WIP IMPALA-6636: Use async IO in ORC scanner

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
14 files changed, 475 insertions(+), 213 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/12
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 15:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9875/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Sat, 04 Dec 2021 03:46:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 18:

Patch set 18 clarifies comment in HdfsScanNode.java


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 09 Dec 2021 22:20:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 23:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/22/be/src/exec/hdfs-orc-scanner.cc@137
PS22, Line 137:     return Status(msg);
              :   }
> I suppose we can match it with how we expect locality in case of ColumnRang
Yeah, let's see if any tests fail


http://gerrit.cloudera.org:8080/#/c/15370/23/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/23/be/src/exec/hdfs-orc-scanner.cc@113
PS23, Line 113: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, uint64_t length,
It'd be helpful if we have profile counters like

* #read-requests and #bytes served by async-io
* #read-requests and #bytes served by sync-io
* total #read-requests and #bytes
* #bytes skipped in ColumnRange::read



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 24 Jan 2022 11:46:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 21:

(2 comments)

Sorry to be late here.. The patch looks pretty good! Start looking into more details.

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@165
PS21, Line 165:       return false;
Why we exclude these index streams? They will be read when we add SearchArguments(i.e. pushdown predicates). Could you add some comments?


http://gerrit.cloudera.org:8080/#/c/15370/21/be/src/exec/hdfs-orc-scanner.cc@484
PS21, Line 484:     unique_ptr<orc::InputStream> input_stream(input_stream_);
Should we use shared_ptr instead?



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Mon, 17 Jan 2022 09:29:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 22:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10021/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Jan 2022 03:40:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(1 comment)

Patch set 25 add 7 counters and move 3 common counters between HdfsOrcScanner and HdfsParquetScanner into HdfsColumnarScanner.

This patch set also add async io if requested range falls in the footer stream range (initial scan range issued by IssueInitialRanges).

http://gerrit.cloudera.org:8080/#/c/15370/23/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/23/be/src/exec/hdfs-orc-scanner.cc@113
PS23, Line 113: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, uint64_t length,
> It'd be helpful if we have profile counters like
Added in patch set 25.



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Jan 2022 05:46:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP IMPALA-6636: Use async IO in ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 7: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7461/


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 09 Sep 2021 18:47:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2199
PS25, Line 2199: the current ORC scanner does not have the select
               :         // count(*) optimization yet like in Parquet.
> Select count(*) over nested column still need to read an ORC column, for ex
typo, I mean "not doing it in async manner".



-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Tue, 01 Feb 2022 04:50:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Posted by "Riza Suminto (Code Review)" <ge...@cloudera.org>.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 27:

Patch set 27 is a rebase.


-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 27
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>
Gerrit-Comment-Date: Thu, 03 Feb 2022 01:56:48 +0000
Gerrit-HasComments: No