Posted to reviews@impala.apache.org by "Riza Suminto (Code Review)" <ge...@cloudera.org> on 2021/12/02 22:15:03 UTC

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Riza Suminto has uploaded a new patch set (#14) to the change originally created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

IMPALA-6636: Use async IO in ORC scanner

This patch implements async IO in the ORC scanner. For each ORC stripe,
we begin by iterating over the column streams. If a column stream is
eligible for async IO, we create a ColumnRange for it, register a
ScannerContext::Stream for that ORC stream, and start the stream. We
modify HdfsOrcScanner::ScanRangeInputStream::read to check whether there
is a matching ColumnRange for the given offset and length. If so, the
read continues through HdfsOrcScanner::ColumnRange::read.

We leverage the existing async IO methods of the HdfsParquetScanner
class for initial memory allocation. The related methods, such as
DivideReservationBetweenColumns and ComputeIdealReservation, are moved
up to the HdfsColumnarScanner class.

Currently, there are corner cases where the planner might underestimate
the number of async IO streams for a table. A case like "select count(*)"
over a complex type column might have empty desc._getSlots() in
HdfsScanNode.computeMinColumnMemReservations, but
HdfsOrcScanner::StartColumnReading later sees a couple of streams that
are eligible for async IO. In this situation, HdfsOrcScanner tries to
increase the reservation by 8KB (min_buffer_size) for each eligible
stream. Once a reservation increase fails, it reads the rest of the
streams synchronously.
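
A minimal sketch of this fallback, under stated assumptions:
ReservationSketch, TryIncrease, and CountAsyncStreams are hypothetical
names invented for illustration and are not Impala's buffer pool API.
Only the 8KB increment per stream and the "stop at the first failure"
behaviour come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a memory reservation with a fixed upper limit.
class ReservationSketch {
 public:
  ReservationSketch(int64_t current, int64_t limit)
    : current_(current), limit_(limit) {
    assert(current >= 0 && current <= limit);
  }

  // Tries to grow the reservation by 'bytes'. Returns false and leaves the
  // reservation unchanged if the limit would be exceeded.
  bool TryIncrease(int64_t bytes) {
    if (bytes > limit_ - current_) return false;
    current_ += bytes;
    return true;
  }

  int64_t current() const { return current_; }

 private:
  int64_t current_;
  int64_t limit_;
};

// 8KB per eligible stream, as described in the text (min_buffer_size).
constexpr int64_t kMinBufferSize = 8 * 1024;

// Returns how many of the eligible streams get an async IO buffer. After the
// first failed increase, all remaining streams are left for synchronous reads.
int CountAsyncStreams(ReservationSketch* reservation, int num_eligible_streams) {
  int num_async = 0;
  for (; num_async < num_eligible_streams; ++num_async) {
    if (!reservation->TryIncrease(kMinBufferSize)) break;
  }
  return num_async;
}
```

For example, with a 20KB limit and five eligible streams, two streams
would be read asynchronously and the other three synchronously.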

To show the improvement from ORC async IO, we compare the total time
and geomean (in milliseconds) of running the full TPC-DS workload at
10 TB scale on 19 executors, with varying ORC_ASYNC_READ and
DISABLE_DATA_CACHE options, as follows:

+--------------------------+----------------------+---------------------+
| Total time               | ORC_ASYNC_READ=false | ORC_ASYNC_READ=true |
+--------------------------+----------------------+---------------------+
| DISABLE_DATA_CACHE=false |              3511075 |             3484736 |
| DISABLE_DATA_CACHE=true  |              5243337 |             4370095 |
+--------------------------+----------------------+---------------------+

+--------------------------+----------------------+---------------------+
| Geomean                  | ORC_ASYNC_READ=false | ORC_ASYNC_READ=true |
+--------------------------+----------------------+---------------------+
| DISABLE_DATA_CACHE=false |          12786.58042 |         12454.80365 |
| DISABLE_DATA_CACHE=true  |          23081.10888 |         16692.31512 |
+--------------------------+----------------------+---------------------+
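
To read the tables as relative improvements, the reduction in each row
can be computed as (baseline - async) / baseline. The helper below is
only a convenience sketch for that arithmetic; ReductionPercent is a
hypothetical name and is not part of the patch.

```cpp
#include <cassert>

// Percentage by which 'with_async' is lower than 'baseline'.
// Both arguments are timings in the same unit (milliseconds here).
double ReductionPercent(double baseline, double with_async) {
  assert(baseline > 0.0);
  return (baseline - with_async) / baseline * 100.0;
}
```

Applied to the figures above, this gives a reduction in total time of
about 0.8% with the data cache enabled and about 16.7% with it disabled,
and a reduction in geomean of about 2.6% and 27.7% respectively.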

Testing:
- Pass core tests.

Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/io/disk-io-mgr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-query/queries/QueryTest/scanner-reservation.test
17 files changed, 495 insertions(+), 216 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/70/15370/14
-- 
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Riza Suminto <ri...@cloudera.com>