Posted to reviews@impala.apache.org by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org> on 2020/06/01 18:21:24 UTC

[Impala-ASF-CR] IMPALA-9655: Dynamic intra-node load balancing for HDFS scans

Hello Tim Armstrong, Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15926

to look at the new patch set (#9).

Change subject: IMPALA-9655: Dynamic intra-node load balancing for HDFS scans
......................................................................

IMPALA-9655: Dynamic intra-node load balancing for HDFS scans

This patch reduces intra-node execution skew for multithreaded
HDFS scans by implementing a queue of scan ranges that is shared by
all instances. Some points to highlight:
- The scan node's PlanNode goes through all the TScanRanges
  assigned to all instances and aggregates them into file descriptors.
- These files are statically and evenly distributed among the
  instances.
- Each instance then picks up its set of files and issues
  initial ranges by adding them to a shared queue.
- Instances then fetch their next range to read from this
  shared queue.
- Other relevant data structures that will also be shared are:
 * remaining_scan_range_submissions_
 * partition_template_tuple_map_
 * per_file_metadata_
- Removed the longest-processing-time (LPT) algorithm in the scheduler,
  which tried to distribute scan load among instances on a host. It
  would have no effect after this patch because the ranges are now shared.
- Added lifecycle events that were missing from MT scan nodes.
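
The flow in the points above can be sketched roughly as follows. This is
an illustrative Python model, not the C++ code in the patch: the names
SharedRangeQueue, distribute_files, issue_initial_ranges, fetch_ranges and
scan are hypothetical, round-robin is an assumed distribution policy, and
for simplicity all initial ranges are issued before any instance starts
fetching.

```python
import threading
from collections import deque


class SharedRangeQueue:
    """Hypothetical thread-safe queue of scan ranges shared by all instances."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ranges = deque()

    def add(self, scan_range):
        with self._lock:
            self._ranges.append(scan_range)

    def get_next(self):
        """Return the next range, or None when the queue is empty."""
        with self._lock:
            return self._ranges.popleft() if self._ranges else None


def distribute_files(files, num_instances):
    """Statically assign files to instances (round-robin is an assumption)."""
    assignment = [[] for _ in range(num_instances)]
    for i, f in enumerate(files):
        assignment[i % num_instances].append(f)
    return assignment


def issue_initial_ranges(my_files, queue):
    """Each instance adds one initial range per file it was assigned."""
    for f in my_files:
        queue.add((f, "initial"))


def fetch_ranges(queue):
    """An instance keeps pulling ranges until the shared queue is drained."""
    processed = []
    while True:
        scan_range = queue.get_next()
        if scan_range is None:
            return processed
        processed.append(scan_range)


def scan(files, num_instances):
    """Run the issue phase, then let one thread per instance fetch ranges."""
    queue = SharedRangeQueue()
    for my_files in distribute_files(files, num_instances):
        issue_initial_ranges(my_files, queue)
    results = [[] for _ in range(num_instances)]

    def worker(idx):
        results[idx] = fetch_ranges(queue)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every instance pulls from the same queue, an instance that finishes
its work early simply picks up ranges that a slower instance would
otherwise have been stuck with.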

This approach guarantees the following:
- All shared data structures are allocated at the FragmentState
  level, so they remain valid until the query is closed.
- All instance-local buffers for reading a scan range are
  allocated after the range has been taken off the shared queue.
- Since the scheduler can assign ranges from the same file to
  different instances, aggregating them into file descriptors early
  ensures that each file's initial footer or header range (depending
  on file format) is issued only once.
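
As a rough illustration of the last guarantee, the sketch below groups
per-instance range assignments by file before any initial range is
issued. It is a simplified Python model under assumed data shapes (tuples
of instance id, file name, offset, length); aggregate_by_file and
initial_ranges are hypothetical names, not functions from the patch.

```python
from collections import defaultdict


def aggregate_by_file(assigned_ranges):
    """Group (instance_id, filename, offset, length) tuples by file.

    The instance id is deliberately dropped: after aggregation a file is
    described once, no matter how many instances were assigned its splits.
    """
    files = defaultdict(list)
    for _instance_id, filename, offset, length in assigned_ranges:
        files[filename].append((offset, length))
    return {name: sorted(splits) for name, splits in files.items()}


def initial_ranges(file_descs):
    """Produce exactly one initial (footer or header) range per file."""
    return [(name, "initial") for name in sorted(file_descs)]
```

For example, if splits of the same file are assigned to instances 0 and 1,
the aggregated result contains a single entry for that file, so only one
initial range is produced for it.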

Limitation:
- For HDFS MT scans, we lose the optimization within
  ReaderContext that ensures cached ranges are read first.
- As a compromise, ranges that are marked to use the HDFS
  cache are added to the front of the shared queue.
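
The compromise can be pictured with a double-ended queue. This is a
minimal Python sketch, assuming a boolean use_hdfs_cache flag per range;
the enqueue helper is hypothetical and locking is omitted for brevity.

```python
from collections import deque


def enqueue(queue, scan_range, use_hdfs_cache):
    """Add a range to the shared queue; cached ranges go to the front."""
    if use_hdfs_cache:
        # Cached ranges jump the line so they are likely to be read early.
        queue.appendleft(scan_range)
    else:
        queue.append(scan_range)
```

Note that this only biases the order: unlike the ReaderContext
optimization, a range already dequeued by another instance is not
reordered.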

Testing:
- Added a regression test in test_mt_dop which would almost always
  fail without this patch.
- Added some test coverage for MT scans where Parquet files are split
  across blocks and for scanning tables with mixed file formats.
- Ran core tests on an ASAN build.
- Ran exhaustive tests.

Performance:

No significant performance change.
Ran TPCH (scale 30) with mt_dop set to 4 on a 3-node minicluster on my
desktop.

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 5.85    | +0.35%     | 4.35       | +0.60%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(30) | TPCH-Q15 | parquet / none / none | 4.46   | 4.30        |   +3.63%   |   1.64%   |   2.23%        | 30    |   +3.56%       | 5.28    | 7.09  |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 9.51   | 9.23        |   +3.12%   |   1.19%   |   0.79%        | 30    |   +2.84%       | 6.48    | 11.69 |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 2.37   | 2.32        |   +1.89%   |   3.76%   |   3.26%        | 30    |   +2.16%       | 1.97    | 2.06  |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 16.84  | 16.55       |   +1.77%   |   1.24%   |   0.57%        | 30    |   +1.62%       | 5.75    | 6.99  |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.65   | 3.60        |   +1.43%   |   2.19%   |   1.72%        | 30    |   +1.39%       | 2.53    | 2.79  |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 12.64  | 12.48       |   +1.33%   |   1.51%   |   1.53%        | 30    |   +1.28%       | 2.67    | 3.36  |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.94   | 2.91        |   +1.03%   |   1.91%   |   1.77%        | 30    |   +1.48%       | 1.98    | 2.16  |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.31   | 1.29        |   +1.44%   |   5.48%   |   3.66%        | 30    |   +0.43%       | 1.10    | 1.18  |
| TPCH(30) | TPCH-Q7  | parquet / none / none | 5.49   | 5.44        |   +0.78%   |   0.59%   |   1.16%        | 30    |   +0.95%       | 3.49    | 3.27  |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 5.40   | 5.35        |   +0.83%   |   0.95%   |   0.95%        | 30    |   +0.83%       | 2.81    | 3.37  |
| TPCH(30) | TPCH-Q20 | parquet / none / none | 2.82   | 2.81        |   +0.50%   |   1.58%   |   1.72%        | 30    |   +0.30%       | 1.22    | 1.16  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.43   | 1.42        |   +0.64%   |   2.42%   |   1.96%        | 30    |   +0.09%       | 0.68    | 1.13  |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 2.59   | 2.58        |   +0.28%   |   1.54%   |   1.99%        | 30    |   +0.23%       | 1.17    | 0.60  |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 2.17   | 2.17        |   +0.15%   |   3.04%   |   3.68%        | 30    |   +0.15%       | 0.30    | 0.18  |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.02   | 2.02        |   +0.21%   |   2.73%   |   2.85%        | 30    |   +0.07%       | 0.26    | 0.30  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 4.16   | 4.15        |   +0.04%   |   4.49%   |   4.38%        | 30    |   -0.03%       | -0.13   | 0.03  |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 4.75   | 4.76        |   -0.28%   |   1.34%   |   1.47%        | 30    |   -0.10%       | -0.57   | -0.77 |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 4.92   | 4.95        |   -0.52%   |   1.45%   |   2.71%        | 30    |   +0.03%       | 0.12    | -0.94 |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.82   | 2.85        |   -0.78%   |   1.50%   |   1.68%        | 30    |   -0.65%       | -1.66   | -1.91 |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 4.49   | 4.53        |   -0.97%   |   2.26%   |   2.69%        | 30    |   -0.68%       | -0.99   | -1.52 |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 10.11  | 10.19       |   -0.85%   |   3.27%   |   3.71%        | 30    |   -0.84%       | -0.85   | -0.94 |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 21.75  | 22.28       |   -2.37%   |   3.51%   |   7.00%        | 30    |   -0.88%       | -1.35   | -1.66 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+

Change-Id: I9a101d0d98dff6e3779f85bc466e4c0bdb38094b
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scan-node-mt.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M tests/query_test/test_mt_dop.py
M tests/query_test/test_scanners.py
19 files changed, 702 insertions(+), 415 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/15926/9
-- 
To view, visit http://gerrit.cloudera.org:8080/15926
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9a101d0d98dff6e3779f85bc466e4c0bdb38094b
Gerrit-Change-Number: 15926
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>