Posted to reviews@impala.apache.org by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org> on 2020/06/01 18:21:24 UTC
[Impala-ASF-CR] IMPALA-9655: Dynamic intra-node load balancing for HDFS scans
Hello Tim Armstrong, Joe McDonnell, Csaba Ringhofer, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/15926
to look at the new patch set (#9).
Change subject: IMPALA-9655: Dynamic intra-node load balancing for HDFS scans
......................................................................
IMPALA-9655: Dynamic intra-node load balancing for HDFS scans
This patch ameliorates intra-node execution skew for multithreaded
HDFS scans by implementing a queue of scan ranges shared by all
fragment instances on a host. Some points to highlight:
- The scan node's PlanNode goes through all the TScanRanges
  assigned to all instances and aggregates them into file descriptors.
- These files are statically and evenly distributed among the
  instances.
- Each instance then picks up its set of files and issues the
  initial ranges by adding them to a shared queue.
- Instances then fetch the next range to read from this shared
  queue.
- Other relevant data structures that will also be shared are:
* remaining_scan_range_submissions_
* partition_template_tuple_map_
* per_file_metadata_
- Removed the longest-processing-time (LPT) algorithm in the
  scheduler, which tried to distribute scan load among instances on a
  host. It would have no effect after this patch since the ranges are
  now shared.
- Added missing lifecycle events from MT scan nodes.
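The shared-queue flow above can be sketched roughly as follows.
Note that ScanRange and SharedScanRangeQueue here are illustrative
stand-ins for this sketch, not Impala's actual classes:

```cpp
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Illustrative scan range; Impala's real io::ScanRange carries much more.
struct ScanRange {
  std::string file;
  int64_t offset = 0;
  int64_t len = 0;
};

// One queue shared by all fragment instances of a scan node on a host.
// Instances add their initial ranges here and then pull whatever range
// is next, so a slow instance naturally ends up doing fewer ranges.
class SharedScanRangeQueue {
 public:
  void Add(ScanRange range) {
    std::lock_guard<std::mutex> l(lock_);
    queue_.push_back(std::move(range));
  }

  // Called by each instance to fetch its next range to read; returns
  // std::nullopt once the queue is drained.
  std::optional<ScanRange> GetNext() {
    std::lock_guard<std::mutex> l(lock_);
    if (queue_.empty()) return std::nullopt;
    ScanRange r = std::move(queue_.front());
    queue_.pop_front();
    return r;
  }

 private:
  std::mutex lock_;
  std::deque<ScanRange> queue_;
};
```

Because every instance pulls from the same queue, the per-instance
work adapts dynamically instead of being fixed at schedule time.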
This approach guarantees the following:
- All shared data structures are allocated at the FragmentState
  level, ensuring they stick around until the query is closed.
- All instance-local buffers for reading a scan range are allocated
  only after the range has been taken off the shared queue.
- Since the scheduler can assign ranges from the same file to
  different instances, aggregating them into file descriptors early
  ensures that the initial footer or header range (depending on file
  format) for a file is issued only once.
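The early aggregation in the last point might look roughly like the
sketch below; TScanRange and AggregateByFile are simplified
hypothetical stand-ins, not the actual Thrift or planner types:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for the Thrift scan range assigned by the scheduler.
struct TScanRange {
  std::string path;
  int64_t offset;
  int64_t len;
};

// One descriptor per file, collecting every split of that file regardless
// of which instance the scheduler originally assigned it to.
struct FileDescriptor {
  std::string path;
  std::vector<TScanRange> splits;
};

// Walk the per-instance assignments and merge ranges by file, so the
// footer/header range for a file can later be issued exactly once.
std::map<std::string, FileDescriptor> AggregateByFile(
    const std::vector<std::vector<TScanRange>>& per_instance_ranges) {
  std::map<std::string, FileDescriptor> files;
  for (const auto& instance : per_instance_ranges) {
    for (const auto& r : instance) {
      FileDescriptor& fd = files[r.path];
      fd.path = r.path;
      fd.splits.push_back(r);
    }
  }
  return files;
}
```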
Limitation:
- For HDFS MT scans we lose the optimization within ReaderContext
  that ensures cached ranges are read first.
- As a compromise, ranges that are marked to use the HDFS cache are
  added to the front of the shared queue.
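A minimal sketch of that compromise, assuming a plain std::deque
stands in for the shared queue (Range and Enqueue are hypothetical
names for illustration):

```cpp
#include <deque>
#include <string>
#include <utility>

// Illustrative range with a flag for data expected in the HDFS cache.
struct Range {
  std::string file;
  bool is_cached;
};

// Cached ranges jump to the front so every instance drains them first;
// all other ranges keep FIFO order at the back.
void Enqueue(std::deque<Range>* queue, Range r) {
  if (r.is_cached) {
    queue->push_front(std::move(r));
  } else {
    queue->push_back(std::move(r));
  }
}
```

This only approximates the lost ReaderContext behavior: cached ranges
are merely prioritized at enqueue time rather than strictly ordered
across the whole read.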
Testing:
- Added a regression test in test_mt_dop that almost always fails
  without this patch.
- Added test coverage for MT scans where Parquet files are split
  across blocks and for scanning tables with mixed file formats.
- Ran core tests on an ASAN build.
- Ran exhaustive tests.
Performance:
No significant perf change.
Ran TPCH (scale 30) with mt_dop set to 4 on a 3-node minicluster on
my desktop.
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 5.85 | +0.35% | 4.35 | +0.60% |
+----------+-----------------------+---------+------------+------------+----------------+
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(30) | TPCH-Q15 | parquet / none / none | 4.46 | 4.30 | +3.63% | 1.64% | 2.23% | 30 | +3.56% | 5.28 | 7.09 |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 9.51 | 9.23 | +3.12% | 1.19% | 0.79% | 30 | +2.84% | 6.48 | 11.69 |
| TPCH(30) | TPCH-Q2 | parquet / none / none | 2.37 | 2.32 | +1.89% | 3.76% | 3.26% | 30 | +2.16% | 1.97 | 2.06 |
| TPCH(30) | TPCH-Q9 | parquet / none / none | 16.84 | 16.55 | +1.77% | 1.24% | 0.57% | 30 | +1.62% | 5.75 | 6.99 |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.65 | 3.60 | +1.43% | 2.19% | 1.72% | 30 | +1.39% | 2.53 | 2.79 |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 12.64 | 12.48 | +1.33% | 1.51% | 1.53% | 30 | +1.28% | 2.67 | 3.36 |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.94 | 2.91 | +1.03% | 1.91% | 1.77% | 30 | +1.48% | 1.98 | 2.16 |
| TPCH(30) | TPCH-Q6 | parquet / none / none | 1.31 | 1.29 | +1.44% | 5.48% | 3.66% | 30 | +0.43% | 1.10 | 1.18 |
| TPCH(30) | TPCH-Q7 | parquet / none / none | 5.49 | 5.44 | +0.78% | 0.59% | 1.16% | 30 | +0.95% | 3.49 | 3.27 |
| TPCH(30) | TPCH-Q8 | parquet / none / none | 5.40 | 5.35 | +0.83% | 0.95% | 0.95% | 30 | +0.83% | 2.81 | 3.37 |
| TPCH(30) | TPCH-Q20 | parquet / none / none | 2.82 | 2.81 | +0.50% | 1.58% | 1.72% | 30 | +0.30% | 1.22 | 1.16 |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.43 | 1.42 | +0.64% | 2.42% | 1.96% | 30 | +0.09% | 0.68 | 1.13 |
| TPCH(30) | TPCH-Q4 | parquet / none / none | 2.59 | 2.58 | +0.28% | 1.54% | 1.99% | 30 | +0.23% | 1.17 | 0.60 |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 2.17 | 2.17 | +0.15% | 3.04% | 3.68% | 30 | +0.15% | 0.30 | 0.18 |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.02 | 2.02 | +0.21% | 2.73% | 2.85% | 30 | +0.07% | 0.26 | 0.30 |
| TPCH(30) | TPCH-Q5 | parquet / none / none | 4.16 | 4.15 | +0.04% | 4.49% | 4.38% | 30 | -0.03% | -0.13 | 0.03 |
| TPCH(30) | TPCH-Q3 | parquet / none / none | 4.75 | 4.76 | -0.28% | 1.34% | 1.47% | 30 | -0.10% | -0.57 | -0.77 |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 4.92 | 4.95 | -0.52% | 1.45% | 2.71% | 30 | +0.03% | 0.12 | -0.94 |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.82 | 2.85 | -0.78% | 1.50% | 1.68% | 30 | -0.65% | -1.66 | -1.91 |
| TPCH(30) | TPCH-Q1 | parquet / none / none | 4.49 | 4.53 | -0.97% | 2.26% | 2.69% | 30 | -0.68% | -0.99 | -1.52 |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 10.11 | 10.19 | -0.85% | 3.27% | 3.71% | 30 | -0.84% | -0.85 | -0.94 |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 21.75 | 22.28 | -2.37% | 3.51% | 7.00% | 30 | -0.88% | -1.35 | -1.66 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
Change-Id: I9a101d0d98dff6e3779f85bc466e4c0bdb38094b
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scan-node-mt.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M be/src/scheduling/scheduler.h
M tests/query_test/test_mt_dop.py
M tests/query_test/test_scanners.py
19 files changed, 702 insertions(+), 415 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/15926/9
--
To view, visit http://gerrit.cloudera.org:8080/15926
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9a101d0d98dff6e3779f85bc466e4c0bdb38094b
Gerrit-Change-Number: 15926
Gerrit-PatchSet: 9
Gerrit-Owner: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>