You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/09/07 17:16:00 UTC
[jira] [Commented] (IMPALA-11539) Mitigate intra-node skew of HDFS scans with MT_DOP
[ https://issues.apache.org/jira/browse/IMPALA-11539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601417#comment-17601417 ]
ASF subversion and git services commented on IMPALA-11539:
----------------------------------------------------------
Commit cf1d6d0f4ed4e6eecaa951ca3561a7f441ef5099 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=cf1d6d0f4 ]
IMPALA-11539: Mitigate intra-node skew of file scans with MT_DOP
Before IMPALA-9655 scan ranges were statically assigned to intra-node
fragment instances based on Longest-Processing Time algorithm:
https://github.com/apache/impala/blame/a7866a94578be6289bbac31686de4d9032ad9261/be/src/scheduling/scheduler.cc#L499-L501
From IMPALA-9655 we use dynamic intra-node load balancing for file
scans. It means fragment instances have a shared queue of scan ranges
and the fragment instances grab the next scan range to be read from
this queue.
IMPALA-9655 got rid of the LPT-algorithm which means the scan ranges
are in a random order in the queue. This can lead to a skew if there
are large scan ranges at the end.
This patch mixes the above two approaches by using a priority queue
for the scan ranges, so each fragment instance would grab the largest
scan range in the queue. This could further mitigate intra-node skewing.
Ranges that are marked to use the hdfs cache are still handled with
higher priority.
The patch intoduces a new class called ScanRangeQueueMt which implements
the above.
Testing:
* added e2e test in which MIN(bytes_read) / MAX(bytes_read) is greater
then 0.6 with this patch. Earlier we could see it less than 0.4.
Performance:
No significant perf change. Ran TPCH (scale 30) with mt_dop set to 10 on
a 3 node minicluster on my desktop.
+----------+------------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+------------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / snap / block | 3.16 | -0.37% | 2.29 | +0.40% |
+----------+------------------------+---------+------------+------------+----------------+
+----------+----------+------------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+----------+----------+------------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(30) | TPCH-Q20 | parquet / snap / block | 1.58 | 1.55 | +2.04% | 2.26% | 3.15% | 20 | +3.14% | 2.53 | 2.34 |
| TPCH(30) | TPCH-Q11 | parquet / snap / block | 0.68 | 0.66 | +2.94% | 3.67% | 4.18% | 20 | +0.90% | 1.97 | 2.33 |
| TPCH(30) | TPCH-Q5 | parquet / snap / block | 2.02 | 1.99 | +1.55% | 1.48% | 1.38% | 20 | +2.27% | 2.53 | 3.40 |
| TPCH(30) | TPCH-Q1 | parquet / snap / block | 2.59 | 2.56 | +1.22% | 3.25% | 3.12% | 20 | +1.75% | 1.33 | 1.21 |
| TPCH(30) | TPCH-Q15 | parquet / snap / block | 2.15 | 2.13 | +1.20% | 1.53% | 1.39% | 20 | +0.91% | 2.38 | 2.58 |
| TPCH(30) | TPCH-Q4 | parquet / snap / block | 1.26 | 1.24 | +1.62% | 2.33% | 2.77% | 20 | +0.36% | 2.23 | 1.98 |
| TPCH(30) | TPCH-Q6 | parquet / snap / block | 0.74 | 0.73 | +1.31% | 3.23% | 3.24% | 20 | +0.54% | 1.24 | 1.27 |
| TPCH(30) | TPCH-Q7 | parquet / snap / block | 2.13 | 2.11 | +1.03% | 1.84% | 1.71% | 20 | +0.52% | 1.85 | 1.83 |
| TPCH(30) | TPCH-Q19 | parquet / snap / block | 2.02 | 2.00 | +0.91% | 1.26% | 1.32% | 20 | +0.34% | 1.80 | 2.22 |
| TPCH(30) | TPCH-Q2 | parquet / snap / block | 1.42 | 1.41 | +0.95% | 2.54% | 1.58% | 20 | +0.07% | 0.42 | 1.41 |
| TPCH(30) | TPCH-Q16 | parquet / snap / block | 0.93 | 0.92 | +0.59% | 4.06% | 4.53% | 20 | +0.27% | 0.66 | 0.43 |
| TPCH(30) | TPCH-Q12 | parquet / snap / block | 1.57 | 1.56 | +0.68% | 1.98% | 2.34% | 20 | +0.14% | 1.04 | 0.98 |
| TPCH(30) | TPCH-Q21 | parquet / snap / block | 10.69 | 10.66 | +0.28% | 0.54% | 0.59% | 20 | +0.36% | 1.56 | 1.59 |
| TPCH(30) | TPCH-Q8 | parquet / snap / block | 2.30 | 2.29 | +0.50% | 1.38% | 1.57% | 20 | +0.09% | 1.12 | 1.06 |
| TPCH(30) | TPCH-Q22 | parquet / snap / block | 1.01 | 1.00 | +0.51% | 3.40% | 4.25% | 20 | +0.08% | 0.60 | 0.42 |
| TPCH(30) | TPCH-Q13 | parquet / snap / block | 4.66 | 4.66 | +0.08% | 1.00% | 0.78% | 20 | -0.02% | -0.25 | 0.29 |
| TPCH(30) | TPCH-Q18 | parquet / snap / block | 6.26 | 6.25 | +0.13% | 2.23% | 1.49% | 20 | -0.07% | -0.16 | 0.21 |
| TPCH(30) | TPCH-Q14 | parquet / snap / block | 1.68 | 1.69 | -0.51% | 2.51% | 3.81% | 20 | -0.05% | -0.31 | -0.51 |
| TPCH(30) | TPCH-Q3 | parquet / snap / block | 1.86 | 1.88 | -0.72% | 1.83% | 2.10% | 20 | -0.07% | -0.95 | -1.15 |
| TPCH(30) | TPCH-Q9 | parquet / snap / block | 8.84 | 9.02 | -1.93% | 1.78% | 2.55% | 20 | -2.27% | -2.50 | -2.79 |
| TPCH(30) | TPCH-Q17 | parquet / snap / block | 8.27 | 8.45 | -2.09% | 1.16% | 1.32% | 20 | -2.33% | -4.01 | -5.35 |
| TPCH(30) | TPCH-Q10 | parquet / snap / block | 4.95 | 5.13 | -3.60% | 6.86% | 9.75% | 20 | -2.53% | -2.70 | -1.37 |
+----------+----------+------------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
Change-Id: Ib7dc1f1665565da6c0e155c1e585f7089b18a180
Reviewed-on: http://gerrit.cloudera.org:8080/18929
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
> Mitigate intra-node skew of HDFS scans with MT_DOP
> --------------------------------------------------
>
> Key: IMPALA-11539
> URL: https://issues.apache.org/jira/browse/IMPALA-11539
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
>
> Before IMPALA-9655 scan ranges were statically assigned to intra-node fragment instances based on Longest-Processing Time algorithm:
> https://github.com/apache/impala/blame/a7866a94578be6289bbac31686de4d9032ad9261/be/src/scheduling/scheduler.cc#L499-L501
> From IMPALA-9655 we use dynamic intra-node load balancing for HDFS scans. It means fragment instances have a shared queue of scan ranges and the fragment instances grab the next scan range to be read from this queue.
> IMPALA-9655 got rid of the LPT-algorithm which means the scan ranges are in a random order in the queue. This can lead to a skew if there are large scan ranges at the end.
> We could mix the above two by using a priority queue for the scan ranges, so each fragment instance would grab the largest scan range in the queue. This could further mitigate intra-node skewing.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org