You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/09/07 17:16:00 UTC
[jira] [Commented] (IMPALA-11539) Mitigate intra-node skew of HDFS scans with MT_DOP

    [ https://issues.apache.org/jira/browse/IMPALA-11539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601417#comment-17601417 ] 

ASF subversion and git services commented on IMPALA-11539:
----------------------------------------------------------

Commit cf1d6d0f4ed4e6eecaa951ca3561a7f441ef5099 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=cf1d6d0f4 ]

IMPALA-11539: Mitigate intra-node skew of file scans with MT_DOP

Before IMPALA-9655 scan ranges were statically assigned to intra-node
fragment instances based on Longest-Processing Time algorithm:
https://github.com/apache/impala/blame/a7866a94578be6289bbac31686de4d9032ad9261/be/src/scheduling/scheduler.cc#L499-L501

From IMPALA-9655 we use dynamic intra-node load balancing for file
scans. It means fragment instances have a shared queue of scan ranges
and the fragment instances grab the next scan range to be read from
this queue.

IMPALA-9655 got rid of the LPT-algorithm which means the scan ranges
are in a random order in the queue. This can lead to a skew if there
are large scan ranges at the end.

This patch mixes the above two approaches by using a priority queue
for the scan ranges, so each fragment instance would grab the largest
scan range in the queue. This could further mitigate intra-node skewing.
Ranges that are marked to use the hdfs cache are still handled with
higher priority.

The patch intoduces a new class called ScanRangeQueueMt which implements
the above.

Testing:
 * added e2e test in which MIN(bytes_read) / MAX(bytes_read) is greater
   then 0.6 with this patch. Earlier we could see it less than 0.4.

Performance:
No significant perf change. Ran TPCH (scale 30) with mt_dop set to 10 on
a 3 node minicluster on my desktop.

+----------+------------------------+---------+------------+------------+----------------+
| Workload | File Format            | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+------------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / snap / block | 3.16    | -0.37%     | 2.29       | +0.40%         |
+----------+------------------------+---------+------------+------------+----------------+

+----------+----------+------------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query    | File Format            | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+----------+----------+------------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(30) | TPCH-Q20 | parquet / snap / block | 1.58   | 1.55        |   +2.04%   |   2.26%   |   3.15%        | 20    |   +3.14%       | 2.53    | 2.34  |
| TPCH(30) | TPCH-Q11 | parquet / snap / block | 0.68   | 0.66        |   +2.94%   |   3.67%   |   4.18%        | 20    |   +0.90%       | 1.97    | 2.33  |
| TPCH(30) | TPCH-Q5  | parquet / snap / block | 2.02   | 1.99        |   +1.55%   |   1.48%   |   1.38%        | 20    |   +2.27%       | 2.53    | 3.40  |
| TPCH(30) | TPCH-Q1  | parquet / snap / block | 2.59   | 2.56        |   +1.22%   |   3.25%   |   3.12%        | 20    |   +1.75%       | 1.33    | 1.21  |
| TPCH(30) | TPCH-Q15 | parquet / snap / block | 2.15   | 2.13        |   +1.20%   |   1.53%   |   1.39%        | 20    |   +0.91%       | 2.38    | 2.58  |
| TPCH(30) | TPCH-Q4  | parquet / snap / block | 1.26   | 1.24        |   +1.62%   |   2.33%   |   2.77%        | 20    |   +0.36%       | 2.23    | 1.98  |
| TPCH(30) | TPCH-Q6  | parquet / snap / block | 0.74   | 0.73        |   +1.31%   |   3.23%   |   3.24%        | 20    |   +0.54%       | 1.24    | 1.27  |
| TPCH(30) | TPCH-Q7  | parquet / snap / block | 2.13   | 2.11        |   +1.03%   |   1.84%   |   1.71%        | 20    |   +0.52%       | 1.85    | 1.83  |
| TPCH(30) | TPCH-Q19 | parquet / snap / block | 2.02   | 2.00        |   +0.91%   |   1.26%   |   1.32%        | 20    |   +0.34%       | 1.80    | 2.22  |
| TPCH(30) | TPCH-Q2  | parquet / snap / block | 1.42   | 1.41        |   +0.95%   |   2.54%   |   1.58%        | 20    |   +0.07%       | 0.42    | 1.41  |
| TPCH(30) | TPCH-Q16 | parquet / snap / block | 0.93   | 0.92        |   +0.59%   |   4.06%   |   4.53%        | 20    |   +0.27%       | 0.66    | 0.43  |
| TPCH(30) | TPCH-Q12 | parquet / snap / block | 1.57   | 1.56        |   +0.68%   |   1.98%   |   2.34%        | 20    |   +0.14%       | 1.04    | 0.98  |
| TPCH(30) | TPCH-Q21 | parquet / snap / block | 10.69  | 10.66       |   +0.28%   |   0.54%   |   0.59%        | 20    |   +0.36%       | 1.56    | 1.59  |
| TPCH(30) | TPCH-Q8  | parquet / snap / block | 2.30   | 2.29        |   +0.50%   |   1.38%   |   1.57%        | 20    |   +0.09%       | 1.12    | 1.06  |
| TPCH(30) | TPCH-Q22 | parquet / snap / block | 1.01   | 1.00        |   +0.51%   |   3.40%   |   4.25%        | 20    |   +0.08%       | 0.60    | 0.42  |
| TPCH(30) | TPCH-Q13 | parquet / snap / block | 4.66   | 4.66        |   +0.08%   |   1.00%   |   0.78%        | 20    |   -0.02%       | -0.25   | 0.29  |
| TPCH(30) | TPCH-Q18 | parquet / snap / block | 6.26   | 6.25        |   +0.13%   |   2.23%   |   1.49%        | 20    |   -0.07%       | -0.16   | 0.21  |
| TPCH(30) | TPCH-Q14 | parquet / snap / block | 1.68   | 1.69        |   -0.51%   |   2.51%   |   3.81%        | 20    |   -0.05%       | -0.31   | -0.51 |
| TPCH(30) | TPCH-Q3  | parquet / snap / block | 1.86   | 1.88        |   -0.72%   |   1.83%   |   2.10%        | 20    |   -0.07%       | -0.95   | -1.15 |
| TPCH(30) | TPCH-Q9  | parquet / snap / block | 8.84   | 9.02        |   -1.93%   |   1.78%   |   2.55%        | 20    |   -2.27%       | -2.50   | -2.79 |
| TPCH(30) | TPCH-Q17 | parquet / snap / block | 8.27   | 8.45        |   -2.09%   |   1.16%   |   1.32%        | 20    |   -2.33%       | -4.01   | -5.35 |
| TPCH(30) | TPCH-Q10 | parquet / snap / block | 4.95   | 5.13        |   -3.60%   |   6.86%   |   9.75%        | 20    |   -2.53%       | -2.70   | -1.37 |
+----------+----------+------------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+

Change-Id: Ib7dc1f1665565da6c0e155c1e585f7089b18a180
Reviewed-on: http://gerrit.cloudera.org:8080/18929
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Mitigate intra-node skew of HDFS scans with MT_DOP
> --------------------------------------------------
>
>                 Key: IMPALA-11539
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11539
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>
> Before IMPALA-9655 scan ranges were statically assigned to intra-node fragment instances based on Longest-Processing Time algorithm:
> https://github.com/apache/impala/blame/a7866a94578be6289bbac31686de4d9032ad9261/be/src/scheduling/scheduler.cc#L499-L501
> From IMPALA-9655 we use dynamic intra-node load balancing for HDFS scans. It means fragment instances have a shared queue of scan ranges and the fragment instances grab the next scan range to be read from this queue.
> IMPALA-9655 got rid of the LPT-algorithm which means  the scan ranges are in a random order in the queue. This can lead to a skew if there are large scan ranges at the end.
> We could mix the above two by using a priority queue for the scan ranges, so each fragment instance would grab the largest scan range in the queue. This could further mitigate intra-node skewing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org