Posted to dev@tez.apache.org by "okumin (Jira)" <ji...@apache.org> on 2020/11/06 13:55:00 UTC
[jira] [Created] (TEZ-4246) Avoid uneven local disk usage for spills
okumin created TEZ-4246:
---------------------------
Summary: Avoid uneven local disk usage for spills
Key: TEZ-4246
URL: https://issues.apache.org/jira/browse/TEZ-4246
Project: Apache Tez
Issue Type: Improvement
Affects Versions: 0.9.2, 0.10.0
Reporter: okumin
This ticket aims to keep a task attempt from overusing a specific disk.
I have observed PipelinedSorter repeatedly spilling a large amount of data to only one of two disks.
When a NodeManager has just two disks, they are selected in a strict round-robin fashion.
[https://github.com/apache/hadoop/blob/rel/release-3.1.3/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/LocalDirAllocator.java#L422-L439]
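Each spill creates its data file and then its index file, so it consumes two consecutive allocations. With two disks, the data file therefore always takes the same turn in the rotation, meaning that Tez is likely to put all data files on the same disk and all index files on the other.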
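
The effect described above can be reproduced with a toy model. The sketch below is not Hadoop's LocalDirAllocator or Tez's PipelinedSorter; the class RoundRobinSpillDemo and its methods are hypothetical names used only for illustration, and it assumes a strict round-robin that advances by one disk per allocation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of strict round-robin disk selection (an assumption, not Hadoop code). */
public class RoundRobinSpillDemo {
    private final int numDisks;
    private int next = 0;

    RoundRobinSpillDemo(int numDisks) {
        this.numDisks = numDisks;
    }

    /** Returns the index of the chosen disk and advances the rotation by one. */
    int allocate() {
        int chosen = next;
        next = (next + 1) % numDisks;
        return chosen;
    }

    /**
     * Simulates numSpills spills. Each spill allocates a data file and then
     * an index file. Returns the disk chosen for each data file.
     */
    static List<Integer> dataFileDisks(int numDisks, int numSpills) {
        RoundRobinSpillDemo allocator = new RoundRobinSpillDemo(numDisks);
        List<Integer> disks = new ArrayList<>();
        for (int i = 0; i < numSpills; i++) {
            disks.add(allocator.allocate()); // large data file
            allocator.allocate();            // small index file takes the next turn
        }
        return disks;
    }

    public static void main(String[] args) {
        // With two disks, every data file lands on disk 0.
        System.out.println(dataFileDisks(2, 4)); // prints [0, 0, 0, 0]
    }
}
```

In this model the imbalance is specific to an even split of turns: with two disks and two allocations per spill, the rotation returns to the same disk at the start of every spill.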
This uneven usage is especially inconvenient when we use features with a soft limit, such as the following.
* https://issues.apache.org/jira/browse/TEZ-4112
Index files are relatively small, so I'd suggest putting a data file and its index file in the same directory so that the round-robin doesn't use up a disk's turn on such a small file.
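
As a rough illustration of the proposal, the toy model below makes a single round-robin allocation per spill and places the index file next to the data file. It is only a sketch under the same strict round-robin assumption; SameDirSpillDemo is a hypothetical name, and this is not the actual Tez change.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: one allocation per spill, index file shares the data file's disk. */
public class SameDirSpillDemo {

    /**
     * Simulates numSpills spills over numDisks disks selected in strict
     * round-robin order. Returns the disk chosen for each data file.
     */
    static List<Integer> dataFileDisks(int numDisks, int numSpills) {
        int next = 0;
        List<Integer> disks = new ArrayList<>();
        for (int i = 0; i < numSpills; i++) {
            int dataDisk = next;          // the only allocation for this spill
            next = (next + 1) % numDisks;
            // The index file is written to dataDisk as well, so the rotation
            // does not advance a second time.
            disks.add(dataDisk);
        }
        return disks;
    }

    public static void main(String[] args) {
        // With two disks, data files now alternate between them.
        System.out.println(dataFileDisks(2, 4)); // prints [0, 1, 0, 1]
    }
}
```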
--
This message was sent by Atlassian Jira
(v8.3.4#803005)