Posted to dev@tez.apache.org by "okumin (Jira)" <ji...@apache.org> on 2020/11/06 13:55:00 UTC

[jira] [Created] (TEZ-4246) Avoid uneven local disk usage for spills

okumin created TEZ-4246:
---------------------------

             Summary: Avoid uneven local disk usage for spills
                 Key: TEZ-4246
                 URL: https://issues.apache.org/jira/browse/TEZ-4246
             Project: Apache Tez
          Issue Type: Improvement
    Affects Versions: 0.9.2, 0.10.0
            Reporter: okumin


This ticket aims to keep a task attempt from overusing a specific local disk.

 

I have observed PipelinedSorter repeatedly spilling a large amount of data to only one of two disks.

When a NodeManager has just two disks, they are selected in a strictly round-robin fashion.

[https://github.com/apache/hadoop/blob/rel/release-3.1.3/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/LocalDirAllocator.java#L422-L439]

Each spill creates its data file and then its index file, so the two allocations alternate between the two disks. As a result, Tez is likely to put all data files on the same disk and all index files on the other.
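
The effect can be reproduced with a small simulation. This is only a sketch: the RoundRobin class is a hypothetical stand-in for a strict round-robin directory picker, not the LocalDirAllocator or PipelinedSorter code, and it ignores free-space checks. The file sizes are made-up numbers.

```java
import java.util.Arrays;

public class RoundRobinSpillDemo {
    /** Hypothetical strict round-robin picker (not the Hadoop API). */
    static final class RoundRobin {
        private final int numDirs;
        private int next = 0;

        RoundRobin(int numDirs) {
            this.numDirs = numDirs;
        }

        /** Returns the index of the chosen directory and advances the cursor. */
        int pick() {
            int chosen = next;
            next = (next + 1) % numDirs;
            return chosen;
        }
    }

    /** Returns the bytes written to each disk after the given number of spills. */
    static long[] simulate(int numDirs, int spills, long dataBytes, long indexBytes) {
        RoundRobin allocator = new RoundRobin(numDirs);
        long[] used = new long[numDirs];
        for (int i = 0; i < spills; i++) {
            used[allocator.pick()] += dataBytes;  // data file is allocated first
            used[allocator.pick()] += indexBytes; // index file is allocated second
        }
        return used;
    }

    public static void main(String[] args) {
        // Assumed sizes: 1 MB data file and 1 KB index file per spill.
        long[] used = simulate(2, 10, 1_000_000L, 1_000L);
        System.out.println(Arrays.toString(used)); // prints [10000000, 10000]
    }
}
```

With two disks, every data file lands on disk 0 and every index file on disk 1, which matches the skew described above.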

 

This uneven usage is especially inconvenient when we use features with a soft limit, such as the following.
 * https://issues.apache.org/jira/browse/TEZ-4112

 

Index files are relatively small, so I'd suggest putting a data file and its index file in the same directory. That way, the round-robin doesn't skip a disk for such small usage.
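
A minimal sketch of the proposed placement, under the same simplifying assumptions (strict round-robin, no free-space checks, made-up file sizes). The class and method names are hypothetical and this is not the actual patch; it only illustrates making one directory pick per spill and writing both files there.

```java
import java.util.Arrays;

public class ColocatedSpillDemo {
    /** Returns the bytes written to each disk when each spill keeps both files together. */
    static long[] simulate(int numDirs, int spills, long dataBytes, long indexBytes) {
        long[] used = new long[numDirs];
        int next = 0;
        for (int i = 0; i < spills; i++) {
            int dir = next;               // a single round-robin pick per spill
            next = (next + 1) % numDirs;
            // The index file follows the data file into the same directory.
            used[dir] += dataBytes + indexBytes;
        }
        return used;
    }

    public static void main(String[] args) {
        // Assumed sizes: 1 MB data file and 1 KB index file per spill.
        long[] used = simulate(2, 10, 1_000_000L, 1_000L);
        System.out.println(Arrays.toString(used)); // prints [5005000, 5005000]
    }
}
```

Because the cursor advances once per spill rather than once per file, consecutive data files alternate between the two disks.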


