Posted to issues@spark.apache.org by "liupengcheng (JIRA)" <ji...@apache.org> on 2019/01/24 03:57:00 UTC

[jira] [Created] (SPARK-26712) Disk broken caused NM recovery failure causing YarnShuffleService not available

liupengcheng created SPARK-26712:
------------------------------------

             Summary: Disk broken caused NM recovery failure causing YarnShuffleService not available
                 Key: SPARK-26712
                 URL: https://issues.apache.org/jira/browse/SPARK-26712
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 2.4.0, 2.1.0
            Reporter: liupengcheng


Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is enabled. However, the recovery file lives under a fixed directory, which may be unavailable if its disk is broken. If the NM then restarts (for example, because it was killed), the `ExecutorShuffleInfo` is lost, and the shuffle service becomes unavailable even though there are still executors on the node.

This may eventually cause job failures (if the node or the executors on it are not blacklisted), or at the very least waste resources, since shuffle fetches from this node always fail.

For long-running Spark applications, this problem may be more serious.

So I think we should support multiple directories (multiple disks) for this recovery, and switch to a good directory when the disk of the current directory is broken.
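
To make the proposal concrete, here is a minimal sketch of how a usable recovery directory might be picked from several candidates. It is not taken from Spark or YARN: the class `RecoveryDirSelector` and its methods `isUsable` and `firstUsable` are hypothetical names, and the health check (create the directory, then write and delete a probe file) is only one assumed way to detect a broken disk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Spark or YARN.
class RecoveryDirSelector {

    // Assumed health check: the directory can be created (or already exists)
    // and a small probe file can be written to it and deleted again.
    static boolean isUsable(Path dir) {
        try {
            Files.createDirectories(dir);
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            // Any I/O or permission failure is treated as "this disk is bad".
            return false;
        }
    }

    // Returns the first candidate that passes the health check, in list order,
    // or an empty Optional if none of the candidates is usable.
    static Optional<Path> firstUsable(List<Path> candidates) {
        for (Path dir : candidates) {
            if (isUsable(dir)) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }
}
```

With candidates on different disks, the service could call `firstUsable` when it starts and again whenever a write to the current recovery directory fails. Moving or rebuilding the already-persisted `ExecutorShuffleInfo` in the new directory is a separate problem that this sketch does not address.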


