Posted to issues@spark.apache.org by "Adam Kennedy (Jira)" <ji...@apache.org> on 2021/08/06 16:22:00 UTC

[jira] [Created] (SPARK-36446) After dynamic deallocation YARN shuffle server restart crashes all jobs

Adam Kennedy created SPARK-36446:
------------------------------------

             Summary: After dynamic deallocation YARN shuffle server restart crashes all jobs
                 Key: SPARK-36446
                 URL: https://issues.apache.org/jira/browse/SPARK-36446
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 3.1.2, 2.4.8
            Reporter: Adam Kennedy


When dynamic allocation is enabled, executors that deallocate rely on the shuffle server to hold blocks and supply them to remaining executors.

When the YARN shuffle server restarts (either intentionally or due to a crash), it loses its block information and relies on being able to contact the executors (whose locations it stores durably) to refetch the list of blocks.

Each side therefore depends on the other to hold block information, and this mutual dependency fails fatally under some common scenarios.

For example, if a Spark application is running under dynamic allocation, some number of executors will almost always shut down.

If, after this has occurred, any shuffle server crashes or is restarted (either directly, when running as a standalone service, or as part of a YARN node manager restart), then there is no way to restore the block data and it is permanently lost.

Worse, when executors try to fetch blocks from the shuffle server, the shuffle server cannot locate the executor that wrote them, decides it doesn't exist, and treats this as a fatal exception, which causes the application to terminate and crash.
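
The sequence described above can be sketched with a small toy model. This is only an illustration of the failure mode as reported, not Spark's actual code: the class ToyShuffleServer, the exception ExecutorNotRegistered, the function run_scenario, and the executor and block ids are all hypothetical names chosen for this sketch, and the restart step is simplified to "rebuild from whichever executors are still alive".

```python
# Toy model only: every name here is hypothetical and none of it is Spark's API.

class ExecutorNotRegistered(Exception):
    """Raised when the shuffle server has no record of the requested executor."""


class ToyShuffleServer:
    def __init__(self):
        # Block lists are held in memory only, so a restart loses them.
        self.blocks = {}  # executor id -> set of block ids

    def register(self, executor_id, block_ids):
        self.blocks[executor_id] = set(block_ids)

    def restart(self, live_executors):
        # Assumption: after a restart, block lists can only be rebuilt
        # from executors that are still running.
        self.blocks = {}
        for executor_id, block_ids in live_executors.items():
            self.register(executor_id, block_ids)

    def fetch(self, executor_id, block_id):
        if executor_id not in self.blocks:
            # In the report, this case is treated as fatal for the application.
            raise ExecutorNotRegistered(executor_id)
        if block_id not in self.blocks[executor_id]:
            raise KeyError(block_id)
        return (executor_id, block_id)


def run_scenario():
    server = ToyShuffleServer()
    live = {"exec-1": ["shuffle_0_0_0"], "exec-2": ["shuffle_0_1_0"]}
    for executor_id, block_ids in live.items():
        server.register(executor_id, block_ids)

    # Dynamic allocation removes exec-1; the server can still serve its blocks.
    del live["exec-1"]
    before = server.fetch("exec-1", "shuffle_0_0_0")

    # The shuffle server restarts and can only ask the surviving executors.
    server.restart(live)
    try:
        server.fetch("exec-1", "shuffle_0_0_0")
        after = "ok"
    except ExecutorNotRegistered:
        after = "lost"
    return before, after


if __name__ == "__main__":
    print(run_scenario())
```

In this sketch the fetch for the deallocated executor succeeds before the restart and raises afterwards, while blocks of the still-running executor remain fetchable.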

Thus, in a real-world scenario that we observe on a 1000+ node multi-tenant cluster where dynamic allocation is on by default, a rolling restart of the YARN node managers will crash ALL jobs that have deallocated any executor and that either have shuffles or have transferred blocks to the shuffle server so that the executor could shut down.



