Posted to issues@spark.apache.org by "YE (Jira)" <ji...@apache.org> on 2023/07/25 08:45:00 UTC

[jira] [Created] (SPARK-44542) eagerly load SparkExitCode class in SparkUncaughtExceptionHandler

YE created SPARK-44542:
--------------------------

             Summary: eagerly load SparkExitCode class in SparkUncaughtExceptionHandler
                 Key: SPARK-44542
                 URL: https://issues.apache.org/jira/browse/SPARK-44542
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.4.1, 3.3.2, 3.1.3
            Reporter: YE


There are two pieces of background for this improvement proposal:

1. When running Spark on YARN, a disk may become corrupted while the application is running. The corrupted disk can hold the Spark jars (the cached archive from spark.yarn.archive); in that case, the executor JVM can no longer load any Spark-related classes.

2. Spark uses the OutputCommitCoordinator to prevent data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once one task's commit request is authorized, all other commit requests for that partition are denied until the committing task fails (sketched below).
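
To make the arbitration rule concrete, here is a minimal sketch in Scala of the behaviour described in point 2. This is illustrative only, not Spark's actual OutputCommitCoordinator code; the names CommitArbitrationSketch and CommitState are made up, and the real coordinator tracks state per (stage, partition) on the driver.

{code:scala}
object CommitArbitrationSketch {
  sealed trait CommitState
  case object NoAuthorizedCommitter extends CommitState
  case class CommitterAuthorized(attemptId: Int) extends CommitState

  // The real coordinator keeps one entry per (stage, partition);
  // this sketch tracks a single partition.
  private var state: CommitState = NoAuthorizedCommitter

  // A task attempt asks for permission to commit its output:
  // the first requester wins, everyone else is denied.
  def canCommit(attemptId: Int): Boolean = synchronized {
    state match {
      case NoAuthorizedCommitter =>
        state = CommitterAuthorized(attemptId)
        true
      case CommitterAuthorized(winner) =>
        winner == attemptId
    }
  }

  // Only when the authorized attempt is reported as failed is the
  // authorization cleared so that another attempt may commit.
  def taskFailed(attemptId: Int): Unit = synchronized {
    state match {
      case CommitterAuthorized(winner) if winner == attemptId =>
        state = NoAuthorizedCommitter
      case _ => // a non-authorized attempt failing changes nothing
    }
  }
}
{code}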


We encountered a corner case combining the two situations above, which caused the Spark job to hang. A short timeline:
 # task 5372 (tid: 21662) started running at 21:55
 # around 22:00, the disk holding the Spark archive for that task/executor became corrupted, making the archive inaccessible from the executor JVM's perspective
 # the task kept running; at 22:05 it requested commit permission from the coordinator and performed the commit
 # due to the corrupted disk, however, an exception was raised in the executor JVM
 # the SparkUncaughtExceptionHandler kicked in, but since the jar/disk was corrupted the handler itself threw an exception, and the halt process threw an exception as well
 # the executor hung there with no more tasks running, while the authorized commit request remained valid on the driver side
 # speculative tasks started to kick in, but lacking commit permission, every speculative attempt was killed/denied (demonstrated in the sketch after this list)
 # the job hung until our SRE killed the container from outside
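
Using the arbitration sketch above, the hang in steps 6-8 comes down to the authorized attempt never being reported as failed (the speculative attempt ids here are made up for illustration):

{code:scala}
// The task attempt wins the commit authorization, then its executor
// hangs without ever reporting the task as failed:
CommitArbitrationSketch.canCommit(21662)  // true
// Every speculative attempt is denied from now on, indefinitely:
CommitArbitrationSketch.canCommit(99001)  // false
CommitArbitrationSketch.canCommit(99002)  // false
{code}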

Some screenshots are provided below.

!image-2023-07-25-16-37-16-821.png!

!image-2023-07-25-16-38-52-270.png!

!image-2023-07-25-16-39-40-182.png!


For this specific case, I'd like to propose eagerly loading the SparkExitCode class in SparkUncaughtExceptionHandler, so that the halt process can execute rather than throw an exception when SparkExitCode is no longer loadable, as in the scenario above.
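
A minimal sketch of the proposed fix, assuming the change lives inside SparkUncaughtExceptionHandler itself (the class name and fields below are illustrative, not the actual patch; SparkExitCode.UNCAUGHT_EXCEPTION, UNCAUGHT_EXCEPTION_TWICE, and OOM are the existing constants in org.apache.spark.util):

{code:scala}
package org.apache.spark.util

private[spark] class SparkUncaughtExceptionHandlerSketch(
    exitOnUncaughtException: Boolean = true)
  extends Thread.UncaughtExceptionHandler {

  // Referencing the constants at construction time forces the JVM to load
  // and initialize SparkExitCode while the jars on disk are still
  // readable. Without this, the class is first loaded inside
  // uncaughtException, where a corrupted disk surfaces as a
  // NoClassDefFoundError and the halt below is never reached.
  private val uncaughtExceptionCode = SparkExitCode.UNCAUGHT_EXCEPTION
  private val uncaughtExceptionTwiceCode = SparkExitCode.UNCAUGHT_EXCEPTION_TWICE
  private val oomCode = SparkExitCode.OOM

  override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
    try {
      exception match {
        case _: OutOfMemoryError =>
          Runtime.getRuntime.halt(oomCode)
        case _ if exitOnUncaughtException =>
          Runtime.getRuntime.halt(uncaughtExceptionCode)
        case _ => // leave the JVM running for non-fatal configurations
      }
    } catch {
      // Last resort: if even the halt path throws, use the eagerly
      // captured code so the executor dies instead of hanging.
      case _: Throwable => Runtime.getRuntime.halt(uncaughtExceptionTwiceCode)
    }
  }
}
{code}

The key point is that the eager references move the class-loading failure window to executor start-up, when the archive from spark.yarn.archive is known to be intact, instead of the uncaught-exception path, where failing to load a class leaves the JVM alive but useless.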



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org