You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/03 00:06:24 UTC

[GitHub] [spark] kevin85421 opened a new pull request, #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

kevin85421 opened a new pull request, #37384:
URL: https://github.com/apache/spark/pull/37384

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   ![milestone2](https://user-images.githubusercontent.com/20109646/182496148-8d19d885-2050-45fa-8653-682129123d44.png)
   A. Non-Task Failure: Driver does not send any LaunchTask message to Executor, and thus there is no running task on Executor. It is impossible to be Task Failure.
   
   B. Non-Task Failure: If a RPC failure happens before receiving the StatusUpdate message from Executor, it is Non-Task Failure because the task does not start to run in Executor.
   
   C. Task Failure: If there is any running task in the Executor, we will recognize the RPC failure as Task Failure.
   
   D. Non-Task Failure: There is no running task on the Executor when all tasks on the Executor are in finished states (FINISHED, FAILED, KILLED, LOST). Hence, the RPC failures in this phase are Network Failure.
   ### Why are the changes needed?
   There are two possible reasons, including Network Failure and Task Failure, to make RPC failures.
   
   (1) Task Failure: The network is good, but the task causes the executor's JVM crash. Hence, RPC fails.
   (2) Network Failure: The executor works well, but the network between Driver and Executor is broken. Hence, RPC fails.
   
   We should handle these two different kinds of failure in different ways. First, if the failure is Task Failure, we should increment the variable `numFailures`. If the value of `numFailures` is larger than a threshold, Spark will label the job failed. Second, if the failure is Network Failure, we will not increment the variable `numFailures`. We will just assign the task to a new executor. Hence, the job will not be recognized as failed due to Network Failure.
   
   However, currently, Spark recognizes every RPC failure as Task Failure. Hence, this PR aims to categorize RPC failures into two categories, Task Failure and Network Failure.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   ```
   build/sbt "core/testOnly *TaskSchedulerImplSuite -- -z SPARK-39955"
   build/sbt "core/testOnly *TaskSetManagerSuite"
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kevin85421 commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
kevin85421 commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1207165550

   Hi @mridulm,
   
   Take a case in Databricks as an example, we have observed that all the sockets for executor connections on driver are closed by unknown reasons. Hence, driver will trigger `onDisconnected` again and again until Spark job fails. In that period, driver cannot connect to any existing executor, but driver recovers (can connect to new executors) after few minutes. Hence, we need to avoid Spark jobs failing in this period.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kevin85421 commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
kevin85421 commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1205832942

   Hi @mridulm,
   
   # Task Failure vs Non-Task Failure
   We can categorize executor loss reasons into two categories.
   
   (1) Task Failure: The network is good, but the task causes the executor's JVM crash. Hence, executor loses. Take the following as an example, the tasks will close executor's JVM by `System.exit(0)`. If the executor loss is caused by Task Failure, we should increment the variable `numFailures`. If the value of `numFailures` is larger than a threshold, Spark will label the job failed.
   ```
   sc.parallelize(1 to 10, 1).map{ x => System.exit(0) }.count()
   ```
   
   (2) Non-Task Failure: If the executor loss is not caused by problematic tasks such as the above example, it is caused by Non-Task Failure. For examples, 
   
   Example1: The executor's JVM is closed by the failure of directory creation. 
   Example2: The executor works well, but the network between Driver and Executor is broken.
   
   In these cases, we should not fail the Spark job. We just need to assign the tasks to other nodes.
   
   # Why do we need this PR?
   Currently, driver will consider every executor loss as Task Failure. However, some executor losses may be caused by Non-Task Failures (e.g. network disconnection), and thus some extra Spark job failures may occur. With this PR, we can categorize the reason of executor loss into Task Failure and Non-Task Failure, so some extra Spark job failures can be avoided.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1209917086

   My main concern is whether we are working around an issue that should be investigated/fixed.
   Having said that, as I mentioned above, I dont see an issue as such with the proposed change - and should be a positive enhancement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1206118791

   My query was slightly different - are you actually seeing this issue ? Can you elaborate on what was observed ?
   There are corner cases which we do not deal with currently in spark, given the complexity of handling them, and given how infrequent they are. A stage/job is not immediately failed when a single task fails - it is retried, and there are various exclude lists which prevent repeated schedule on the same executor/node and quickly fail the stage/job.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1205741287

   I am trying to understand what is the actual usecase we are trying to target here.
   There is a behavior change here which has nontrivial implications - and I want to understand why we are doing this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kevin85421 commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
kevin85421 commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1203340760

   cc. @Ngone51 @jiangxb1987 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1204764941

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kevin85421 commented on a diff in pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
kevin85421 commented on code in PR #37384:
URL: https://github.com/apache/spark/pull/37384#discussion_r943141768


##########
core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala:
##########
@@ -108,6 +110,10 @@ class TaskInfo(
     }
   }
 
+  private[spark] def launchSucceeded: Unit = {

Review Comment:
   Updated.



##########
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala:
##########
@@ -1067,11 +1067,14 @@ private[spark] class TaskSetManager(
       }
     }
     for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
+      // If the task is launching, this indicates that Driver has sent LaunchTask to Executor,
+      // but Executor has not sent StatusUpdate(TaskState.RUNNING) to Driver. Hence, we assume that
+      // the task is not running, and it is NetworkFailure rather than TaskFailure.

Review Comment:
   Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on a diff in pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
mridulm commented on code in PR #37384:
URL: https://github.com/apache/spark/pull/37384#discussion_r941817602


##########
core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala:
##########
@@ -108,6 +110,10 @@ class TaskInfo(
     }
   }
 
+  private[spark] def launchSucceeded: Unit = {

Review Comment:
   `def launchSucceeded` -> `def launchSucceeded()`



##########
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala:
##########
@@ -1067,11 +1067,14 @@ private[spark] class TaskSetManager(
       }
     }
     for ((tid, info) <- taskInfos if info.running && info.executorId == execId) {
+      // If the task is launching, this indicates that Driver has sent LaunchTask to Executor,
+      // but Executor has not sent StatusUpdate(TaskState.RUNNING) to Driver. Hence, we assume that
+      // the task is not running, and it is NetworkFailure rather than TaskFailure.

Review Comment:
   Move the comment to the `!info.launching` case ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1208881063

   I am not sure if this is some nuance of Standalone scheduler - will let @Ngone51 or @jiangxb1987 comment better. I have not observed such a behavior in yarn, or heard of something similar in k8s.
   
   In general, would be good to troubleshoot and fix the actual issue (why the behavior of mass disconnect was observed, how to mitigate it) - or atleast understand/explain it: since there might be implications of it elsewhere as well.
   Any more details on this would be great !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1209517121

   @mridulm The massive disconnection issue is an intermittent issue that can't be reproduced. I tend to believe it's not a Spark's issue but due to the bad nodes.
    
   The current fix doesn't target to resolve a specific issue but is a general improvement to Spark. Think about a case where the driver tries to send `LaunchTask` to an executor and the executor loses at the same time. In this case, previously, the fail-to-launch task (which hasn't even been launched on the executor) would increase the num failures due to `ExecutorProcessLost(_, _, causedByApp=true)`. After this fix, the fail-to-launch task won't increase the num failures since it's still under the `launching` state.
   
   Does this make sense to you?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kevin85421 commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
kevin85421 commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1212281434

   Thank @mridulm and @Ngone51 for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm closed pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
mridulm closed pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages
URL: https://github.com/apache/spark/pull/37384


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37384: [SPARK-39955][CORE] Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37384:
URL: https://github.com/apache/spark/pull/37384#issuecomment-1212268374

   Merged to master.
   Thanks for fixing this @kevin85421 !
   Thanks for the review @Ngone51 :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org