You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/18 14:49:39 UTC

[GitHub] [spark] seayoun opened a new pull request #26938: fix executor lost in net cause app hung upp

seayoun opened a new pull request #26938: fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938
 
 
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce any user-facing change?
   <!--
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on issue #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#issuecomment-607550151
 
 
   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#discussion_r360373455
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
 ##########
 @@ -199,6 +200,18 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
       if (now - lastSeenMs > executorTimeoutMs) {
         logWarning(s"Removing executor $executorId with no recent heartbeats: " +
           s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
+        sc.schedulerBackend match {
+          case backend: CoarseGrainedSchedulerBackend =>
+            backend.synchronized {
+              // Mark executor pending to remove if executor heartbeat expired
+              // to avoid reschedule task on this executor again
+              if (!backend.executorsPendingToRemove.contains(executorId)) {
+                backend.executorsPendingToRemove(executorId) = false
 
 Review comment:
   I think this does not resolve the issue completely. Because when you look into `sc.killAndReplaceExecutor` below, you can see that we've already mark it as "pending to remove". Though it happens in another separate thread, but I don't see big difference according to this issue. WDYT?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] seayoun opened a new pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
seayoun opened a new pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938
 
 
   ### **What changes were proposed in this pull request?**
   
   **Backgroud**
   The driver can't sense this executor was lost through the network connection disconnection If an executor was lost in the network and it have not responsed rst and close packet to driver, so driver can only sense this executor dead through heartbeat expired.
   
   **Problems**
   Heartbeat expiration processing flow as follows:
   1. Executor heartbeat expired as above.
   2. HeartbeatReceiver will call scheduler executor lost to rescheduler the tasks on this executor.
   3. HeartbeatReceiver kill the executor.
   
   The tasks on the dead executor have a chance to rescheduled on this dead executor again if the task rescheduler before the executor has't remove from executorBackend, it will send launch task to this executor again, the executor will not response and the driver can't sense through heartbeat beause the executor has lost in network. This cause those tasks rescheduled on this lost executor can't finish forever, and the app will hung up here forever.
   This patch fix this problem, it remove the executor before rescheduler.
   
   ### **Why are the changes needed?**
   This will cause app hung up.
   
   ### **Does this PR introduce any user-facing change?**
   NO
   
   ### **How was this patch tested?**

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#discussion_r360640643
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
 ##########
 @@ -199,6 +200,18 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
       if (now - lastSeenMs > executorTimeoutMs) {
         logWarning(s"Removing executor $executorId with no recent heartbeats: " +
           s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
+        sc.schedulerBackend match {
+          case backend: CoarseGrainedSchedulerBackend =>
+            backend.synchronized {
+              // Mark executor pending to remove if executor heartbeat expired
+              // to avoid reschedule task on this executor again
+              if (!backend.executorsPendingToRemove.contains(executorId)) {
+                backend.executorsPendingToRemove(executorId) = false
 
 Review comment:
   We add the executor before rescheduler can avoid the tasks to rescheduler the bad executors

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] seayoun commented on issue #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
seayoun commented on issue #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#issuecomment-568431278
 
 
   A better way to fix here https://github.com/apache/spark/pull/26980

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#discussion_r360625775
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
 ##########
 @@ -199,6 +200,18 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
       if (now - lastSeenMs > executorTimeoutMs) {
         logWarning(s"Removing executor $executorId with no recent heartbeats: " +
           s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
+        sc.schedulerBackend match {
+          case backend: CoarseGrainedSchedulerBackend =>
+            backend.synchronized {
+              // Mark executor pending to remove if executor heartbeat expired
+              // to avoid reschedule task on this executor again
+              if (!backend.executorsPendingToRemove.contains(executorId)) {
+                backend.executorsPendingToRemove(executorId) = false
 
 Review comment:
   The task can reschedule at this executor before mark it as "pending to remove".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#discussion_r360625700
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
 ##########
 @@ -199,6 +200,18 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
       if (now - lastSeenMs > executorTimeoutMs) {
         logWarning(s"Removing executor $executorId with no recent heartbeats: " +
           s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
+        sc.schedulerBackend match {
+          case backend: CoarseGrainedSchedulerBackend =>
+            backend.synchronized {
+              // Mark executor pending to remove if executor heartbeat expired
+              // to avoid reschedule task on this executor again
+              if (!backend.executorsPendingToRemove.contains(executorId)) {
+                backend.executorsPendingToRemove(executorId) = false
 
 Review comment:
   > `sc.killAndReplaceExecutor` already try to mark it as "pending to remove" this is right, but the task has rescheduled at this executor agait at this time, the executor must be removed from the ExecutorBackend to avoid, for example, it will `disableExecutor` in `CoarseGrainedSchedulerBackend` if the driver lost connection from executor.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#discussion_r360854474
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
 ##########
 @@ -199,6 +200,18 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
       if (now - lastSeenMs > executorTimeoutMs) {
         logWarning(s"Removing executor $executorId with no recent heartbeats: " +
           s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
+        sc.schedulerBackend match {
+          case backend: CoarseGrainedSchedulerBackend =>
+            backend.synchronized {
+              // Mark executor pending to remove if executor heartbeat expired
+              // to avoid reschedule task on this executor again
+              if (!backend.executorsPendingToRemove.contains(executorId)) {
+                backend.executorsPendingToRemove(executorId) = false
 
 Review comment:
   Hi, @seayoun. After a second think, I think this fix can really avoid app hang, though it can not clean up dead records in `CoarseGrainedSchedulerBackend`. 
   
   As you may have realized that a same issue has been posted in SPARK-27348 and its author(@xuanyuanking) has assigned to me recently. And it's really coincidental. And I really appreciate that you would like my proposal. Actually, the main idea is still base on original author's contribution.
   
   Also, thanks for review.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#discussion_r360373455
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
 ##########
 @@ -199,6 +200,18 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
       if (now - lastSeenMs > executorTimeoutMs) {
         logWarning(s"Removing executor $executorId with no recent heartbeats: " +
           s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
+        sc.schedulerBackend match {
+          case backend: CoarseGrainedSchedulerBackend =>
+            backend.synchronized {
+              // Mark executor pending to remove if executor heartbeat expired
+              // to avoid reschedule task on this executor again
+              if (!backend.executorsPendingToRemove.contains(executorId)) {
+                backend.executorsPendingToRemove(executorId) = false
 
 Review comment:
   I think this does not resolve the issue completely. Because when you look into `sc.killAndReplaceExecutor` below, you can see that we've already try to mark it as "pending to remove". Though it happens in another separate thread, but I don't see big difference according to this issue. WDYT?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp

Posted by GitBox <gi...@apache.org>.
seayoun commented on a change in pull request #26938: [SPARK-30297][CORE] Fix executor lost in net cause app hung upp
URL: https://github.com/apache/spark/pull/26938#discussion_r360625700
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala
 ##########
 @@ -199,6 +200,18 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock)
       if (now - lastSeenMs > executorTimeoutMs) {
         logWarning(s"Removing executor $executorId with no recent heartbeats: " +
           s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms")
+        sc.schedulerBackend match {
+          case backend: CoarseGrainedSchedulerBackend =>
+            backend.synchronized {
+              // Mark executor pending to remove if executor heartbeat expired
+              // to avoid reschedule task on this executor again
+              if (!backend.executorsPendingToRemove.contains(executorId)) {
+                backend.executorsPendingToRemove(executorId) = false
 
 Review comment:
   > `sc.killAndReplaceExecutor` already try to mark it as "pending to remove" this is right, but the task has rescheduled at this executor agait at this time, the executor must be removed from the ExecutorBackend to avoid. 
   For example, it will `disableExecutor` in `CoarseGrainedSchedulerBackend` if the driver lost connection from executor, `disableExecutor` mark the executor dead, and then to reschedule the task on the lost connection executors.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org