You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "lyy-pineapple (via GitHub)" <gi...@apache.org> on 2023/10/13 08:28:38 UTC

[PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

lyy-pineapple opened a new pull request, #43366:
URL: https://github.com/apache/spark/pull/43366

   
   
   ### What changes were proposed in this pull request?
   After the task set fails to find schedulable tasks, before checking if there are tasks that cannot be scheduled on any executor, filter out the executors corresponding to the profileId of that task set.
   
   
   ### Why are the changes needed?
   
   In scenarios involving multiple resource profiles (e.g., prof1 and prof2), when a taskset (prof1) has only one remaining task (task0) awaiting scheduling, and there are executors (executor0 with prof0 and executor1 with prof1), if executor1 fails to run task0, executor1 gets blacklisted for task0. Consequently, task0 becomes unschedulable, leading to a blockage in the task scheduling process.
   
   **Example Code:**
   
   ```
   val rprof = new ResourceProfileBuilder()
   val ereqs = new ExecutorResourceRequests()
   ereqs.memory("4g")
   ereqs.memoryOverhead("2g").offHeapMemory("1g")
   val resourceProfile = rprof.require(ereqs).build()
   val rdd = sc.parallelize(1 to 10, 1).withResources(resourceProfile)
   rdd.map(num => {
     if (TaskContext.get().attemptNumber() == 0) {
       throw new RuntimeException("First attempt encounters an error")
     } else {
       num / 2
     }
   }).collect()
   ```
   
   
   **Reason:**
   The issue arises when the taskSet becomes unschedulable. The logic attempts to find a task that cannot be scheduled on any executor across all profiles. However, when determining whether an executor can schedule the task, there is no distinction made based on the resource profile. This leads to an incorrect assumption that executor0 (prof0) can schedule the task, which is not the case.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Add Uts
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #43366:
URL: https://github.com/apache/spark/pull/43366#issuecomment-1911208410

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #43366:
URL: https://github.com/apache/spark/pull/43366#issuecomment-1762123200

   thanks, overall issue makes sense, looking in more detail, might be monday before I get time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #43366:
URL: https://github.com/apache/spark/pull/43366#issuecomment-1764675951

   Please also look at the build failures, make sure your branch is up to date with master branch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on code in PR #43366:
URL: https://github.com/apache/spark/pull/43366#discussion_r1360787114


##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -162,6 +162,8 @@ private[spark] class TaskSchedulerImpl(
   // in turn is used to decide when we can attain data locality on a given host
   protected val hostToExecutors = new HashMap[String, HashSet[String]]
 
+  private val executorToProfile = new HashMap[String, Int]

Review Comment:
   rename executorIdToResourceProfileId



##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -624,7 +627,12 @@ private[spark] class TaskSchedulerImpl(
         }
 
         if (!launchedAnyTask) {
-          taskSet.getCompletelyExcludedTaskIfAny(hostToExecutors).foreach { taskIndex =>
+
+          val hostToExecutorsForTaskSet = hostToExecutors.map{ case (host, execsOnHost) =>

Review Comment:
   add space between map and {:
   
    hostToExecutors.map { case



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #43366:
URL: https://github.com/apache/spark/pull/43366#issuecomment-1762042548

   +CC @tgravescs 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "lyy-pineapple (via GitHub)" <gi...@apache.org>.
lyy-pineapple commented on code in PR #43366:
URL: https://github.com/apache/spark/pull/43366#discussion_r1361755513


##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -162,6 +162,8 @@ private[spark] class TaskSchedulerImpl(
   // in turn is used to decide when we can attain data locality on a given host
   protected val hostToExecutors = new HashMap[String, HashSet[String]]
 
+  private val executorToProfile = new HashMap[String, Int]

Review Comment:
   thanks,Done



##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -162,6 +162,8 @@ private[spark] class TaskSchedulerImpl(
   // in turn is used to decide when we can attain data locality on a given host
   protected val hostToExecutors = new HashMap[String, HashSet[String]]
 
+  private val executorToProfile = new HashMap[String, Int]

Review Comment:
   thanks,Done



##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -624,7 +627,12 @@ private[spark] class TaskSchedulerImpl(
         }
 
         if (!launchedAnyTask) {
-          taskSet.getCompletelyExcludedTaskIfAny(hostToExecutors).foreach { taskIndex =>
+
+          val hostToExecutorsForTaskSet = hostToExecutors.map{ case (host, execsOnHost) =>

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on code in PR #43366:
URL: https://github.com/apache/spark/pull/43366#discussion_r1362069554


##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -624,7 +627,13 @@ private[spark] class TaskSchedulerImpl(
         }
 
         if (!launchedAnyTask) {
-          taskSet.getCompletelyExcludedTaskIfAny(hostToExecutors).foreach { taskIndex =>
+
+          val hostToExecutorsForTaskSet = hostToExecutors.map { case (host, execsOnHost) =>

Review Comment:
   We are generating this map each time there are no tasks scheduled ... for larger clusters, this can go into thousands of entries in the map.
   Instead, why not maintain this map directly ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on code in PR #43366:
URL: https://github.com/apache/spark/pull/43366#discussion_r1362217264


##########
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##########
@@ -624,7 +627,13 @@ private[spark] class TaskSchedulerImpl(
         }
 
         if (!launchedAnyTask) {
-          taskSet.getCompletelyExcludedTaskIfAny(hostToExecutors).foreach { taskIndex =>
+
+          val hostToExecutorsForTaskSet = hostToExecutors.map { case (host, execsOnHost) =>

Review Comment:
   that's a good point, we care about time here more.  I think you can get rid of the executorIdToResourceProfileId in that case to so space won't be as bad.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #43366: [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile
URL: https://github.com/apache/spark/pull/43366


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org