You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wbo4958 (via GitHub)" <gi...@apache.org> on 2023/10/11 00:47:23 UTC

[PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

wbo4958 opened a new pull request, #43323:
URL: https://github.com/apache/spark/pull/43323

   ### What changes were proposed in this pull request?
   This PR is a follow-up of https://github.com/apache/spark/pull/37268 which supports stage-level task resource profile for standalone cluster when dynamic allocation is disabled. This PR enables stage-level task resource profile for the Kubernetes cluster.
   
   
   ### Why are the changes needed?
   
   Users who work on spark ML/DL cases running on Kubernetes would expect stage-level task resource profile feature.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   
   The current tests of https://github.com/apache/spark/pull/37268 can also cover this PR since both Kubernetes and standalone cluster share the same TaskSchedulerImpl class which implements this feature. Apart from that, modifying the existing test to cover the Kubernetes cluster. Apart from that, I also performed some manual tests which have been updated in the comments.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #43323:
URL: https://github.com/apache/spark/pull/43323#issuecomment-1759782896

   +1 looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "asfgit (via GitHub)" <gi...@apache.org>.
asfgit closed pull request #43323: [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled
URL: https://github.com/apache/spark/pull/43323


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on PR #43323:
URL: https://github.com/apache/spark/pull/43323#issuecomment-1761738892

   merged to master and branch-3.5.  Thanks @wbo4958 @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #43323:
URL: https://github.com/apache/spark/pull/43323#issuecomment-1756553207

   # Manual tests
   
   Due to the challenges of conducting Kubernetes application tests within Spark unit tests, I took the initiative to manually perform several tests on the local Kubernetes cluster.
   
   I followed https://jaceklaskowski.github.io/spark-kubernetes-book/demo/spark-shell-on-minikube/ tutorial to setup a local Kubernetes cluster.
   
   ``` bash
   minikube delete
   minikube start
   eval $(minikube -p minikube docker-env)
   cd $SPARK_HOME
   ./bin/docker-image-tool.sh \
     -m \
     -t pr_k8s \
     build
   
   eval $(minikube -p minikube docker-env)
   kubectl create ns spark-demo
   kubens spark-demo
   cd $SPARK_HOME
   
   K8S_SERVER=$(kubectl config view --output=jsonpath='{.clusters[].cluster.server}')
   ```
   
   ## With dynamic allocation disabled.
   
   ``` bash
   ./bin/spark-shell --master k8s://$K8S_SERVER   \
     --conf spark.kubernetes.container.image=spark:pr_k8s \
     --conf spark.kubernetes.context=minikube  \
     --conf spark.kubernetes.namespace=spark-demo   \
     --verbose \
     --num-executors=1 \
     --conf spark.executor.cores=4  \
     --conf spark.task.cpus=1  \
     --conf spark.dynamicAllocation.enabled=fasle
   ```
   
   The above command requires 1 executor with 4 CPU cores, and the default `task.cpus = 1`, so the default tasks parallelism is 4 at a time.
   
   1. `task.cores=1`
   
   Test code:
   
   ``` scala
   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
   
   val rdd = sc.range(0, 100, 1, 4)
   var rdd1 = rdd.repartition(3)
   
   val treqs = new TaskResourceRequests().cpus(1)
   val rp = new ResourceProfileBuilder().require(treqs).build
   
   rdd1 = rdd1.withResources(rp)
   rdd1.collect()
   ```
   
   When the required `task.cpus=1`, `executor.cores=4` (No executor resource specified, use the default one), there will be 4 tasks running for rp.
   
   The entire Spark application consists of a single Spark job that will be divided into two stages. The first shuffle stage comprises four tasks, all of which will be executed simultaneously. And the second ResultStage comprises 3 tasks, and all of which will be executed simultaneously since the required `task.cpus` is  1.
   
   ![task_1](https://github.com/apache/spark/assets/1320706/b2c76b8f-a9d8-4c33-94be-0c6f4f11af31)
   
   
   
   2. `task.cores=2`
   
   Test code,
   
   ``` scala
   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
   
   val rdd = sc.range(0, 100, 1, 4)
   var rdd1 = rdd.repartition(3)
   
   val treqs = new TaskResourceRequests().cpus(2)
   val rp = new ResourceProfileBuilder().require(treqs).build
   
   rdd1 = rdd1.withResources(rp)
   rdd1.collect()
   ```
   
   When the required `task.cpus=2`, `executor.cores=4` (No executor resource specified, use the default one), there will be 2 tasks running for rp.
   
   The first shuffle stage behaves the same as the first one. 
   
   The second ResultStage comprises 3 tasks, so the first 2 tasks will be running at a time, and then execute the last task.
   
   ![task_2](https://github.com/apache/spark/assets/1320706/ce44dc67-9b77-461b-a041-d266b7b5fb61)
   
   
   3. `task.cores=3`
   
   Test code,
   
   ``` scala
   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
   
   val rdd = sc.range(0, 100, 1, 4)
   var rdd1 = rdd.repartition(3)
   
   val treqs = new TaskResourceRequests().cpus(3)
   val rp = new ResourceProfileBuilder().require(treqs).build
   
   rdd1 = rdd1.withResources(rp)
   rdd1.collect()
   ```
   
   When the required `task.cpus=3`, `executor.cores=4` (No executor resource specified, use the default one), there will be 1 task running for rp.
   
   The first shuffle stage behaves the same as the first one. 
   
   The second ResultStage comprises 3 tasks, all of which will be running serially.
   
   ![task_3](https://github.com/apache/spark/assets/1320706/0b683a68-9624-41fd-b164-7eb45797e592)
   
   
   4. `task.cores=5`
   
   exception happened.
   
   ``` scalas
   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
   
   val rdd = sc.range(0, 100, 1, 4)
   var rdd1 = rdd.repartition(3)
   val treqs = new TaskResourceRequests().cpus(5)
   val rp = new ResourceProfileBuilder().require(treqs).build
   
   rdd1 = rdd1.withResources(rp)
   
   rdd1.collect()
   ```
   
   ``` console
   scala> import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
        | 
        | val rdd = sc.range(0, 100, 1, 4)
        | var rdd1 = rdd.repartition(3)
        | val treqs = new TaskResourceRequests().cpus(5)
        | val rp = new ResourceProfileBuilder().require(treqs).build
        | 
        | rdd1 = rdd1.withResources(rp)
        | 
        | rdd1.collect()
   warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
   org.apache.spark.SparkException: The number of cores per executor (=4) has to be >= the number of cpus per task = 5.
     at org.apache.spark.resource.ResourceUtils$.validateTaskCpusLargeEnough(ResourceUtils.scala:412)
     at org.apache.spark.resource.ResourceProfile.calculateTasksAndLimitingResource(ResourceProfile.scala:182)
     at org.apache.spark.resource.ResourceProfile.$anonfun$limitingResource$1(ResourceProfile.scala:152)
     at scala.Option.getOrElse(Option.scala:201)
     at org.apache.spark.resource.ResourceProfile.limitingResource(ResourceProfile.scala:151)
     at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:141)
     at org.apache.spark.rdd.RDD.withResources(RDD.scala:1829)
     ... 42 elided
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on code in PR #43323:
URL: https://github.com/apache/spark/pull/43323#discussion_r1355964012


##########
core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala:
##########
@@ -138,7 +138,7 @@ class ResourceProfileManagerSuite extends SparkFunSuite {
       rpmanager.isSupported(taskProf)
     }.getMessage
     assert(error === "TaskResourceProfiles are only supported for Standalone " +
-      "and Yarn cluster for now when dynamic allocation is disabled.")
+      "and Yarn and Kubernetes cluster for now when dynamic allocation is disabled.")

Review Comment:
   Good suggestion, Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #43323:
URL: https://github.com/apache/spark/pull/43323#issuecomment-1757233100

   Hi @tgravescs @mridulm, Could you help review this PR which is similar to https://github.com/apache/spark/pull/43030. Thx very much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "tgravescs (via GitHub)" <gi...@apache.org>.
tgravescs commented on code in PR #43323:
URL: https://github.com/apache/spark/pull/43323#discussion_r1355509647


##########
core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala:
##########
@@ -138,7 +138,7 @@ class ResourceProfileManagerSuite extends SparkFunSuite {
       rpmanager.isSupported(taskProf)
     }.getMessage
     assert(error === "TaskResourceProfiles are only supported for Standalone " +
-      "and Yarn cluster for now when dynamic allocation is disabled.")
+      "and Yarn and Kubernetes cluster for now when dynamic allocation is disabled.")

Review Comment:
   nit could change this to "Standalone, YARN, and Kubernetes"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #43323:
URL: https://github.com/apache/spark/pull/43323#issuecomment-1756539542

   This PR is similar to the https://github.com/apache/spark/pull/43030.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

Posted by "wbo4958 (via GitHub)" <gi...@apache.org>.
wbo4958 commented on PR #43323:
URL: https://github.com/apache/spark/pull/43323#issuecomment-1756559085

   ## With dynamic allocation enabled.
   
   ``` bash
   ./bin/spark-shell --master k8s://$K8S_SERVER   \
     --conf spark.kubernetes.container.image=spark:pr_k8s \
     --conf spark.kubernetes.context=minikube  \
     --conf spark.kubernetes.namespace=spark-demo   \
     --verbose \
     --num-executors=1 \
     --conf spark.executor.cores=4  \
     --conf spark.task.cpus=1  \
     --conf spark.dynamicAllocation.enabled=true
   ```
   
   The above command enables the dynamic allocation and the max executors required is set to 1 in order to test.
   
   ### TaskResourceProfile without any specific executor request information
   
   Test code,
   
   ``` scala
   import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
   
   val rdd = sc.range(0, 100, 1, 4)
   var rdd1 = rdd.repartition(3)
   
   val treqs = new TaskResourceRequests().cpus(3)
   val rp = new ResourceProfileBuilder().require(treqs).build
   
   rdd1 = rdd1.withResources(rp)
   rdd1.collect()
   ```
   
   The rp refers to the TaskResourceProfile without any specific executor request information, thus the executor information will utilize the default values from Default ResourceProfile (executor.cores=4).
   
   The above code will require an extra executor which will have the same `executor.cores/memory` as the default ResourceProfile.
   
   ![executor_cores_4_4](https://github.com/apache/spark/assets/1320706/28ea01a8-a81b-48a1-9394-daf8835a9226)
   
   ![task_3_executor_4](https://github.com/apache/spark/assets/1320706/4229b63c-3b39-4132-b018-2470c2fbbbaf)
   
   
   ### Different executor request information 
   
   ``` scala
   import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
   
   val rdd = sc.range(0, 100, 1, 4)
   var rdd1 = rdd.repartition(3)
   
   val ereqs = new ExecutorResourceRequests().cores(6);
   val treqs = new TaskResourceRequests().cpus(5)
   
   val rp = new ResourceProfileBuilder().require(ereqs).require(treqs).build
   
   rdd1 = rdd1.withResources(rp)
   rdd1.collect()
   ```
   ![executor_cores_4_6](https://github.com/apache/spark/assets/1320706/d0fb9f23-ce72-4353-b2bb-3700ef5ecd3b)
   
   we can see the "Executor ID = 2" has the 6 CPU cores.
   
   ![task_5_executor_cores_6](https://github.com/apache/spark/assets/1320706/c58bd66d-0fb5-452f-8fd2-889ddcf52fb8)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org