Posted to reviews@spark.apache.org by "leletan (via GitHub)" <gi...@apache.org> on 2024/03/20 07:33:13 UTC

[PR] [Spark-47475][Core] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

leletan opened a new pull request, #45607:
URL: https://github.com/apache/spark/pull/45607

   
   ### What changes were proposed in this pull request?
   During SparkSubmit, when `isKubernetesClusterModeDriver` is true:
   - Stop appending the primary resource to `spark.jars`, so that the primary resource jar is not listed twice.
   - Make downloading jars to the driver optional, so that executors can fetch jars directly from the remote source instead of from the driver. This avoids a driver hot spot and the resulting problems when scaling the executor count.
   
   ### Why are the changes needed?
   #### Context:
   
   To submit a Spark job to Kubernetes in cluster mode, spark-submit is invoked twice. In the first invocation, SparkSubmit runs in k8s cluster mode: it appends the primary resource to `spark.jars` and calls `KubernetesClientApplication::start` to create a driver pod. The driver pod then runs spark-submit again with the updated configuration (the same application jar, which is now also listed in `spark.jars`). This time SparkSubmit runs in client mode with `spark.kubernetes.submitInDriver` set to `true`. In this mode, every jar in `spark.jars` is downloaded to the driver and its URL is replaced by the driver-local path.

   SparkSubmit then appends the primary resource to `spark.jars` again, so `spark.jars` ends up with two entries for the primary resource: the original URL the user submitted and the driver-local file path. Later, when the driver starts the `SparkContext`, it copies all of `spark.jars` to `spark.app.initial.jar.urls` and replaces the driver-local paths in `spark.app.initial.jar.urls` with driver file-service paths, from which the executors can download those jars.
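
   The flow described above can be modelled with a small sketch. It is a simplified illustration and not the SparkSubmit code: `appendPrimary`, `localize`, `secondPass`, the `appendAgain` flag, the work-dir path, and the example URLs are all hypothetical names chosen for this example.

   ```scala
   // Hypothetical model of the two spark-submit passes; NOT Spark's actual code.
   object DuplicateJarSketch {
     // First pass: the primary resource is appended to the jar list.
     def appendPrimary(jars: Seq[String], primary: String): Seq[String] =
       jars :+ primary

     // Second pass in the driver pod: each URL is replaced by a driver-local path.
     def localize(jars: Seq[String], workDir: String): Seq[String] =
       jars.map(url => s"$workDir/${url.substring(url.lastIndexOf('/') + 1)}")

     // appendAgain = true models the old behaviour, false the proposed one.
     def secondPass(jars: Seq[String], primary: String, appendAgain: Boolean): Seq[String] = {
       val local = localize(jars, "/opt/spark/work-dir") // assumed directory
       if (appendAgain) appendPrimary(local, primary) else local
     }

     def main(args: Array[String]): Unit = {
       val primary = "s3a://bucket/app.jar" // assumed example URL
       val afterFirst = appendPrimary(Seq("s3a://bucket/dep.jar"), primary)
       // Old behaviour: app.jar is listed as a local path and as the original URL.
       println(secondPass(afterFirst, primary, appendAgain = true).mkString(","))
       // Proposed behaviour: app.jar is listed once.
       println(secondPass(afterFirst, primary, appendAgain = false).mkString(","))
     }
   }
   ```

   In this model the old behaviour yields three entries, two of which point at the same `app.jar`, while skipping the second append leaves one entry per jar.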
   
   #### Issues:
   - When the jars are large and the application requests many executors, the executors download the jars from the driver concurrently and saturate its network. The executors' jar downloads then time out and the executors are terminated. From the user's point of view, the application is stuck in a loop of mass executor loss and re-provisioning and never reaches the requested number of live executors, which leads to SLA breaches and sometimes to job failure.
   - Each executor downloads two copies of the primary resource, one via the original URL the user submitted and one via the driver-local file path, which wastes resources.
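
   As a rough back-of-the-envelope illustration of the first issue, the sketch below estimates how long the driver would need to serve one jar to every executor. The jar size, executor count, driver bandwidth, and timeout are made-up numbers, not measurements, and the model assumes the driver uplink is the only bottleneck.

   ```scala
   // Hypothetical estimate; all inputs are illustrative assumptions.
   object DriverFanOutEstimate {
     // Seconds until the last executor finishes when all downloads
     // share the driver's uplink.
     def secondsToServeAll(jarBytes: Long, executors: Int, driverBytesPerSec: Long): Double =
       jarBytes.toDouble * executors / driverBytesPerSec

     def main(args: Array[String]): Unit = {
       val jarBytes = 500L * 1024 * 1024   // assumed 500 MiB application jar
       val executors = 1000                // assumed executor count
       val uplink = 1250L * 1000 * 1000    // assumed 10 Gbit/s = 1.25e9 bytes/s
       val timeoutSecs = 120.0             // assumed download timeout
       val secs = secondsToServeAll(jarBytes, executors, uplink)
       println(s"estimated seconds: $secs, exceeds timeout: ${secs > timeoutSecs}")
     }
   }
   ```

   With these assumed inputs the estimate is roughly 419 seconds, well past the assumed timeout, which is the kind of gap that would produce the executor loss loop described above.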
   
   ### Does this PR introduce _any_ user-facing change?
   Adds the Spark conf `spark.kubernetes.jars.avoidDownloadSchemes`, which lets users opt out of downloading all remote jars to the driver before they are distributed to executors. Executors can instead download some of the jars directly from their remote URLs.
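
   A minimal sketch of how a comma-separated scheme list could be applied to jar URLs follows. It is an illustration only and not the implementation in this PR: `parseSchemes`, `shouldAvoid`, and the example URLs are hypothetical, and treating `*` as "every scheme" follows the wording of the config doc string.

   ```scala
   import java.net.URI

   // Hypothetical helper; NOT the code added by this PR.
   object AvoidDownloadSketch {
     // Parse the comma-separated config value into a set of lower-case schemes.
     def parseSchemes(confValue: String): Set[String] =
       confValue.split(",").map(_.trim.toLowerCase).filter(_.nonEmpty).toSet

     // True if the jar should stay remote, i.e. not be downloaded to the driver.
     // Paths without a scheme (plain local paths) are never "avoided".
     def shouldAvoid(jarUrl: String, schemes: Set[String]): Boolean =
       Option(new URI(jarUrl).getScheme)
         .map(_.toLowerCase)
         .exists(s => schemes.contains("*") || schemes.contains(s))

     def main(args: Array[String]): Unit = {
       val schemes = parseSchemes("s3a, hdfs") // assumed config value
       val jars = Seq("s3a://bucket/big.jar", "https://repo.example.com/small.jar")
       val (remote, toDriver) = jars.partition(shouldAvoid(_, schemes))
       println(s"left remote for executors: $remote")
       println(s"downloaded to driver: $toDriver")
     }
   }
   ```

   With the assumed value `s3a, hdfs`, the `s3a://` jar would be left for executors to fetch directly and the `https://` jar would still go through the driver.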
   
   ### How was this patch tested?
   - Unit test added / modified.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1536711323


##########
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:
##########
@@ -504,6 +504,25 @@ class SparkSubmitSuite
     }
   }
 
+  test("SPARK-47475: Not to add primary resource to jars again" +

Review Comment:
   Oh, this JIRA ID is wrong. We need to have SPARK-47495 like the PR title.





Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "leletan (via GitHub)" <gi...@apache.org>.
leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1533169526


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -1458,6 +1458,18 @@ package object config {
       .doubleConf
       .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+    ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+      .doc("Comma-separated list of schemes for which jars will not be downloaded to the " +
+        "driver local disk prior to be distributed to executors, only for kubernetes deployment. " +
+        "For use in cases when the jars are big and executor counts are high, " +
+        "concurrent download causes network saturation and timeouts. " +
+        "Wildcard '*' is denoted to not downloading jars for any the schemes.")
+      .version("2.3.0")

Review Comment:
   Will fix and move this to another JIRA & PR.





Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2016638641

   Merged to master because the last commit only changes JIRA ID in the test case name.




Re: [PR] [Spark-47475][Core] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

Posted by "leletan (via GitHub)" <gi...@apache.org>.
leletan commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2008959217

   retest this please




Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2010726489

   Ack, @dbtsai .




Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

Posted by "leletan (via GitHub)" <gi...@apache.org>.
leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1533047660


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -1458,6 +1458,18 @@ package object config {
       .doubleConf
       .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+    ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+      .doc("Comma-separated list of schemes for which jars will not be downloaded to the " +
+        "driver local disk prior to be distributed to executors, only for kubernetes deployment. " +
+        "For use in cases when the jars are big and executor counts are high, " +
+        "concurrent download causes network saturation and timeouts. " +
+        "Wildcard '*' is denoted to not downloading jars for any the schemes.")
+      .version("2.3.0")

Review Comment:
   Good catch.





Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2011272331

   +CC @zhouyejoe 




Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "leletan (via GitHub)" <gi...@apache.org>.
leletan commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2011103330

   @dongjoon-hyun Updated the PR and associated it with the new JIRA https://issues.apache.org/jira/browse/SPARK-47495. Please let me know if this looks good to you. Thanks! 




Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2015569666

   Thank you for updating, @leletan .
   
   I'll resume the review this weekend. 




Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #45607: [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode
URL: https://github.com/apache/spark/pull/45607




Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1532943327


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -1458,6 +1458,18 @@ package object config {
       .doubleConf
       .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+    ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+      .doc("Comma-separated list of schemes for which jars will not be downloaded to the " +
+        "driver local disk prior to be distributed to executors, only for kubernetes deployment. " +
+        "For use in cases when the jars are big and executor counts are high, " +
+        "concurrent download causes network saturation and timeouts. " +
+        "Wildcard '*' is denoted to not downloading jars for any the schemes.")
+      .version("2.3.0")

Review Comment:
   This should be `4.0.0`.





Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "leletan (via GitHub)" <gi...@apache.org>.
leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1538543176


##########
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala:
##########
@@ -504,6 +504,25 @@ class SparkSubmitSuite
     }
   }
 
+  test("SPARK-47475: Not to add primary resource to jars again" +

Review Comment:
   Good catch!!!
   Thanks for fixing this!





Re: [PR] [SPARK-47475][CORE] Fix Executors Scaling Issues Caused by Jar Download Under K8s Cluster Mode [spark]

Posted by "dbtsai (via GitHub)" <gi...@apache.org>.
dbtsai commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2010712682

   cc @dongjoon-hyun 




Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "leletan (via GitHub)" <gi...@apache.org>.
leletan commented on code in PR #45607:
URL: https://github.com/apache/spark/pull/45607#discussion_r1533169526


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -1458,6 +1458,18 @@ package object config {
       .doubleConf
       .createWithDefault(1.5)
 
+  private[spark] val KUBERNETES_AVOID_JAR_DOWNLOAD_SCHEMES =
+    ConfigBuilder("spark.kubernetes.jars.avoidDownloadSchemes")
+      .doc("Comma-separated list of schemes for which jars will not be downloaded to the " +
+        "driver local disk prior to be distributed to executors, only for kubernetes deployment. " +
+        "For use in cases when the jars are big and executor counts are high, " +
+        "concurrent download causes network saturation and timeouts. " +
+        "Wildcard '*' is denoted to not downloading jars for any the schemes.")
+      .version("2.3.0")

Review Comment:
   Will move this to another JIRA & PR.





Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2015572561

   I just assigned this to me in order not to forget. It doesn't block any community reviews.




Re: [PR] [SPARK-47495][CORE] Fix primary resource jar added to spark.jars twice under k8s cluster mode [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45607:
URL: https://github.com/apache/spark/pull/45607#issuecomment-2016639300

   Welcome to the Apache Spark community, @leletan . 
   
   I added you to the Apache Spark contributor group (in JIRA) and assigned SPARK-47495 to you.
   
   Congratulations on your first commit, @leletan .

