Posted to reviews@spark.apache.org by "thousandhu (via GitHub)" <gi...@apache.org> on 2023/03/10 09:19:10 UTC

[GitHub] [spark] thousandhu opened a new pull request, #40363: [SPARK_42744] delete uploaded file when job finish for k8s

thousandhu opened a new pull request, #40363:
URL: https://github.com/apache/spark/pull/40363

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Let the driver delete uploaded files when the job finishes.
   
   
   ### Why are the changes needed?
   Currently, files uploaded by the client are never deleted, which leaks files on the remote file system.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. This PR adds a new configuration, spark.kubernetes.uploaded.file.delete.on.termination. It defaults to false, which preserves the current behavior. When set to true, the driver will try to delete the uploaded files when the job finishes.
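   Enabling the proposed cleanup could look like the following spark-submit invocation (a sketch only: spark.kubernetes.file.upload.path is an existing Spark config, while the delete-on-termination flag is the one proposed by this PR and does not exist in released Spark versions; the paths are placeholders):

   ```shell
   spark-submit \
     --master k8s://https://<k8s-apiserver>:6443 \
     --deploy-mode cluster \
     --conf spark.kubernetes.file.upload.path=hdfs:///tmp/spark-uploads \
     --conf spark.kubernetes.uploaded.file.delete.on.termination=true \
     --files /path/to/extra.conf \
     local:///opt/spark/examples/jars/spark-examples.jar
   ```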
   
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] thousandhu commented on a diff in pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "thousandhu (via GitHub)" <gi...@apache.org>.
thousandhu commented on code in PR #40363:
URL: https://github.com/apache/spark/pull/40363#discussion_r1135033927


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala:
##########
@@ -706,6 +706,22 @@ private[spark] object Config extends Logging {
       .stringConf
       .createOptional
 
+  val KUBERNETES_FILE_UPLOADED_FAILS =
+    ConfigBuilder("spark.kubernetes.uploaded.files")
+      .internal()
+      .doc("Remember all uploaded uri by spark client, used to delete uris when app finished.")
+      .version("3.1.2-internal")
+      .stringConf
+      .toSequence
+      .createWithDefault(Nil)
+
+  val KUBERNETES_UPLOAD_FILE_DELETE_ON_TERMINATION =
+    ConfigBuilder("spark.kubernetes.uploaded.file.delete.on.termination")
+      .doc("Deleting uploaded file when app finished")
+      .version("3.1.2")

Review Comment:
   Change version to 3.5.0





[GitHub] [spark] shrprasa commented on pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "shrprasa (via GitHub)" <gi...@apache.org>.
shrprasa commented on PR #40363:
URL: https://github.com/apache/spark/pull/40363#issuecomment-1490831779

   @thousandhu @dongjoon-hyun  @holdenk 
   The approach in this PR only handles cleanup on the driver side. It won't clean up the files if they were uploaded during job submission but the submission then fails for some reason: in that case the driver is never launched, so no cleanup happens.
   
   This Jira, SPARK-42744, is a duplicate of SPARK-42466. I had already created a PR for that issue which handles cleanup on both the driver and the client side in case of app submission failure.
   It also optimizes the file upload by creating just one upload subdirectory, rather than a separate subdirectory for each uploaded file.
   
   https://github.com/apache/spark/pull/40128
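   The single-subdirectory layout described above can be sketched as follows (illustrative Python path helpers only; the names and directory pattern are assumptions and do not match Spark's actual upload code):

   ```python
   import uuid
   from pathlib import PurePosixPath

   def upload_path_per_file(base: str, filename: str) -> PurePosixPath:
       # One randomly named subdirectory per uploaded file: the paths are
       # hard to track, and each directory must be deleted individually.
       return PurePosixPath(base) / f"spark-upload-{uuid.uuid4()}" / filename

   def upload_path_per_submission(base: str, submission_id: str,
                                  filename: str) -> PurePosixPath:
       # One subdirectory per submission: all uploads share a parent, so
       # cleanup is a single recursive delete of that parent directory.
       return PurePosixPath(base) / f"spark-upload-{submission_id}" / filename
   ```

   With the per-submission layout, both the client (on submission failure) and the driver (on termination) can remove one known directory instead of tracking every uploaded URI.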
   




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #40363:
URL: https://github.com/apache/spark/pull/40363#discussion_r1133023054


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala:
##########
@@ -706,6 +706,22 @@ private[spark] object Config extends Logging {
       .stringConf
       .createOptional
 
+  val KUBERNETES_FILE_UPLOADED_FAILS =
+    ConfigBuilder("spark.kubernetes.uploaded.files")
+      .internal()
+      .doc("Remember all uploaded uri by spark client, used to delete uris when app finished.")
+      .version("3.1.2-internal")

Review Comment:
   Thank you for making a PR, but the Apache Spark codebase requires a valid Spark version here, @thousandhu.





[GitHub] [spark] github-actions[bot] commented on pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #40363:
URL: https://github.com/apache/spark/pull/40363#issuecomment-1627554098

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] thousandhu commented on a diff in pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "thousandhu (via GitHub)" <gi...@apache.org>.
thousandhu commented on code in PR #40363:
URL: https://github.com/apache/spark/pull/40363#discussion_r1135033736


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala:
##########
@@ -706,6 +706,22 @@ private[spark] object Config extends Logging {
       .stringConf
       .createOptional
 
+  val KUBERNETES_FILE_UPLOADED_FAILS =
+    ConfigBuilder("spark.kubernetes.uploaded.files")
+      .internal()
+      .doc("Remember all uploaded uri by spark client, used to delete uris when app finished.")
+      .version("3.1.2-internal")

Review Comment:
   I've changed the version to 3.5.0.
   BTW, this is an internal config; we don't want users to set it, so it is marked with .internal().





[GitHub] [spark] github-actions[bot] closed pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s
URL: https://github.com/apache/spark/pull/40363




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #40363:
URL: https://github.com/apache/spark/pull/40363#discussion_r1133023298


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala:
##########
@@ -706,6 +706,22 @@ private[spark] object Config extends Logging {
       .stringConf
       .createOptional
 
+  val KUBERNETES_FILE_UPLOADED_FAILS =
+    ConfigBuilder("spark.kubernetes.uploaded.files")
+      .internal()
+      .doc("Remember all uploaded uri by spark client, used to delete uris when app finished.")
+      .version("3.1.2-internal")
+      .stringConf
+      .toSequence
+      .createWithDefault(Nil)
+
+  val KUBERNETES_UPLOAD_FILE_DELETE_ON_TERMINATION =
+    ConfigBuilder("spark.kubernetes.uploaded.file.delete.on.termination")
+      .doc("Deleting uploaded file when app finished")
+      .version("3.1.2")

Review Comment:
   And, for the new feature or improvement, this should be `3.5.0`.





[GitHub] [spark] thousandhu commented on pull request #40363: [SPARK_42744] delete uploaded file when job finish for k8s

Posted by "thousandhu (via GitHub)" <gi...@apache.org>.
thousandhu commented on PR #40363:
URL: https://github.com/apache/spark/pull/40363#issuecomment-1467439787

   We are using HDFS as the storage.
   A TTL is not sufficient for all apps: streaming and AI training jobs can run for several days. If the TTL is too short, it breaks failover for Spark apps; if it is very long, it wastes storage.

