Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/04 18:59:27 UTC

[GitHub] [spark] nchammas opened a new pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

nchammas opened a new pull request #31742:
URL: https://github.com/apache/spark/pull/31742


   ### What changes were proposed in this pull request?
   
   This PR fixes Spark's behavior when `spark.cleaner.referenceTracking.cleanCheckpoints` is set to `true`. Previously, checkpoint data was left behind when Spark shut down, even with this setting enabled. With this PR, Spark cleans up checkpoint data on shutdown, as expected.
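   
   At a high level, the idea is to register the cleaner's pending checkpoint cleanups as a shutdown hook, so they still run when the JVM exits. A simplified sketch of the registration (the actual wiring lives in `ContextCleaner`; names here are approximate, not the exact diff):
   
   ```scala
   // Simplified sketch, not the exact code in this PR.
   import org.apache.spark.util.ShutdownHookManager
   
   private val checkpointCleanupHook: AnyRef =
     ShutdownHookManager.addShutdownHook { () =>
       // Walk the references that are still registered and delete any
       // checkpoint data they point to (see the cleanupOnShutdown() method in the diff).
       cleanupOnShutdown()
     }
   ```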
   
   ### Why are the changes needed?
   
   Spark's behavior does not match the description of the `spark.cleaner.referenceTracking.cleanCheckpoints` configuration. 
   
   > Controls whether to clean checkpoint files if the reference is out of scope.
   
   When Spark is shut down, all references go out of scope. These changes bring Spark's behavior in line with the intended (and documented) behavior.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Previously, if you set `cleanCheckpoints` to `true`, checkpointed some data, and then immediately shut down Spark, that checkpoint data would be left intact. With this PR, it is now cleaned up on shutdown.
   
   ### How was this patch tested?
   
   I tested this PR manually as follows:
   
   ```sh
   # Clean up any existing checkpoint directory and start Spark.
   trash ./checkpoint/
   ./bin/pyspark --conf 'spark.cleaner.referenceTracking.cleanCheckpoints=true'
   ```
   
   ```python
   # Checkpoint some data and exit.
   spark.sparkContext.setCheckpointDir('./checkpoint/')
   spark.range(3).checkpoint()
   import sys; sys.exit()
   ```
   
   ```sh
   # Confirm that no checkpoint data remains in the checkpoint directory.
   ls -l ./checkpoint/
   ```
   




[GitHub] [spark] Ngone51 commented on a change in pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #31742:
URL: https://github.com/apache/spark/pull/31742#discussion_r592304336



##########
File path: core/src/main/scala/org/apache/spark/ContextCleaner.scala
##########
@@ -207,6 +207,29 @@ private[spark] class ContextCleaner(
     }
   }
 
+  /**
+   * Cleanup resources on shutdown.
+   *
+   * This is important for resources that don't live in memory, like checkpoint data,
+   * that outlive the Spark process. The normal cleanup thread in [[keepCleaning]]
+   * doesn't cleanup things on shutdown because it's interrupted by the shutdown
+   * process itself.
+   */
+  private def cleanupOnShutdown(): Unit = {
+    referenceBuffer.toArray(Array[CleanupTaskWeakReference]())
+    .foreach { ref =>
+      ref.task match {
+        case CleanCheckpoint(rddId) =>

Review comment:
       Shuffle files are handled by `DiskBlockManager`.
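   
   Roughly, `DiskBlockManager` registers its own shutdown hook that recursively deletes the local block directories it owns, which is why shuffle and disk-cached blocks disappear on shutdown even though `keepCleaning` gets interrupted. A simplified illustration of that pattern (not the actual implementation; `addLocalDirCleanupHook` is just an illustrative name):
   
   ```scala
   // Simplified illustration only; see DiskBlockManager for the real logic.
   import java.io.File
   import org.apache.spark.util.{ShutdownHookManager, Utils}
   
   def addLocalDirCleanupHook(localDirs: Seq[File]): AnyRef = {
     ShutdownHookManager.addShutdownHook { () =>
       localDirs.foreach { dir =>
         if (dir.isDirectory) {
           Utils.deleteRecursively(dir)
         }
       }
     }
   }
   ```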






[GitHub] [spark] nchammas commented on a change in pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on a change in pull request #31742:
URL: https://github.com/apache/spark/pull/31742#discussion_r587740705



##########
File path: core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala
##########
@@ -351,6 +351,10 @@ class ContextCleanerSuite extends ContextCleanerSuiteBase {
       case _ => false
     }, askStorageEndpoints = true).isEmpty)
   }
+
+  test("SPARK-33000: shutdown cleans up checkpointed data when configured to do so") {
+    // TODO: How do I write a test for this?
+  }

Review comment:
       I am not sure what would be an appropriate test to add here. I can directly call the `cleanupOnShutdown()` method, but that doesn't prove that Spark will do the right thing when it shuts down.
   
   Ideally, I would write some sort of acceptance test that launches a tiny Spark application and checks the checkpoint directory afterwards. But that seems very bulky and I don't know if there is an existing pattern in the codebase that I can follow.
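   
   To make that concrete, the kind of end-to-end check I have in mind would look roughly like this. This is only a sketch: `checkpoint_then_exit.py` is a hypothetical helper script (it would set the checkpoint dir, checkpoint an RDD, and exit), and I haven't checked this against the existing test utilities.
   
   ```scala
   // Sketch only: run a tiny job in its own JVM so the real shutdown path executes,
   // then assert that the checkpoint directory is empty afterwards.
   import scala.sys.process._
   import org.apache.spark.util.Utils
   
   test("SPARK-33000: shutdown cleans up checkpointed data when configured to do so") {
     val checkpointDir = Utils.createTempDir()
     val sparkHome = sys.env.getOrElse("SPARK_HOME", fail("SPARK_HOME is not set"))
     val exitCode = Seq(
       s"$sparkHome/bin/spark-submit",
       "--master", "local[2]",
       "--conf", "spark.cleaner.referenceTracking.cleanCheckpoints=true",
       "checkpoint_then_exit.py",  // hypothetical script, not in the codebase
       checkpointDir.getAbsolutePath).!
     assert(exitCode === 0)
     assert(checkpointDir.listFiles().isEmpty)
   }
   ```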






[GitHub] [spark] nchammas commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-791032652


   > Why didn't [this](https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L179) work? Can I get more analysis on why the current code is broken?
   > 
   > Are you saying (purely my guess) that the keepCleaning process is interrupted as soon as sc is stopped, and thus does not get a chance to do the actual cleaning?
   
   That's correct. I mentioned this in [my comment here](https://github.com/apache/spark/pull/31742/files#diff-94fafee9e1c5fefb2cb673151a31682c9a66f5605544021f16d33449eb1522b8R210-R217) for the new `cleanupOnShutdown` utility.
   
   > In that case, you have found something more general... checkpoint cleaning is not the only task that needs to be moved to a shutdown hook...
   
   This is a good point. I suppose shuffle data and disk caching of RDDs are potentially also affected. In that case, `keepCleaning` should probably be refactored so it calls a separate cleanup method that is also added as a shutdown hook.
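   
   Something along these lines, maybe (a rough sketch only; the helper names mirror the existing per-task `doCleanup*` methods in `ContextCleaner`, but I haven't worked out the details):
   
   ```scala
   // Rough sketch of the refactoring idea, not actual code from this PR.
   private def doCleanupTask(task: CleanupTask, blocking: Boolean): Unit = task match {
     case CleanRDD(rddId) => doCleanupRDD(rddId, blocking)
     case CleanShuffle(shuffleId) => doCleanupShuffle(shuffleId, blocking)
     case CleanBroadcast(broadcastId) => doCleanupBroadcast(broadcastId, blocking)
     case CleanCheckpoint(rddId) => doCleanCheckpoint(rddId)
     case _ => // accumulators, listeners, etc. stay as they are today
   }
   
   // keepCleaning() would call doCleanupTask() for each dequeued reference, and a
   // shutdown hook would drain whatever is still in referenceBuffer at exit:
   private val cleanupHook = ShutdownHookManager.addShutdownHook { () =>
     referenceBuffer.toArray(Array[CleanupTaskWeakReference]())
       .foreach(ref => doCleanupTask(ref.task, blocking = false))
   }
   ```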
   
   > That would explain why you are able to repro locally but I'm not able to repro with my real script; my script runs much longer after the last checkpoint.
   
   I posted a reproduction in the description of this PR. If you follow my steps, are you able to reproduce the issue?
   
   It's possible that if your script runs for a long time after the last checkpoint, and the checkpoint goes out of scope long before shutdown, then the checkpoint will get cleaned up. That would explain why you are not seeing the issue. If you checkpoint something right before shutdown you should be able to reproduce the issue.




[GitHub] [spark] hvanhovell commented on a change in pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on a change in pull request #31742:
URL: https://github.com/apache/spark/pull/31742#discussion_r591830020



##########
File path: core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala
##########
@@ -351,6 +351,10 @@ class ContextCleanerSuite extends ContextCleanerSuiteBase {
       case _ => false
     }, askStorageEndpoints = true).isEmpty)
   }
+
+  test("SPARK-33000: shutdown cleans up checkpointed data when configured to do so") {
+    // TODO: How do I write a test for this?
+  }

Review comment:
       We have a couple of thrift tests that do something like that. I have to admit it is not pretty.






[GitHub] [spark] SparkQA removed a comment on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790870975


   **[Test build #135768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135768/testReport)** for PR 31742 at commit [`ce28904`](https://github.com/apache/spark/commit/ce289045f2572b15fe1d580ed2ae8520c294096f).




[GitHub] [spark] attilapiros commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-796003983


   > When Spark is shut down, all references go out of scope. These changes bring Spark's behavior in line with the intended (and documented) behavior.
   
   I think you are wrong here: checkpoint data should outlive unexpected shutdowns. So there is a very important difference between a reference going out of scope during normal execution (in which case cleanup is expected, depending on the config you mentioned) and a reference going out of scope because of an unexpected error (in which case you should keep the checkpoint data).
   
   




[GitHub] [spark] SparkQA commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790872391


   **[Test build #135769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135769/testReport)** for PR 31742 at commit [`ee6782b`](https://github.com/apache/spark/commit/ee6782bd376c0ea7ec720dabc436b9bb69a69651).




[GitHub] [spark] attilapiros commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-796436314


   @nchammas
   If I am right, you have found something that should be clarified in the documentation and tested directly with a unit test.
   That would be a very good contribution.
   
   To have the checkpoint directory cleaned up at shutdown, one can register the directory to be deleted on exit:
   
   ```
    FileSystem fs = FileSystem.get(conf);
    fs.deleteOnExit(checkpointPath);
   ```
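   
   For example, from a Scala application this could be applied to the configured checkpoint directory (user code, just an illustration; the path is made up):
   
   ```scala
   // User-side illustration only: register the checkpoint directory to be
   // deleted when the JVM exits normally.
   import org.apache.hadoop.fs.Path
   
   sc.setCheckpointDir("/tmp/my-checkpoints")  // example path
   sc.getCheckpointDir.foreach { dir =>
     val path = new Path(dir)
     path.getFileSystem(sc.hadoopConfiguration).deleteOnExit(path)
   }
   ```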
   




[GitHub] [spark] hvanhovell commented on a change in pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on a change in pull request #31742:
URL: https://github.com/apache/spark/pull/31742#discussion_r591830413



##########
File path: core/src/main/scala/org/apache/spark/ContextCleaner.scala
##########
@@ -207,6 +207,29 @@ private[spark] class ContextCleaner(
     }
   }
 
+  /**
+   * Cleanup resources on shutdown.
+   *
+   * This is important for resources that don't live in memory, like checkpoint data,
+   * that outlive the Spark process. The normal cleanup thread in [[keepCleaning]]
+   * doesn't cleanup things on shutdown because it's interrupted by the shutdown
+   * process itself.
+   */
+  private def cleanupOnShutdown(): Unit = {
+    referenceBuffer.toArray(Array[CleanupTaskWeakReference]())
+    .foreach { ref =>
+      ref.task match {
+        case CleanCheckpoint(rddId) =>

Review comment:
       How are shuffles cleaned? 






[GitHub] [spark] AmplabJenkins commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790973422








[GitHub] [spark] nchammas commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-820597207


   Closing this PR per [my latest comment on Jira](https://issues.apache.org/jira/browse/SPARK-33000?focusedCommentId=17322333&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17322333). Perhaps someone can update the docs for `cleanCheckpoints` in a separate PR.




[GitHub] [spark] nchammas commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790950343


   The [one test failure](https://github.com/apache/spark/pull/31742/checks?check_run_id=2033911288#step:9:12581) appears unrelated to this PR.




[GitHub] [spark] SparkQA commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790870975


   **[Test build #135768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135768/testReport)** for PR 31742 at commit [`ce28904`](https://github.com/apache/spark/commit/ce289045f2572b15fe1d580ed2ae8520c294096f).




[GitHub] [spark] SparkQA removed a comment on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790872391


   **[Test build #135769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135769/testReport)** for PR 31742 at commit [`ee6782b`](https://github.com/apache/spark/commit/ee6782bd376c0ea7ec720dabc436b9bb69a69651).




[GitHub] [spark] hywang203 commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
hywang203 commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790992232


   [Anything](https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L197) other than cleanupCheckpoint is marked as a blocking call... that alone can cause unfinished cleanup.
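   
   For reference, the dispatch in `keepCleaning()` looks roughly like this (paraphrased, not an exact quote of the current code):
   
   ```scala
   // Paraphrased from ContextCleaner.keepCleaning(); flag names may differ slightly.
   ref.task match {
     case CleanRDD(rddId) =>
       doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
     case CleanShuffle(shuffleId) =>
       doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
     case CleanBroadcast(broadcastId) =>
       doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
     case CleanCheckpoint(rddId) =>
       // Note: no blocking flag here, unlike the other cleanup tasks.
       doCleanCheckpoint(rddId)
     case _ => // accumulators, listeners, ...
   }
   ```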




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790935554


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40351/
   




[GitHub] [spark] Ngone51 commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-796724146


   > Something else is ensuring that bit of cleanup on shutdown.
   
   RDD/Shuffle files are handled by the `DiskBlockManager`.




[GitHub] [spark] SparkQA commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790966556


   **[Test build #135769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135769/testReport)** for PR 31742 at commit [`ee6782b`](https://github.com/apache/spark/commit/ee6782bd376c0ea7ec720dabc436b9bb69a69651).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] attilapiros edited a comment on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
attilapiros edited a comment on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-796003983


   > When Spark is shut down, all references go out of scope. These changes bring Spark's behavior in line with the intended (and documented) behavior.
   
   I think you are wrong here: checkpoint data should outlive unexpected shutdowns. So there is a very important difference between a reference going out of scope during normal execution (in which case cleanup is expected, depending on the config you mentioned) and a reference going out of scope because of an unexpected error (in which case you should keep the checkpoint data).
   
   So even after an unexpected exit, the next run of the same app could pick up the checkpointed data.
   




[GitHub] [spark] nchammas commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-797624636


   > To have the checkpoint directory cleaned up at shutdown, one can register the directory to be deleted on exit:
   > 
   > ```
   >  FileSystem fs = FileSystem.get(conf);
   >  fs.deleteOnExit(checkpointPath);
   > ```
   
   Are you suggesting I do that somewhere in this PR, or are you saying this is the workaround we should recommend to people who actually want checkpoints to be cleaned up automatically on shutdown?
   
   If the former, then I am confused, because that goes against your argument that we should not do automatic cleanup on shutdown. So I think you mean the latter. If so, I'm not sure where we would document that suggestion for users. Every language requires a different technique, so I don't think it makes sense to add it to the docs for `.checkpoint()`.




[GitHub] [spark] nchammas closed pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas closed pull request #31742:
URL: https://github.com/apache/spark/pull/31742


   




[GitHub] [spark] nchammas commented on a change in pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on a change in pull request #31742:
URL: https://github.com/apache/spark/pull/31742#discussion_r587745794



##########
File path: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
##########
@@ -755,6 +755,16 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
     assert(output.toList === List(Int.MaxValue, Int.MaxValue, 4, 3, 2, Int.MinValue, Int.MinValue))
   }
 
+  test("shutdown hook priorities") {
+    import org.apache.spark.util.ShutdownHookManager.{
+      DEFAULT_SHUTDOWN_PRIORITY,
+      SPARK_CONTEXT_SHUTDOWN_PRIORITY,
+      TEMP_DIR_SHUTDOWN_PRIORITY
+    }
+    assert(SPARK_CONTEXT_SHUTDOWN_PRIORITY < DEFAULT_SHUTDOWN_PRIORITY)
+    assert(TEMP_DIR_SHUTDOWN_PRIORITY < SPARK_CONTEXT_SHUTDOWN_PRIORITY)

Review comment:
       The comment blocks above these constants explain that they _must_ be ordered in a certain way, so I figured it would be helpful to check that ordering explicitly in a test.






[GitHub] [spark] nchammas edited a comment on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas edited a comment on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-795924048


   In my testing I'm seeing that disk-persisted data (i.e. `.persist(DISK_ONLY)`) is cleaned up on shutdown, so that is somehow not affected by the shutdown interruption to `keepCleaning()` that we discussed. Something else is ensuring that bit of cleanup on shutdown.
   
   The last people to work on this cleanup logic, from what I can tell, were @witgo and @andrewor14 (e.g. #855). But that was ~6 years ago, and I don't know if they are still contributing to the project.
   
   I think it's fine to limit the scope of this fix to checkpoint data (since that's what the ticket is about), but if we can reproduce other problems with the shutdown cleanup then I suppose we could fix them in this PR as well.
   
   Let me see if I can get a committer from the dev list to chime in here.




[GitHub] [spark] SparkQA commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790929676


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40351/
   




[GitHub] [spark] attilapiros commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-797647181


   The latter. If it is documented, it should come with a huge warning, since in that case the data will be lost whenever an unexpected exit happens, which will lead to massive recomputation.
   
   This definitely does not belong in the API docs; it should go in one of the programming guides, like `rdd-programming-guide.md`.




[GitHub] [spark] nchammas commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-795924048


   In my testing I'm seeing that disk-persisted data (i.e. `.persist(DISK_ONLY)`) is cleaned up on shutdown, so that is somehow not affected by the shutdown interruption to `keepCleaning()` that we discussed. Something else is ensuring that bit of cleanup on shutdown.
   
   The last people to work on this cleanup logic, from what I can tell, were @witgo and @andrewor14. But that was ~6 years ago, and I don't know if they are still contributing to the project.
   
   I think it's fine to limit the scope of this fix to checkpoint data (since that's what the ticket is about), but if we can reproduce other problems with the shutdown cleanup then I suppose we could fix them in this PR as well.
   
   Let me see if I can get a committer from the dev list to chime in here.




[GitHub] [spark] SparkQA commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790907633


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40351/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790935554


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40351/
   




[GitHub] [spark] hywang203 commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
hywang203 commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790987483


   Why didn't [this](https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L179) work? Can I get more analysis on why the current code is broken?
   
   Are you saying (purely my guess) that the keepCleaning process is interrupted as soon as sc is stopped, and thus does not get a chance to do the actual cleaning? In that case, you have found something more general... checkpoint cleaning is not the only task that needs to be moved to a shutdown hook...
   
   That would explain why you are able to repro locally but I'm not able to repro with my real script; my script runs much longer after the last checkpoint.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790973416








[GitHub] [spark] SparkQA commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-790953880


   **[Test build #135768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135768/testReport)** for PR 31742 at commit [`ce28904`](https://github.com/apache/spark/commit/ce289045f2572b15fe1d580ed2ae8520c294096f).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] nchammas commented on pull request #31742: [SPARK-33000] Cleanup checkpoint data on shutdown

Posted by GitBox <gi...@apache.org>.
nchammas commented on pull request #31742:
URL: https://github.com/apache/spark/pull/31742#issuecomment-796110416


   > I think you are wrong here: checkpoint data should outlive unexpected shutdowns.
   
   There's a good case to make for that behavior, sure, but that's a new guarantee that you are asking to formalize. As far as I'm aware, the docs offer no such guarantee today, so it's not fair to call my characterization of the current intended behavior "wrong". :)
   
   I see you posted a similar comment [on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html), so I will respond there in more detail.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org