Posted to issues@spark.apache.org by "Haoyuan Wang (Jira)" <ji...@apache.org> on 2020/10/15 17:09:00 UTC

[jira] [Comment Edited] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

    [ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214840#comment-17214840 ] 

Haoyuan Wang edited comment on SPARK-33000 at 10/15/20, 5:08 PM:
-----------------------------------------------------------------

This config is not picked up when set after the SparkContext has been initialized. It has to be set at submission time, before the SparkContext object is created.

 

I was able to reproduce this with Spark 2.3.0; setting the configuration during job submission resolved the issue. What needs improvement is the documentation, so let me open a ticket for that and close this one.
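For example, a minimal sketch of setting the config at submission time (the app name and file name here are illustrative):

{code:python}
# Set the cleaner config before the SparkContext exists, either on the command line:
#
#   spark-submit --conf spark.cleaner.referenceTracking.cleanCheckpoints=true app.py
#
# or programmatically via the builder, before the session (and its context) is created:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("checkpoint-cleanup-demo")  # illustrative name
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
    .getOrCreate()
)
spark.sparkContext.setCheckpointDir("/tmp/spark/checkpoint/")
{code}

Setting the same key through spark.conf.set() after the session exists (as in the reproduction quoted below) has no effect, since the cleaner reads it once when the context starts.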



> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> ------------------------------------------------------------------------
>
>                 Key: SPARK-33000
>                 URL: https://issues.apache.org/jira/browse/SPARK-33000
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]                                                           
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
> {quote}Controls whether to clean checkpoint files if the reference is out of scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
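To make the scope-based semantics concrete: with the config in place before startup, cleanup happens while the application is still running, once a checkpointed RDD becomes unreachable and is garbage-collected. A rough sketch (cleanup is asynchronous and GC-driven, so the timing below is not guaranteed):

{code:python}
import gc
import time

from pyspark.sql import SparkSession

# The config must already be set when the context is created.
spark = (
    SparkSession.builder
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
    .getOrCreate()
)
spark.sparkContext.setCheckpointDir("/tmp/spark/checkpoint/")

df = spark.range(10).checkpoint()  # eager by default: writes checkpoint files now
df = None                          # drop the only reference to the checkpointed data
gc.collect()                       # let the JVM-side RDD become collectable

# The ContextCleaner removes the checkpoint files asynchronously once the RDD
# is garbage-collected, so the files may linger briefly. On plain shutdown,
# nothing forces a final sweep, which is the behavior reported above.
time.sleep(10)
{code}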


