Posted to issues@spark.apache.org by "Nicholas Chammas (Jira)" <ji...@apache.org> on 2020/09/25 21:41:00 UTC
[jira] [Created] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
Nicholas Chammas created SPARK-33000:
----------------------------------------
Summary: cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
Key: SPARK-33000
URL: https://issues.apache.org/jira/browse/SPARK-33000
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.6
Reporter: Nicholas Chammas
Maybe it's just that the documentation needs to be updated, but I found this surprising:
{code:python}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]
>>> exit()
{code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown.
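To confirm what is left behind after the session exits, a small helper like the one below can list the files remaining under the checkpoint directory. This is only a sketch: {{list_leftover_files}} is a hypothetical name, and it assumes the checkpoint directory is on the local filesystem, as in the example above.

```python
import os


def list_leftover_files(checkpoint_dir):
    """Return the sorted relative paths of all files under checkpoint_dir.

    Hypothetical helper for local filesystems only. A missing directory
    yields an empty list, because os.walk silently skips it.
    """
    leftovers = []
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            full_path = os.path.join(root, name)
            leftovers.append(os.path.relpath(full_path, checkpoint_dir))
    return sorted(leftovers)


if __name__ == "__main__":
    # Run this after the pyspark session above has exited.
    for path in list_leftover_files("/tmp/spark/checkpoint/"):
        print(path)
```

An empty result would mean the checkpoint data was cleaned up. Any printed path is data that survived shutdown.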
The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
> Controls whether to clean checkpoint files if the reference is out of scope.
When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up.
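Until this is resolved, one possible workaround is to delete the checkpoint directory from the driver when the Python process exits. The sketch below is not Spark behavior. The names {{remove_checkpoint_dir}} and {{register_checkpoint_cleanup}} are hypothetical, and it assumes a local checkpoint directory that is not shared with any other application.

```python
import atexit
import shutil


def remove_checkpoint_dir(checkpoint_dir):
    """Delete checkpoint_dir and everything under it (local paths only)."""
    # ignore_errors=True keeps interpreter shutdown quiet if the directory
    # is already gone or cannot be fully removed.
    shutil.rmtree(checkpoint_dir, ignore_errors=True)


def register_checkpoint_cleanup(checkpoint_dir):
    """Register removal of checkpoint_dir at interpreter exit.

    Returns checkpoint_dir unchanged so the call can wrap the argument
    passed to setCheckpointDir.
    """
    atexit.register(remove_checkpoint_dir, checkpoint_dir)
    return checkpoint_dir
```

In the session above this would be used as {{spark.sparkContext.setCheckpointDir(register_checkpoint_cleanup('/tmp/spark/checkpoint/'))}}. Note that it removes the whole directory, so it is only safe when no other job writes checkpoints there, and it does nothing for checkpoint directories on HDFS or other remote filesystems.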