Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:45:09 UTC
[jira] [Resolved] (SPARK-20598) Iterative checkpoints do not get removed from HDFS
[ https://issues.apache.org/jira/browse/SPARK-20598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-20598.
----------------------------------
Resolution: Incomplete
> Iterative checkpoints do not get removed from HDFS
> --------------------------------------------------
>
> Key: SPARK-20598
> URL: https://issues.apache.org/jira/browse/SPARK-20598
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core, YARN
> Affects Versions: 2.1.0
> Reporter: Guillem Palou
> Priority: Major
> Labels: bulk-closed
>
> I am running a pyspark application that makes use of {{dataframe.checkpoint()}} because Spark needed exponential time to compute the query plan and I eventually had to stop it. Using {{checkpoint}} allowed the application to proceed with the computation, but I noticed that the HDFS cluster was filling up with RDD files. Spark is running on YARN in client mode.
> I managed to reproduce the problem in a toy example as below:
> {code}
> import gc
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
>
> df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
> for i in range(4):
>     # either of the following two lines reproduces the problem
>     df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
>     df = df.join(df, on='a').cache().checkpoint()
>
> # the following two lines do not seem to have an effect
> gc.collect()
> sc._jvm.System.gc()
> {code}
> After running the code and calling {{sc.stop()}}, I can still see the checkpointed RDDs in HDFS:
> {quote}
> guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
> 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
> 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
> 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
> 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
> 5.2 K $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
> {quote}
> The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set to {{true}}. I would expect Spark to clean up the checkpoint files of all RDDs that are no longer reachable from the driver.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org