Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:45:09 UTC

[jira] [Resolved] (SPARK-20598) Iterative checkpoints do not get removed from HDFS

     [ https://issues.apache.org/jira/browse/SPARK-20598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20598.
----------------------------------
    Resolution: Incomplete

> Iterative checkpoints do not get removed from HDFS
> --------------------------------------------------
>
>                 Key: SPARK-20598
>                 URL: https://issues.apache.org/jira/browse/SPARK-20598
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, YARN
>    Affects Versions: 2.1.0
>            Reporter: Guillem Palou
>            Priority: Major
>              Labels: bulk-closed
>
> I am running a PySpark application that makes use of {{DataFrame.checkpoint()}}: without checkpointing, Spark needed exponential time to compute the query plan and I eventually had to stop the job. Using {{checkpoint}} allowed the application to proceed with the computation, but I noticed that the HDFS cluster kept filling up with RDD files. Spark is running on YARN in client mode.
> I managed to reproduce the problem in the toy example below:
> {code}
> import gc
>
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
>
> # assumes an active SparkSession `spark` and SparkContext `sc` (e.g. the pyspark shell)
> df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
> for i in range(4):
>     # either of the following two lines is enough to reproduce the problem
>     df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
>     df = df.join(df, on='a').cache().checkpoint()
>     # the following two lines do not seem to have any effect
>     gc.collect()
>     sc._jvm.System.gc()
> {code}
> After running the code and calling {{sc.stop()}}, I can still see the checkpointed RDDs in HDFS:
> {quote}
> guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
> {quote}
> The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set to {{true}}, so I would expect Spark to clean up all checkpointed RDDs that are no longer reachable.
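> For reference, a minimal sketch of the setup this assumes (not part of the original repro): the {{cleanCheckpoints}} flag has to be in place when the SparkContext is created, since the ContextCleaner only removes a checkpoint's files once the corresponding RDD object is garbage-collected on the driver. The app name and checkpoint path below are illustrative assumptions:
> {code}
> from pyspark.sql import SparkSession
>
> # cleanCheckpoints must be set at session creation time; the ContextCleaner
> # removes a checkpoint's files only after the RDD object is garbage-collected
> # on the driver.
> spark = (SparkSession.builder
>          .appName("checkpoint-cleanup-repro")  # illustrative name
>          .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
>          .getOrCreate())
>
> # checkpoint files land under this directory (illustrative path)
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> {code}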



