Posted to issues@spark.apache.org by "Tathagata Das (JIRA)" <ji...@apache.org> on 2015/08/21 23:18:45 UTC

[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files

    [ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707482#comment-14707482 ] 

Tathagata Das commented on SPARK-5836:
--------------------------------------

For everyone who comes across this JIRA: the title is EXTREMELY misleading. Since Spark 1.0, Spark cleans shuffle files based on garbage collection on the driver - when a shuffle is no longer referenced through any active RDD, the corresponding shuffle files are deleted from the executors. It may happen that the GC on the driver does not collect the shuffle objects for a long time (depending on heap size and so on), and so the shuffle files may not be deleted for a long time either.
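
To make the lifecycle concrete, here is a minimal, hypothetical sketch (not Spark code - Spark's actual ContextCleaner is written in Scala) of the pattern described above: a file on disk is tied to a driver-side object via a weak-reference finalizer, so the file is removed only when that object becomes unreachable and is collected.

```python
import gc
import os
import tempfile
import weakref

class Shuffle:
    """Stand-in for a driver-side shuffle dependency (illustrative only)."""
    def __init__(self, shuffle_id: int, data_dir: str):
        self.path = os.path.join(data_dir, f"shuffle_{shuffle_id}.data")
        with open(self.path, "w") as f:
            f.write("intermediate shuffle output")
        # Like the GC-driven cleanup described above: register a callback
        # that deletes the file only once this object is garbage collected.
        weakref.finalize(self, os.remove, self.path)

data_dir = tempfile.mkdtemp()
shuffle = Shuffle(0, data_dir)
assert os.path.exists(shuffle.path)   # file stays while a reference is held

path = shuffle.path
del shuffle        # drop the last reference...
gc.collect()       # ...the file disappears only when a collection runs
assert not os.path.exists(path)
```

The point of the sketch: if nothing forces a collection (e.g. a large driver heap that rarely fills), the files can linger even though nothing references them anymore.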

The only case where there may be issues is when the external shuffle service is used. That service is designed to delete shuffle files only on application exit, so there can be problems with long-running applications such as Spark Streaming apps.
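
For reference, the external shuffle service mentioned above is turned on with a single flag; a sketch of the relevant spark-defaults.conf entry (the exit-time-only deletion it implies is exactly why long-running apps can accumulate files):

```
# spark-defaults.conf: serve shuffle files from an external service so
# executors can be removed; files are then cleaned only on application exit
spark.shuffle.service.enabled  true
```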

> Highlight in Spark documentation that by default Spark does not delete its temporary files
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5836
>                 URL: https://issues.apache.org/jira/browse/SPARK-5836
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Tomasz Dudziak
>            Assignee: Ilya Ganelin
>            Priority: Minor
>             Fix For: 1.3.1, 1.4.0
>
>
> We recently learnt the hard way (in a prod system) that by default Spark does not delete its temporary files until it is stopped. Within a relatively short span of heavy Spark use, the disk of our prod machine filled up completely because of the multiple shuffle files written to it. We think there should be better documentation around the fact that a finished job leaves a lot of rubbish behind, so that this does not come as a surprise.
> Probably a good place to highlight that fact would be the documentation of {{spark.local.dir}} property, which controls where Spark temporary files are written. 
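
As an illustration of the property the report refers to (paths here are hypothetical), {{spark.local.dir}} can be pointed at one or more volumes with enough headroom in spark-defaults.conf:

```
# spark-defaults.conf (illustrative paths): put Spark scratch/shuffle space
# on large disks; a comma-separated list spreads I/O across several disks
spark.local.dir  /mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp
```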



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org