Posted to dev@hive.apache.org by "Marcelo Vanzin (JIRA)" <ji...@apache.org> on 2014/12/13 00:02:13 UTC

[jira] [Commented] (HIVE-9017) Clean up temp files of RSC [Spark Branch]

    [ https://issues.apache.org/jira/browse/HIVE-9017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244944#comment-14244944 ] 

Marcelo Vanzin commented on HIVE-9017:
--------------------------------------

These files are created by Spark when downloading resources for the application (e.g. application jars). In standalone mode they end up, by default, in /tmp ({{java.io.tmpdir}}). The problem is that the application doesn't clean up these files; in fact, it can't, because they are meant to be shared when multiple executors run on the same host, so no single executor can unilaterally decide to delete them.

(That's not entirely true; strictly speaking an executor could delete them, but then other executors would have to re-download the files when needed, adding overhead.)

This is not a problem in Yarn mode, since the temp dir is under a Yarn-managed directory that is deleted when the app shuts down.

So, while I think of a clean way to fix this in Spark, the following can be done on the Hive side:

- create an app-specific temp directory before launching the Spark app
- set {{spark.local.dir}} to that location
- delete the directory when the client shuts down
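A minimal sketch in Java of the steps above, assuming the client can pass Spark properties through to spark-submit; the class and method names here are illustrative, not actual Hive RSC code:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class SparkTempDirExample {

    // Step 1: create an app-specific scratch directory before launching
    // the Spark app, so its downloaded resources don't land in /tmp.
    static Path createAppTempDir(String appId) throws IOException {
        return Files.createTempDirectory("spark-local-" + appId + "-");
    }

    // Step 3: recursively delete the directory (files before parents)
    // when the client shuts down.
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())
                .map(Path::toFile)
                .forEach(File::delete);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tempDir = createAppTempDir("hive-app");

        // Step 2: point spark.local.dir at the directory. How the
        // property reaches spark-submit is up to the client; this
        // just shows the key/value pair to pass.
        String conf = "spark.local.dir=" + tempDir.toAbsolutePath();
        System.out.println(conf);

        // Step 3: tie cleanup to client shutdown via a JVM shutdown hook.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                deleteRecursively(tempDir);
            } catch (IOException ignored) {
                // best-effort cleanup on exit
            }
        }));
    }
}
```

Using a shutdown hook is one option; a client with an explicit stop/close path could call the delete there instead, which is more reliable than relying on the JVM exiting cleanly.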

> Clean up temp files of RSC [Spark Branch]
> -----------------------------------------
>
>                 Key: HIVE-9017
>                 URL: https://issues.apache.org/jira/browse/HIVE-9017
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>
> Currently RSC will leave a lot of temp files in {{/tmp}}, including {{*_lock}}, {{*_cache}}, {{spark-submit.*.properties}}, etc.
> We should clean up these files, or they will exhaust disk space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)