Posted to dev@spark.apache.org by Steve Loughran <st...@cloudera.com.INVALID> on 2021/04/06 12:45:23 UTC

Re: Shutdown cleanup of disk-based resources that Spark creates

On Thu, 11 Mar 2021 at 19:58, Attila Zsolt Piros <
piros.attila.zsolt@gmail.com> wrote:

> I agree with you that we should extend the documentation around this.
> Moreover, I support having specific unit tests for this.
>
> > There is clearly some demand for Spark to automatically clean up
> > checkpoints on shutdown
>
> What about what I suggested on the PR? To clean up the checkpoint
> directory at shutdown, one can register the directory to be deleted at
> exit:
>
>  FileSystem fs = FileSystem.get(conf);  // conf: the Hadoop Configuration
>  fs.deleteOnExit(checkpointPath);       // checkpointPath: Path of the checkpoint dir
>
I wouldn't recommend that. It's really for testing, and it should probably
get tagged as deprecated. Better for your own cleanup code to have some
atomic bool which makes the decision.
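
A minimal sketch of that pattern, assuming your own code owns the
checkpoint directory (CheckpointCleaner and its fields are hypothetical
names, not anything in Spark or Hadoop):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: delete the checkpoint dir exactly once, from
    // whichever path (explicit stop or the JVM shutdown hook) gets there
    // first.
    class CheckpointCleaner {
      private final AtomicBoolean deleted = new AtomicBoolean(false);
      private final Path checkpointPath;
      private final Configuration conf;

      CheckpointCleaner(Path checkpointPath, Configuration conf) {
        this.checkpointPath = checkpointPath;
        this.conf = conf;
        Runtime.getRuntime().addShutdownHook(new Thread(this::cleanup));
      }

      void cleanup() {
        // compareAndSet is the atomic decision: only the first caller
        // actually deletes.
        if (deleted.compareAndSet(false, true)) {
          try {
            FileSystem fs = checkpointPath.getFileSystem(conf);
            fs.delete(checkpointPath, true);  // recursive
          } catch (IOException e) {
            // best-effort during shutdown: log and move on
          }
        }
      }
    }

Call cleanup() from the normal shutdown path too; the flag guarantees the
delete runs at most once.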


   1. It does the deletes sequentially: the more paths, the longer it takes.
   2. It doesn't notice/skip if a file has changed since it was added.
   3. It doesn't distinguish between files and dirs. So if you register a
      file /temp/1 and then replace it with a dir /temp/1, the entire tree
      gets deleted on shutdown. Is that what you wanted? (See the sketch
      after this list.)
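
To illustrate point 3, a hypothetical sequence against the stock FileSystem
API (fs as in the snippet quoted above):

    // Register a plain file for delete-at-exit...
    Path p = new Path("/temp/1");
    fs.createNewFile(p);
    fs.deleteOnExit(p);

    // ...then the file is removed and a directory reuses the name.
    fs.delete(p, false);
    fs.mkdirs(new Path("/temp/1/important-data"));

    // On JVM exit the registered path is deleted recursively: the whole
    // /temp/1 tree goes, including data that was never registered.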

I've played with some optimisation of the s3a case
(https://github.com/apache/hadoop/pull/1924); but really it should be
something like:

- store the checksum/timestamp/size on submit, plus the dir/file status
- only delete on a match
- do this in a thread pool (though you can't always create thread pools
  during shutdown, can you?)
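
A rough sketch of that idea (GuardedDeleteOnExit, the status comparison and
the pool size are all assumptions of mine, not what the s3a PR does):

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical "guarded" delete-on-exit: remember what each path looked
    // like when it was registered, and only delete it if it still matches.
    class GuardedDeleteOnExit {
      private final FileSystem fs;
      private final Map<Path, FileStatus> registered = new ConcurrentHashMap<>();

      GuardedDeleteOnExit(FileSystem fs) { this.fs = fs; }

      void register(Path path) throws IOException {
        registered.put(path, fs.getFileStatus(path));  // status at submit time
      }

      void deleteAll() throws InterruptedException {
        // Parallel deletes; note a pool can't always be created this late
        // in JVM shutdown.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        registered.forEach((path, oldStatus) -> pool.submit(() -> {
          try {
            FileStatus now = fs.getFileStatus(path);
            // Only delete on a match: same file/dir kind, length and mtime.
            if (now.isDirectory() == oldStatus.isDirectory()
                && now.getLen() == oldStatus.getLen()
                && now.getModificationTime() == oldStatus.getModificationTime()) {
              fs.delete(path, true);
            }
          } catch (IOException e) {
            // already gone, or the store is unreachable: skip it
          }
        }));
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
      }
    }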

But of course, do that and something, somewhere, will break.

Safer to roll your own.