Posted to user@spark.apache.org by jelmer <jk...@gmail.com> on 2020/05/24 13:42:15 UTC

Cleanup hook for temporary files produced as part of a Spark job

I am writing something that partitions a data set and then trains a machine
learning model on the data in each partition.

The resulting model is very big, and right now I am storing it in an RDD as
a pair of:
partition_id and very_big_model_that_is_hundreds_of_megabytes_big

But it is becoming increasingly apparent that storing data that large in a
single row of an RDD causes all sorts of complications.
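
Roughly, what I have now looks something like the following (Model and
trainModel are just stand-ins for the actual model type and training code I
am using):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}

    // Stand-ins for the real model type and training code
    case class Model(weights: Array[Double])
    def trainModel(rows: Iterator[Row]): Model = Model(Array.fill(1000)(0.0))

    // One (partition_id, model) pair per RDD row; each model is hundreds of MB
    def trainPerPartition(dataset: DataFrame): RDD[(Int, Model)] =
      dataset.rdd.mapPartitionsWithIndex { (partitionId, rows) =>
        Iterator((partitionId, trainModel(rows)))
      }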

So I figured that instead I could save each model to the filesystem and
store a pointer to it (the file path) in the RDD. Then I would simply
load the model again in a mapPartitions function and avoid the issue.
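
In other words, something like this, reusing the stand-ins from above
(tempDir is just a placeholder for some shared location, e.g. a directory
on HDFS):

    import java.io.{ObjectInputStream, ObjectOutputStream}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    // Write each partition's model to a shared filesystem and keep only its path in the RDD
    def trainAndSpill(dataset: DataFrame, tempDir: String): RDD[(Int, String)] =
      dataset.rdd.mapPartitionsWithIndex { (partitionId, rows) =>
        val model = trainModel(rows)
        val path = new Path(s"$tempDir/model-$partitionId.bin")
        // note: a bare Configuration() on the executor may not carry all of the driver's Hadoop settings
        val fs = path.getFileSystem(new Configuration())
        val out = new ObjectOutputStream(fs.create(path))
        try out.writeObject(model) finally out.close()
        Iterator((partitionId, path.toString))
      }

    // Later, load the model back inside mapPartitions instead of shipping it in the RDD
    def useModels(pointers: RDD[(Int, String)]): RDD[Int] =
      pointers.mapPartitions { iter =>
        iter.map { case (partitionId, modelPath) =>
          val path = new Path(modelPath)
          val fs = path.getFileSystem(new Configuration())
          val in = new ObjectInputStream(fs.open(path))
          val model = try in.readObject().asInstanceOf[Model] finally in.close()
          // ... do something with `model` here ...
          partitionId
        }
      }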

But that raises the question of when to clean up these temporary files. Is
there some way to ensure that files written out by Spark code get cleaned up
when the SparkSession ends or the RDD is no longer referenced?
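
For example, I imagined something along these lines, but I don't know
whether a SparkListener is the right tool for files I create myself (and it
would only cover a clean shutdown of the application, not an RDD that is
simply no longer referenced):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}
    import org.apache.spark.sql.SparkSession

    // Delete the temp directory when the application ends
    def registerCleanup(spark: SparkSession, tempDir: String): Unit = {
      val hadoopConf = spark.sparkContext.hadoopConfiguration
      spark.sparkContext.addSparkListener(new SparkListener {
        override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
          val path = new Path(tempDir)
          path.getFileSystem(hadoopConf).delete(path, true) // recursive delete
        }
      })
    }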

Or is there any other solution to this problem?