Posted to user@spark.apache.org by Alex Landa <me...@gmail.com> on 2019/07/21 06:01:33 UTC

Long-Running Spark application doesn't clean old shuffle data correctly

Hi,

We are running a long-running Spark application (which executes lots of
quick jobs using our scheduler) on a Spark standalone cluster (2.4.0).
We see that old shuffle files (a week old, for example) are not deleted
during the execution of the application, which leads to out-of-disk-space
errors on the executor.
If we re-deploy the application, the Spark cluster takes care of the cleaning
and deletes the old shuffle data (since we have
/-Dspark.worker.cleanup.enabled=true/ in the worker config).
I don't want to re-deploy our app every week or two, but to be able to
configure Spark to clean old shuffle data (as it should).

How can I configure Spark to delete old shuffle data during the lifetime of
the application (not after)?


Thanks,
Alex

Re: Long-Running Spark application doesn't clean old shuffle data correctly

Posted by Aayush Ranaut <aa...@gmail.com>.
This is the job of ContextCleaner. There are a few properties you can tweak to see if that helps (a sketch of setting them follows the list):
spark.cleaner.periodicGC.interval

spark.cleaner.referenceTracking

spark.cleaner.referenceTracking.blocking.shuffle
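
A minimal sketch of setting these when building the SparkSession (Scala); the values shown are just the defaults, and the app name is made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("long-running-app")  // illustrative name
  // How often the driver triggers a GC so ContextCleaner can notice
  // unreachable RDDs/broadcasts/shuffles (default: 30min).
  .config("spark.cleaner.periodicGC.interval", "30min")
  // Reference tracking must stay enabled for shuffle cleanup (default: true).
  .config("spark.cleaner.referenceTracking", "true")
  // Whether the cleaning thread waits for each shuffle removal (default: false).
  .config("spark.cleaner.referenceTracking.blocking.shuffle", "false")
  .getOrCreate()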



Regards

Prathmesh Ranaut


Re: Long-Running Spark application doesn't clean old shuffle data correctly

Posted by Alex Landa <me...@gmail.com>.
Hi Keith,

I don't think that we keep such references.
But we do experience exceptions during job execution that we catch and
retry (timeouts/network issues from different data sources).
Can these affect RDD cleanup?
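
To make the question concrete, here is a purely hypothetical sketch (not our actual code; all names are made up) of the kind of retry pattern that would keep such a reference alive and block shuffle cleanup:

import org.apache.spark.rdd.RDD
import scala.collection.mutable

object RetrySketch {
  // Long-lived collection: anything appended here keeps its ShuffleDependency
  // strongly reachable, so ContextCleaner can never delete those shuffle files.
  val keptForDebugging = mutable.ListBuffer.empty[RDD[_]]

  def runWithRetry[T](job: () => RDD[T], attempts: Int = 3): Unit = {
    var remaining = attempts
    var done = false
    while (!done && remaining > 0) {
      val rdd = job()              // e.g. a pipeline ending in reduceByKey
      try {
        rdd.count()                // action that materializes the shuffle
        done = true
      } catch {
        case _: Exception =>
          keptForDebugging += rdd  // <-- this retention blocks shuffle cleanup
          remaining -= 1
      }
    }
    // If the failed RDD is simply dropped instead, it becomes unreachable after
    // the attempt and its shuffle files can be cleaned on the next driver GC.
  }
}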

Thanks,
Alex


Re: Long-Running Spark application doesn't clean old shuffle data correctly

Posted by Keith Chapman <ke...@gmail.com>.
Hi Alex,

Shuffle files in Spark are deleted when the object holding a reference to
the shuffle file on disk goes out of scope (is garbage collected by the
JVM). Could it be the case that you are keeping these objects alive?
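
A minimal sketch of what I mean (the input path and key derivation are made up for illustration):

// assumes an existing SparkSession named `spark`
val sc = spark.sparkContext

def runBatch(): Long = {
  // reduceByKey introduces a ShuffleDependency and writes shuffle files
  val counts = sc.textFile("/data/events")
    .map(line => (line.take(4), 1L))
    .reduceByKey(_ + _)
  counts.count()
}   // `counts` goes out of scope here

val total = runBatch()
// Once nothing references `counts` (or anything in its lineage) and the driver
// performs a GC -- either naturally or on spark.cleaner.periodicGC.interval --
// ContextCleaner tells the executors to delete that shuffle's files.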

Regards,
Keith.

http://keith-chapman.com



Re: Long-Running Spark application doesn't clean old shuffle data correctly

Posted by Alex Landa <me...@gmail.com>.
Thanks,
I looked into these options; the cleaner's periodic GC interval is set to 30 min
by default.
The blocking option for shuffle -
*spark.cleaner.referenceTracking.blocking.shuffle* - is set to false by
default.
What are the implications of setting it to true?
Will it make the driver slower?
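
For concreteness, this is the kind of change we are considering; the interval value is arbitrary and the System.gc() nudge is just an idea we have not validated:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("long-running-app")                                        // illustrative
  .config("spark.cleaner.periodicGC.interval", "10min")               // default: 30min
  .config("spark.cleaner.referenceTracking.blocking.shuffle", "true") // default: false
  .getOrCreate()

// Or, from the scheduler loop of the long-running driver, between batches:
def afterBatch(): Unit = {
  // ContextCleaner only notices unreachable shuffles once their weak references
  // are enqueued by a driver-side GC, so nudging one occasionally may help.
  System.gc()
}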

Thanks,
Alex
