You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Thomas Graves (Jira)" <ji...@apache.org> on 2022/03/25 18:05:00 UTC

[jira] [Resolved] (SPARK-37618) Support cleaning up shuffle blocks from external shuffle service

     [ https://issues.apache.org/jira/browse/SPARK-37618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-37618.
-----------------------------------
    Fix Version/s: 3.3.0
         Assignee: Adam Binford
       Resolution: Fixed

> Support cleaning up shuffle blocks from external shuffle service
> ----------------------------------------------------------------
>
>                 Key: SPARK-37618
>                 URL: https://issues.apache.org/jira/browse/SPARK-37618
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Adam Binford
>            Assignee: Adam Binford
>            Priority: Major
>             Fix For: 3.3.0
>
>
> Currently shuffle data is not cleaned up when an external shuffle service is used and the associated executor has been deallocated before the shuffle is cleaned up. Shuffle data is only cleaned up once the application ends.
> There have been various issues filed for this:
> https://issues.apache.org/jira/browse/SPARK-26020
> https://issues.apache.org/jira/browse/SPARK-17233
> https://issues.apache.org/jira/browse/SPARK-4236
> But shuffle files will still stick around until an application completes. Dynamic allocation is commonly used for long running jobs (such as structured streaming), so any long running jobs with a large shuffle involved will eventually fill up local disk space. The shuffle service already supports cleaning up shuffle service persisted RDDs, so it should be able to support cleaning up shuffle blocks as well once the shuffle is removed by the ContextCleaner. 
> The current alternative is to use shuffle tracking instead of an external shuffle service, but this is less optimal from a resource perspective as all executors must be kept alive until the shuffle has been fully consumed and cleaned up (and with the default GC interval being 30 minutes this can waste a lot of time with executors held onto but not doing anything).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org