Posted to issues@spark.apache.org by "Min Shen (Jira)" <ji...@apache.org> on 2020/12/14 18:58:00 UTC

[jira] [Created] (SPARK-33781) Improve caching of MergeStatus on the executor side to save memory

Min Shen created SPARK-33781:
--------------------------------

             Summary: Improve caching of MergeStatus on the executor side to save memory
                 Key: SPARK-33781
                 URL: https://issues.apache.org/jira/browse/SPARK-33781
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.1.0
            Reporter: Min Shen


MapOutputTrackerWorker caches in memory the MapStatus or MergeStatus array it retrieves from the driver for a given shuffle, so that all tasks fetching data for that shuffle can reuse the cached metadata.

However, unlike the MapStatus array, where each task needs to access every element, each task needs only one or a few MergeStatus objects from the MergeStatus array, depending on which shuffle partitions the task is processing.

For large shuffles with tens or hundreds of thousands of shuffle partitions, caching the entire deserialized and decompressed MergeStatus array on the executor side is a huge waste of memory when perhaps only 0.1% of its entries will be used by the tasks running in that executor.

We could improve this by caching the serialized and compressed bytes of the MergeStatus array instead, and caching only the deserialized MergeStatus objects that are actually needed on the executor side. In addition to saving memory, this also reduces GC pressure on the executor side.
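
To illustrate the idea, here is a minimal sketch in Python rather than Spark's own Scala. It is not the proposed implementation: the MergeStatus fields, the LazyMergeStatusCache class, and the JSON plus zlib encoding are all hypothetical stand-ins chosen to keep the example self-contained. The cache holds only the compressed bytes of the whole array and materializes a MergeStatus object the first time a task asks for that partition.

```python
import json
import zlib
from typing import Dict, List, NamedTuple


class MergeStatus(NamedTuple):
    # Hypothetical stand-in for Spark's MergeStatus; fields are illustrative only.
    partition_id: int
    location: str
    total_size: int


def serialize_statuses(statuses: List[MergeStatus]) -> bytes:
    # Assumed wire format for this sketch: JSON rows compressed with zlib.
    rows = [[s.partition_id, s.location, s.total_size] for s in statuses]
    return zlib.compress(json.dumps(rows).encode("utf-8"))


class LazyMergeStatusCache:
    """Keeps the compressed array and only the MergeStatus objects requested."""

    def __init__(self, compressed: bytes) -> None:
        self._compressed = compressed
        self._decoded: Dict[int, MergeStatus] = {}

    def get(self, partition_id: int) -> MergeStatus:
        cached = self._decoded.get(partition_id)
        if cached is not None:
            return cached
        # On a miss, decode the array transiently and retain only one entry;
        # the decoded rows become garbage as soon as this method returns.
        rows = json.loads(zlib.decompress(self._compressed).decode("utf-8"))
        status = MergeStatus(*rows[partition_id])
        self._decoded[partition_id] = status
        return status

    def decoded_count(self) -> int:
        return len(self._decoded)


# Simulate a shuffle with 10,000 partitions where a task needs only one.
all_statuses = [MergeStatus(i, f"host-{i % 8}", 1024 * (i + 1)) for i in range(10_000)]
blob = serialize_statuses(all_statuses)
cache = LazyMergeStatusCache(blob)
first = cache.get(42)
```

The trade-off in this sketch is extra CPU on each cache miss, since the whole array is decoded again, in exchange for a much smaller long-lived footprint. A real implementation might avoid that cost differently, for example by encoding entries so they can be decoded individually.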


