Posted to dev@spark.apache.org by Matthias Boehm <mb...@googlemail.com> on 2017/09/20 17:29:45 UTC

Broadcast Memory Management

Hi all,

could someone please help me understand the broadcast life cycle in detail,
especially with regard to memory management?

After reading through the TorrentBroadcast implementation, it seems that
for every broadcast object the driver holds a strong reference to a
shallow copy of the value as well as a deep copy of the data in chunked
form (both stored at the MEMORY_AND_DISK level). Now my questions:

1) Is this observation correct or does the driver also hold a strong
reference to the entire object in serialized form?

2) Are there scenarios, other than with local master or explicit reads in
the driver, where the shallow copy is actually used by Spark?

3) Is it a valid workaround to create a wrapper object around the data,
broadcast the wrapper, and immediately delete the data once it has been
blockified, in order to avoid this unnecessary memory overhead?
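
To make question (3) concrete, here is a minimal, Spark-free sketch of the
wrapper idea. It only mimics the relevant mechanics: `DataWrapper` is a
hypothetical class name, and plain Java serialization stands in for
TorrentBroadcast's blockification. The point illustrated is that once the
serialized (chunked) copy exists, nulling the wrapper's field on the driver
drops the strong reference to the raw data while a deserializing reader
(the executor side) still recovers it in full.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical wrapper; a mutable field lets the driver release the data
// after the broadcast machinery has made its serialized copy.
class DataWrapper(var data: Array[Double]) extends Serializable

object WrapperDemo {
  def main(args: Array[String]): Unit = {
    val wrapper = new DataWrapper(Array.fill(4)(1.0))

    // Stand-in for TorrentBroadcast's blockification: serialize the
    // wrapper into a byte array (the "chunked" deep copy).
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(wrapper)
    oos.close()
    val blocks = bos.toByteArray

    // Driver side: drop the strong reference to the raw data, so only
    // the serialized form remains reachable.
    wrapper.data = null

    // "Executor" side: deserializing the blocks still yields the data.
    val ois = new ObjectInputStream(new ByteArrayInputStream(blocks))
    val remote = ois.readObject().asInstanceOf[DataWrapper]
    println(remote.data.sum)
  }
}
```

Note the caveat implicit in question (2): if Spark ever serves reads from
the driver's shallow copy (e.g. with a local master), that copy now points
at the mutated wrapper, so this trick would break those reads.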


Regards,
Matthias