Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/05 01:25:10 UTC

[GitHub] [spark] alex-balikov opened a new pull request, #37413: [SPARK-39983][CORE] Do not cache unserialized broadcast relations on the driver

alex-balikov opened a new pull request, #37413:
URL: https://github.com/apache/spark/pull/37413

### What changes were proposed in this pull request?

This PR addresses the issue raised in https://issues.apache.org/jira/browse/SPARK-39983: broadcast relations should not be cached in unserialized form on the driver, because they are not needed there and can cause significant memory pressure (in one observed case the relation was 60MB).

The PR adds a new SparkContext.broadcastInternal method with a serializedOnly parameter, which lets the caller specify that the broadcast object should be stored only in serialized form. The current behavior is to also cache an unserialized copy of the object.

The PR changes the broadcast implementation in TorrentBroadcast to honor the serializedOnly flag and not store the unserialized value, unless execution is in local mode (single process). In local mode the broadcast cache is effectively shared between the driver and the executors, so the unserialized value still needs to be cached to serve the executor side.
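
The caching rule described above can be sketched with a small toy model. This is not Spark's TorrentBroadcast code: the `ToyBroadcast` class, its `isLocalMode` parameter, and the `hasCachedValue` accessor are hypothetical names chosen for illustration, and plain Java serialization stands in for Spark's serializer. Only the `serializedOnly` name comes from this PR.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical toy model of the rule in this PR; not Spark's implementation.
final class ToyBroadcast[T <: AnyRef](value: T, serializedOnly: Boolean, isLocalMode: Boolean) {
  // The serialized form is always kept, since it is what gets shipped out.
  private val bytes: Array[Byte] = ToyBroadcast.serialize(value)

  // The unserialized copy is dropped only when the caller asked for
  // serialized-only storage AND we are not in local (single process) mode.
  private val cached: Option[T] =
    if (serializedOnly && !isLocalMode) None else Some(value)

  def hasCachedValue: Boolean = cached.isDefined

  // Readers fall back to deserializing when no cached copy exists.
  def get: T = cached.getOrElse(ToyBroadcast.deserialize[T](bytes))
}

object ToyBroadcast {
  def serialize(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

In this model, `new ToyBroadcast("relation", serializedOnly = true, isLocalMode = false)` keeps no unserialized copy, while the same call with `isLocalMode = true` does, mirroring the local-mode exception described above. In both cases `get` still returns the original value.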

### Why are the changes needed?

Broadcast relations can be fairly large (one observed relation was 60MB) and are not needed in unserialized form on the driver.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Added a new unit test to BroadcastSuite verifying the low-level broadcast functionality with respect to the serializedOnly flag.
Added a new unit test to BroadcastExchangeSuite verifying that broadcast relations are not cached on the driver.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
