You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/05 01:25:10 UTC

[GitHub] [spark] alex-balikov opened a new pull request, #37413: [SPARK-39983][CORE} Do not cache unserialized broadcast relations on the driver

alex-balikov opened a new pull request, #37413:
URL: https://github.com/apache/spark/pull/37413

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This PR addresses the issue raised in https://issues.apache.org/jira/browse/SPARK-39983 - broadcast relations should not be cached on the driver as they are not needed and can cause significant memory pressure (in one case the relation was 60MB )
   
   The PR adds a new SparkContext.broadcastInternal method with parameter serializedOnly allowing the caller to specify that the broadcasted object should be stored only in serialized form. The current behavior is to also cache an unserialized form of the object.
   
   The PR changes the broadcast implementation in TorrentBroadcast to honor the serializedOnly flag and not store the unserialized value, unless the execution is in a local mode (single process). In that case the broadcast cache is effectively shared between driver and executors and thus the unserialized value needs to be cached to satisfy the executor-side of the functionality.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   The broadcast relations can be fairly large (observed 60MB one) and are not needed in unserialized form on the driver.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   Added a new unit test to BroadcastSuite verifying the low-level broadcast functionality in respect to the serializedOnly flag.
   Added a new unit test to BroadcastExchangeSuite verifying that broadcasted relations are not cached on the driver.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sos3k commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
sos3k commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1235422141

   <img width="490" alt="Screenshot 2022-09-02 at 13 57 44" src="https://user-images.githubusercontent.com/2758082/188138030-e74fc0d0-4220-4ab1-a665-de907158c29a.png">
   Hi everyone, I am Radek from HuuugeGames, we use Databricks in a version of runtime 10.4 LTS and I wanted to just let you know that after including your changes to the runtime (Databricks did that at 26.08 during their maintenance) we found our job started to behave inconsistently as from time to time we are pruning all of the source files during the scanning from s3 with using dynamic file pruning. I attached the screenshot that shows lost of broadcasted data during the DFP (1 records read from Reuse Exchange, where normally there should be 97) which results no records read from the S3. We are making join between small dim and events table and definitely something is happening here. After disable of DFP the plan has changed and the process looks stable. We also got back to the previous version of the Databricks runtime image without this changes and also process looks good even when DFP is enabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37413: [SPARK-39983][CORE] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1206124066

   Doesn't this not break in local mode ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] alex-balikov commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
alex-balikov commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1210078179

   I applied @JoshRosen 's change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1207116255

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] JoshRosen commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
JoshRosen commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1209009482

   > QQ @JoshRosen, @alex-balikov - if the expectation is that the variable can be recreated (if missing) at driver - with that being a remote possibility, do we want to make it a `WeakReference` instead of a `SoftReference` ? Softref's are kept around as much as possible by the jvm (usually until there are possibilities of OOM), unlike Weakref's
   
   Good question. It looks like PR https://github.com/apache/spark/pull/22995 originally proposed to make this into a `WeakReference` but ended up using a `SoftReference` in an attempt to reduce the likelihood that a broadcast variable would be cleared on GC. I'm slightly hesitant to want to change behaviors that could impact user created broadcast variables (although maybe I'm being overly-conservative).
   
   I don't think that changing `SoftReference` to `WeakReference` would negatively impact broadcast hash joins or other SQL operators that use internal broadcast variables: each of those operators retrieves the broadcast value and stores a strong reference to it for the duration of the task, so as long as one task is running the WeakReference here and in the cache added in https://github.com/apache/spark/pull/20183 should still be alive, so the cache will be effective in keeping the broadcast variable value from being duplicated.
   
   I agree that the `WeakReference` could make sense for these internal broadcast variables on the driver, though. It looks like `SoftReference` and `WeakReference` share a common superclass, so we can flag the reference type based on whether this is an internal or external broadcast variable. Here's a patch implementing that idea : https://github.com/JoshRosen/spark/commit/01fd8be179274d3fe298f691264e2e1bd37bca2e
   
   @alex-balikov, @mridulm, WDYT? If that WeakRef change makes sense to you, let's cherry-pick it into this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1206775251

   @JoshRosen Ah yes, missed out on that - should have taken a more detailed look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] JoshRosen commented on pull request #37413: [SPARK-39983][CORE] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
JoshRosen commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1206762340

   @mridulm, it doesn't break local mode because there's a carve-out to preserve the existing behavior in that case: in both places where the `if(serializedOnly` check changes behavior, there's a check for `isLocalMaster` to avoid behavior changes:
   
   We'll still store the original object in the driver block manager at write time in local mode:
   
   https://github.com/apache/spark/blob/75ab18ee0e382b8117bf65fc9ef05190d4fdf01a/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L133-L136
   
   There's a similar carve-out in `readBroadcastBlock` (although I don't think we'd ever actually hit that branch in local mode given that we would have already stored the re-assembled broadcast block in `writeBlocks`):
   
   https://github.com/apache/spark/blob/75ab18ee0e382b8117bf65fc9ef05190d4fdf01a/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L277-L284
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1208697473

   QQ @JoshRosen, if the expectation is that the variable can be recreated (if missing) at driver - with that being a remote possibility, do we want to make it a `WeakReference` instead of a `SoftReference` ?
   SoftReference's are fairly aggressively kept around (usually until there are possibilities of OOM).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] JoshRosen closed pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
JoshRosen closed pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver
URL: https://github.com/apache/spark/pull/37413


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] JoshRosen commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

Posted by GitBox <gi...@apache.org>.
JoshRosen commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1236173944

   Hi @sos3k,
   
   We investigated your bug report and determined that the root-cause was a latent bug in `UnsafeHashedRelation` that was triggered more frequently following the backport of this PR.
   
   PR https://github.com/apache/spark/pull/35836 fixed a bug related to driver-side deserialization of `UnsafeHashedRelation`, but that fix was missing from DBR 9.1 and 10.4 (but is present in all newer versions). We have deployed a hotfix to backport https://github.com/apache/spark/pull/35836 into DBR 9.1 and 10.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org