You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/02 12:07:19 UTC

[GitHub] [spark] sos3k commented on pull request #37413: [SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver

sos3k commented on PR #37413:
URL: https://github.com/apache/spark/pull/37413#issuecomment-1235422141

   <img width="490" alt="Screenshot 2022-09-02 at 13 57 44" src="https://user-images.githubusercontent.com/2758082/188138030-e74fc0d0-4220-4ab1-a665-de907158c29a.png">
   Hi everyone, I am Radek from HuuugeGames, we use Databricks in a version of runtime 10.4 LTS and I wanted to just let you know that after including your changes to the runtime (Databricks did that at 26.08 during their maintenance) we found our job started to behave inconsistently as from time to time we are pruning all of the source files during the scanning from s3 with using dynamic file pruning. I attached the screenshot that shows lost of broadcasted data during the DFP (1 records read from Reuse Exchange, where normally there should be 97) which results no records read from the S3. We are making join between small dim and events table and definitely something is happening here. After disable of DFP the plan has changed and the process looks stable. We also got back to the previous version of the Databricks runtime image without this changes and also process looks good even when DFP is enabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org