You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "robreeves (via GitHub)" <gi...@apache.org> on 2023/05/10 19:17:26 UTC

[GitHub] [spark] robreeves commented on pull request #40812: [SPARK-43157][SQL] Clone InMemoryRelation cached plan to prevent cloned plan from referencing same objects

robreeves commented on PR #40812:
URL: https://github.com/apache/spark/pull/40812#issuecomment-1542680938

   > I agree that we do need to copy the cached plan when getting it, but I feel the current change is hard to understand and reason about.
   > 
   > I think a better place to fix is `CacheManager#useCachedDataInternal`. It does copy `InMemoryRelation` but does not copy the physical plan: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L298-L306
   > 
   > We can add a new method `InMemoryRelation#freshCopy` which takes new output attributes and copy the physical plan, then we change `cached.cachedRepresentation.withOutput(currentFragment.output)` to `cached.cachedRepresentation.freshCopy(currentFragment.output)`
   
   The challenge with this approach is that `InMemoryRelation` does not get the physical plan passed in directly. It is through `CachedRDDBuilder`. I originally tried making a new `CachedRDDBuilder` with a copy of the physical plan, but it broke other tests due to the private state it maintains.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org