
Posted to reviews@spark.apache.org by "eejbyfeldt (via GitHub)" <gi...@apache.org> on 2023/09/30 09:36:35 UTC

[GitHub] [spark] eejbyfeldt opened a new pull request, #43188: SPARK-45386: Fix correctness issue with StorageLevel.NONE on Dataset

eejbyfeldt opened a new pull request, #43188:
URL: https://github.com/apache/spark/pull/43188

### What changes were proposed in this pull request?
Support for InMemoryTableScanExec in AQE was added in #39624, but that patch contained a bug that shows up when a Dataset is persisted using `StorageLevel.NONE`. Before that patch, a query like:
```
import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()
```
would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines, based on the runtime statistics that are collected here:
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala#L294
that the input is empty. The problem is that the action that should make sure the statistics are collected here
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L285-L291
never uses the iterator, and when we have `StorageLevel.NONE` the persisting does not use the iterator either, so the correct statistics are never gathered.
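
To illustrate the mechanism described above, here is a minimal, self-contained sketch. It is a toy model and not Spark code: `RowCountStats`, `withStats`, `finishWithoutReading` and `finishByDraining` are hypothetical names introduced only for illustration. It assumes that the statistics are updated purely as a side effect of pulling rows through an iterator, so an action that never reads the iterator leaves the row count at 0.

```scala
object LazyStatsSketch {
  // Hypothetical stand-in for the runtime statistics kept for a cached relation.
  final class RowCountStats { var rowCount: Long = 0L }

  // Counts rows only as they are pulled through the returned (lazy) iterator.
  def withStats[T](rows: Iterator[T], stats: RowCountStats): Iterator[T] =
    rows.map { row => stats.rowCount += 1; row }

  // Hypothetical stand-in for an action that completes without reading any row.
  def finishWithoutReading[T](rows: Iterator[T]): Unit = ()

  // Hypothetical stand-in for an action that drains the iterator.
  def finishByDraining[T](rows: Iterator[T]): Unit = rows.foreach(_ => ())

  // Returns (count recorded when nothing reads, count recorded when drained).
  def run(): (Long, Long) = {
    val untouched = new RowCountStats
    finishWithoutReading(withStats(Iterator(1, 2), untouched))
    val drained = new RowCountStats
    finishByDraining(withStats(Iterator(1, 2), drained))
    (untouched.rowCount, drained.rowCount)
  }

  def main(args: Array[String]): Unit = {
    val (untouched, drained) = run()
    println(s"untouched=$untouched drained=$drained") // untouched=0 drained=2
  }
}
```

In this model, a planner that trusts `untouched.rowCount` would conclude that the input is empty even though two rows exist.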

The proposed fix in this patch simply makes calling persist with StorageLevel.NONE a no-op. Changing the action so that it always "emptied" the iterator would also work, but that seems like unnecessary work in most normal circumstances.
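
The no-op idea can be sketched roughly as follows. This is a simplified toy model, not the actual `CacheManager` code: `Level`, `NoneLevel`, `MemoryOnly` and `ToyCacheManager` are hypothetical stand-ins, and plans are represented by plain strings.

```scala
object NoOpPersistSketch {
  // Hypothetical stand-ins for storage levels (not Spark's StorageLevel).
  sealed trait Level
  case object NoneLevel extends Level
  case object MemoryOnly extends Level

  // Hypothetical, heavily simplified cache registry keyed by a plan name.
  final class ToyCacheManager {
    private val cached = scala.collection.mutable.Map.empty[String, Level]

    def cacheQuery(plan: String, level: Level): Unit = {
      if (level == NoneLevel) {
        // Intentionally a no-op: a level that stores nothing should not
        // register the plan as cached at all.
      } else if (!cached.contains(plan)) {
        cached(plan) = level
      }
    }

    def isCached(plan: String): Boolean = cached.contains(plan)
  }

  def main(args: Array[String]): Unit = {
    val manager = new ToyCacheManager
    manager.cacheQuery("rangeQuery", NoneLevel)
    manager.cacheQuery("filterQuery", MemoryOnly)
    println(manager.isCached("rangeQuery"))  // false
    println(manager.isCached("filterQuery")) // true
  }
}
```

Because the plan is never registered in this model, later queries over it are planned as if persist had not been called.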

### Why are the changes needed?
The current code has a correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes, fixes the correctness issue.

### How was this patch tested?
New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1746476570

   @eejbyfeldt I found some conflicts when cherry-picking this commit [a0c9ab6](https://github.com/apache/spark/commit/a0c9ab63f3bcf4c9bb56c407375ce1c8cc26fb02) to Spark 3.5.
   
   Could you file a separate PR against Spark 3.5? Thanks!



[GitHub] [spark] srowen commented on a diff in pull request #43188: [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset

Posted by "srowen (via GitHub)" <gi...@apache.org>.

srowen commented on code in PR #43188:
URL: https://github.com/apache/spark/pull/43188#discussion_r1342137250


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala:
##########
@@ -113,7 +113,8 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
       planToCache: LogicalPlan,
       tableName: Option[String],
       storageLevel: StorageLevel): Unit = {
-    if (lookupCachedData(planToCache).nonEmpty) {
+    if (storageLevel == StorageLevel.NONE) {

Review Comment:
   Maybe just put a comment in the body indicating that this is intentional and why




Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 merged PR #43188:
URL: https://github.com/apache/spark/pull/43188



Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1746475213

   > @WeichenXu123 this commit should also go into the 3.5 branch since it is also affected by the correctness bug. Also, based on the commit message of [a0c9ab6](https://github.com/apache/spark/commit/a0c9ab63f3bcf4c9bb56c407375ce1c8cc26fb02), it does not look like you merged it using the [merge_spark_pr.py](https://github.com/apache/spark/blob/d708fd7b68bf0c9964e861cb2c81818d17d7136e/dev/merge_spark_pr.py) script?
   
   Yes, my fault :) We should use merge_spark_pr.py to merge it.
   
   I will backport it to spark 3.5.



Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1747136587

   Thank you, @eejbyfeldt and all.
   
   To @WeichenXu123 .
   Please use our merge script. It has many more features to help Apache Spark committers.
   - https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py



[GitHub] [spark] mridulm commented on a diff in pull request #43188: [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset

Posted by "mridulm (via GitHub)" <gi...@apache.org>.

mridulm commented on code in PR #43188:
URL: https://github.com/apache/spark/pull/43188#discussion_r1341986626


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -3833,7 +3833,9 @@ class Dataset[T] private[sql](
    * @since 1.6.0
    */
   def persist(newLevel: StorageLevel): this.type = {
-    sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
+    if (newLevel != StorageLevel.NONE) {
+      sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
+    }

Review Comment:
   Shouldn't this be done in `cacheQuery` instead?




Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "eejbyfeldt (via GitHub)" <gi...@apache.org>.

eejbyfeldt commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1742979095

   @WeichenXu123 this commit should also go into the 3.5 branch since it is also affected by the correctness bug. Also, based on the commit message of https://github.com/apache/spark/commit/a0c9ab63f3bcf4c9bb56c407375ce1c8cc26fb02, it does not look like you merged it using the [merge_spark_pr.py](https://github.com/apache/spark/blob/d708fd7b68bf0c9964e861cb2c81818d17d7136e/dev/merge_spark_pr.py) script?



Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1742700463

   @eejbyfeldt The Spark package release pipeline has had some issues recently. Once it is fixed, I will release a new version of graphframe for Spark 3.5.



[GitHub] [spark] eejbyfeldt commented on a diff in pull request #43188: [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset

Posted by "eejbyfeldt (via GitHub)" <gi...@apache.org>.

eejbyfeldt commented on code in PR #43188:
URL: https://github.com/apache/spark/pull/43188#discussion_r1342088619


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -3833,7 +3833,9 @@ class Dataset[T] private[sql](
    * @since 1.6.0
    */
   def persist(newLevel: StorageLevel): this.type = {
-    sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
+    if (newLevel != StorageLevel.NONE) {
+      sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
+    }

Review Comment:
   Yes, that is probably better. Updated the PR.




Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.

mridulm commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1747202242

   Wondering if there is a way to disable that "squash and merge" button @dongjoon-hyun :-)



Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "eejbyfeldt (via GitHub)" <gi...@apache.org>.

eejbyfeldt commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1746599544

   
   
   
   > @eejbyfeldt I found some conflicts when cherry-picking this commit [a0c9ab6](https://github.com/apache/spark/commit/a0c9ab63f3bcf4c9bb56c407375ce1c8cc26fb02) to Spark 3.5.
   > 
   > Could you file a separate PR against Spark 3.5? Thanks!
   
   https://github.com/apache/spark/pull/43213



Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.

WeichenXu123 commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1748374483

   > Thank you, @eejbyfeldt and all.
   > 
   > To @WeichenXu123 . Please use our merge script. It has many more features to help Apache Spark committers. 😄
   > 
   > * https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py
   
   Sure. :)



Re: [PR] [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset [spark]

Posted by "eejbyfeldt (via GitHub)" <gi...@apache.org>.

eejbyfeldt commented on PR #43188:
URL: https://github.com/apache/spark/pull/43188#issuecomment-1742564819

   > Looks OK. I think this needs to go into 3.5 too?
   
   Yes, this fix is also needed in the 3.5 branch.

