You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "ahshahid (via GitHub)" <gi...@apache.org> on 2023/11/15 22:00:04 UTC

[PR] [SARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]

ahshahid opened a new pull request, #43824:
URL: https://github.com/apache/spark/pull/43824

   ### What changes were proposed in this pull request?
   The main change in this PR is to augment the trait of SupportsRuntimeV2Filtering by adding two new methods
   `default boolean equalToIgnoreRuntimeFilters(Scan other) {
       return this.equals(other);
     }
   
     default int hashCodeIgnoreRuntimeFilters() {
       return this.hashCode();
     }`
   
   which the underlying V2 Scan should implement.
   The BatchScanExec also gets modified accordingly where it invokes this two methods to check the equality of the Scan.
   
   Pls note that this PR includes code of 2 other PRs too
   1) [SPARK-45658](https://github.com/apache/spark/pull/43737)
   This PR though not required per se, but is good to have for correctness ( & my other PR for broadcast var pushdown relies on this  fix)
   2) [SPARK-45926](https://github.com/apache/spark/pull/43808)
   This PR is necessary to reproduce the issue and hence its code is needed for this PR to show the issue.
   
   **Also for this test to pass the code of DataSourceV2Relation.computeStats should disable throwing assertion error in testing, as that is a separate bug which gets hit, when the bug test for this PR is run.**
   
   ### Why are the changes needed?
   This change is need IMO to fix the issue of re-use of exchange not happening when DPP filters are pushed to the scan level.
   The issue is this:
   In certain types of queries for eg TPCDS Query 14b, the reuse of exchange does not happen in AQE , resulting in perf degradation.
   The spark TPCDS tests are unable to catch the problem, because the InMemoryScan used for testing do not implement the equals & hashCode correctly , in the sense , that they do take into account the pushed down run time filters.
   
   In concrete Scan implementations, for eg iceberg's SparkBatchQueryScan , the equality check , apart from other things, also involves Runtime Filters pushed ( which is correct).
   
   Below is description of how this issue surfaces.
   For a given stage being materialized, just before materialization starts, the run time filters are confined to the BatchScanExec level.
   Only when the actual RDD corresponding to the BatchScanExec, is being evaluated, do the runtime filters get pushed to the underlying Scan.
   
   Now if a new stage is created and it checks in the stageCache using its canonicalized plan to see if a stage can be reused, it fails to find the r-usable stage even if the stage exists, because the canonicalized spark plan present in the stage cache, has now the run time filters pushed to the Scan , so the incoming canonicalized spark plan does not match the key as their underlying scans differ . that is incoming spark plan's scan does not have runtime filters , while the canonicalized spark plan present as key in the stage cache has the scan with runtime filters pushed.
   
   The fix as I have worked is to provide, two methods in the SupportsRuntimeV2Filtering interface ,
   default boolean equalToIgnoreRuntimeFilters(Scan other)
   
   { return this.equals(other); }
   default int hashCodeIgnoreRuntimeFilters()
   
   { return this.hashCode(); }
   In the BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then instead of batch.equals, it should call scan.equalToIgnoreRuntimeFilters
   
   And the underlying Scan implementations should provide equality which excludes run time filters.
   
   Similarly the hashCode of BatchScanExec, should use scan.hashCodeIgnoreRuntimeFilters instead of ( batch.hashCode).
   
   ### Does this PR introduce _any_ user-facing change?
   No. But the respective DataSourceV2Relations may need to augment their code.
   
   ### How was this patch tested?
   Added bug test for the same.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.
ahshahid commented on PR #43824:
URL: https://github.com/apache/spark/pull/43824#issuecomment-1813334777

   I will add the documentation to the new methods in next commit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.
ahshahid commented on PR #43824:
URL: https://github.com/apache/spark/pull/43824#issuecomment-1815264581

   To see the bug, you can do any of the following:
   1)  Comment out the implementation of equalsIgnoreRuntimeFilter and hashCodeIgnoreRuntimeFilters from InMemoryV2FilterBatchScan.
   or
   2) Take the BatchScanExec from master and update the code and run the test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #43824: [SARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters  are pushed down to the underlying Scan (like iceberg)
URL: https://github.com/apache/spark/pull/43824


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.
ahshahid commented on PR #43824:
URL: https://github.com/apache/spark/pull/43824#issuecomment-1815043359

   Actual source code changes are relatively few only 3   - 4 files modified. rest is test resources
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #43824:
URL: https://github.com/apache/spark/pull/43824#issuecomment-1962766688

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org