You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wenyuen-db (via GitHub)" <gi...@apache.org> on 2023/08/23 21:01:31 UTC

[GitHub] [spark] wenyuen-db opened a new pull request, #42635: [SPARK-44934][SQL] Use outputSet instead of output to determine if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

wenyuen-db opened a new pull request, #42635:
URL: https://github.com/apache/spark/pull/42635

   
   
   ### What changes were proposed in this pull request?
   
   Originally, when a CTE has duplicate expression IDs in its output, the rule PushdownPredicatesAndPruneColumnsForCTEDef wrongly assesses that the columns in the CTE were pruned, as it compares the size of the attribute set containing the union of columns (which is unique) and the original output of the CTE (which contains duplicate columns) and notices that the former is less than the latter. This causes incorrect pruning of the CTE output, resulting in a missing reference and causing the error as documented in the ticket.
   
   This PR changes the logic to use the needsPruning function to assess whether a CTE has been pruned, which uses the outputSet to check if any columns has been pruned instead of the output.
   
   ### Why are the changes needed?
   
   The incorrect behaviour of PushdownPredicatesAndPruneColumnsForCTEDef in CTEs with duplicate expression IDs in its output causes a crash when such a query is run.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Unit test for the crashing case was added.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wenyuen-db commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "wenyuen-db (via GitHub)" <gi...@apache.org>.
wenyuen-db commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1692223550

   Seems like there is an analyzer issue in 3.3 and 3.4 with dealing with CTEs with duplicate expression IDs that cause the query to still fail in 3.3 and 3.4 despite the fix - will investigate further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wenyuen-db commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "wenyuen-db (via GitHub)" <gi...@apache.org>.
wenyuen-db commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1692251962

   This discrepancy seems to be caused by the change for [SPARK-43838](https://issues.apache.org/jira/browse/SPARK-43838), specifically this [PR](https://github.com/apache/spark/pull/41347/files) that lets the analyzer rule `DeduplicateRelations` handle CTEs with duplicate exprIds without throwing the analyzer error we observe in 3.3 and 3.4.
   
   Should we try to pull out parts of the SPARK-43838 fix and backport them to 3.3 and 3.4, or simply leave the bugfix in 3.5+ only? @sigmod


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wenyuen-db closed pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to determine if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "wenyuen-db (via GitHub)" <gi...@apache.org>.
wenyuen-db closed pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to determine if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef
URL: https://github.com/apache/spark/pull/42635


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1692170578

   This is reverted from branch-3.4 via https://github.com/apache/spark/commit/6b6a0d501b65924c836abc1677267cc292566c80 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sigmod commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to determine if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "sigmod (via GitHub)" <gi...@apache.org>.
sigmod commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1690658124

   @cloud-fan @gengliangwang @peter-toth 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1691865804

   Thanks for the PR! Merged to master, 3.5, 3.4 and 3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1692634817

   It seems hard to decide which part to extract from [SPARK-43838](https://issues.apache.org/jira/browse/SPARK-43838) , I'm fine with leaving the fix in 3.5+ only.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth closed pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth closed pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef
URL: https://github.com/apache/spark/pull/42635


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1692665734

   +1 for keeping this in 3.5+ only. Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wenyuen-db commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "wenyuen-db (via GitHub)" <gi...@apache.org>.
wenyuen-db commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1691820263

   @peter-toth could you help me merge the PR? :) thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #42635: [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #42635:
URL: https://github.com/apache/spark/pull/42635#issuecomment-1692166264

   Hi, @wenyuen-db , @sigmod , @peter-toth .
   
   Unfortunately, newly added test cases seem to break `branch-3.4` due to some difference among branches. I also verified the failure locally.
   
   https://github.com/apache/spark/actions/runs/5965492707/job/16187941738
   
   ![Screenshot 2023-08-24 at 10 46 37 AM](https://github.com/apache/spark/assets/9700541/aee37588-96bb-488c-ba09-41abf6340f9c)
   
   Let me revert this from 3.4 and 3.3. Please make two independent backporting PRs for branch-3.4 and branch-3.3, respectively, @wenyuen-db .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org