Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/21 05:44:20 UTC

[GitHub] [spark] prakharjain09 opened a new pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

prakharjain09 opened a new pull request #28881:
URL: https://github.com/apache/spark/pull/28881


   ### What changes were proposed in this pull request?
   Create a single rule that combines ReuseExchange and ReuseSubquery. It traverses the plan once, in post-order, and replaces duplicated nodes with ReusedExchangeExec or ReusedSubqueryExec.
   
   This fixes the `ReusedExchangeExec Reference issue`, where a ReusedExchangeExec points to an Exchange that does not exist anywhere in the query plan.
   
   
   ### Why are the changes needed?
   
   Currently Spark makes 3 passes over the plan to identify and replace nodes that can become ReusedExchangeExec or ReusedSubqueryExec:
   Phase-1: The first pass is done in the ReuseExchange rule, which replaces a duplicated Exchange with ReusedExchangeExec.
   Phase-2: The second pass was introduced by dynamic partition pruning (DPP) in the ReuseExchange rule. It finds every InSubqueryExec, traverses the plan inside it, and replaces each relevant Exchange with ReusedExchangeExec.
   Phase-3: The third pass is done in the ReuseSubquery rule, which identifies reusable ExecSubqueryExpression nodes and replaces them with ReusedSubqueryExec.
   
   When Phase-2 or Phase-3 changes anything in the subtree of an Exchange, the id of that Exchange changes. This can leave a ReusedExchangeExec pointing to an Exchange that no longer exists in the plan.
   
   Example: suppose this is the plan after Phase-1 for a self-join of a view.
   
                                        SORTMERGEJOIN         
              Exchange (id=1234)                          ReusedExchangeExec (points-to-id=1234)
                             |
                        ChildSubtree
   
   Suppose DPP has been applied inside ChildSubtree. Phase-2 then rewrites the plan inside InSubqueryExec to use a reused broadcast, and in the process the whole hierarchy of ChildSubtree changes, so the Exchange above it gets a new id:
   
                                        SORTMERGEJOIN         
              Exchange (id=1878)                        ReusedExchangeExec (points-to-id=1234)
                             |
                       NewChildSubtree
   
   But `ReusedExchangeExec (points-to-id=1234)` still points to id 1234, which is no longer in the plan, so no reuse happens.
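
The stale reference can be reproduced in a toy model. The sketch below is not Spark code: `Node`, `transform_up`, `ids_in`, and `rewrite_subtree` are hypothetical names, and it only assumes that a plan node is rebuilt with a fresh id whenever one of its children changes.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Optional, Set, Tuple

_next_id = itertools.count(1)


@dataclass(frozen=True)
class Node:
    """Toy plan node; every new instance gets a fresh id."""
    name: str
    children: Tuple["Node", ...] = ()
    points_to: Optional[int] = None  # set only on "ReusedExchange" nodes
    id: int = field(default_factory=lambda: next(_next_id))


def transform_up(node: Node, rule: Callable[[Node], Node]) -> Node:
    """Post-order rewrite; a node whose children changed is rebuilt with a new id."""
    new_children = tuple(transform_up(c, rule) for c in node.children)
    if any(new is not old for new, old in zip(new_children, node.children)):
        node = Node(node.name, new_children, node.points_to)
    return rule(node)


def ids_in(node: Node) -> Set[int]:
    """All node ids present in the plan."""
    return {node.id}.union(*(ids_in(c) for c in node.children))


# Plan after Phase-1: the right side of the join reuses the left Exchange.
exchange = Node("Exchange", (Node("ChildSubtree"),))
reused = Node("ReusedExchange", points_to=exchange.id)
after_phase1 = Node("SortMergeJoin", (exchange, reused))


# Stand-in for Phase-2: a later pass rewrites something below the Exchange.
def rewrite_subtree(n: Node) -> Node:
    return Node("NewChildSubtree") if n.name == "ChildSubtree" else n


after_phase2 = transform_up(after_phase1, rewrite_subtree)

print(reused.points_to in ids_in(after_phase1))  # True: the target exists
print(reused.points_to in ids_in(after_phase2))  # False: the target id is gone
```

The second pass never touches the `ReusedExchange` node itself, which is why its `points_to` value silently goes stale.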
   
   This PR fixes the issue by merging Phase-1, Phase-2, and Phase-3 into a single post-order traversal.
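
A minimal sketch of why one post-order pass avoids the stale reference, using a toy model and not the PR's actual implementation. `Plan`, `Reused`, `reuse_in_one_pass`, and `rewrite` are hypothetical names; `rewrite` stands in for the subquery-related rewrites. Because children are finalized before their parent is examined, an Exchange is only registered for reuse once it is in its final form.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy plan node compared structurally (a stand-in for a canonicalized plan)."""
    name: str
    children: Tuple["Plan", ...] = ()


@dataclass(frozen=True)
class Reused:
    """Toy reuse marker holding a reference to the final node it reuses."""
    target: Plan


def reuse_in_one_pass(node: Plan, rewrite: Callable[[Plan], Plan],
                      seen: Dict[Plan, Plan]):
    # Post-order: children are rewritten and deduplicated first.
    children = tuple(reuse_in_one_pass(c, rewrite, seen) for c in node.children)
    node = rewrite(Plan(node.name, children))
    if node.name == "Exchange":
        # Register the Exchange only now, in its final form.
        first = seen.setdefault(node, node)
        if first is not node:
            return Reused(first)
    return node


def rewrite(n: Plan) -> Plan:
    # Stand-in for the rewrite that used to happen in a later phase.
    return Plan("NewChildSubtree") if n.name == "ChildSubtree" else n


plan = Plan("SortMergeJoin", (
    Plan("Exchange", (Plan("ChildSubtree"),)),
    Plan("Exchange", (Plan("ChildSubtree"),)),
))
result = reuse_in_one_pass(plan, rewrite, {})
left, right = result.children
print(left)
print(right.target is left)
```

In this model the reuse marker on the right side of the join refers to the very node that ends up on the left side, so nothing can rewrite the target after the reference is taken.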
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added unit tests.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28881:
URL: https://github.com/apache/spark/pull/28881#issuecomment-647239682


   Closing as a dup




[GitHub] [spark] AmplabJenkins commented on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28881:
URL: https://github.com/apache/spark/pull/28881#issuecomment-647082454


   Can one of the admins verify this patch?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28881:
URL: https://github.com/apache/spark/pull/28881#issuecomment-647082454


   Can one of the admins verify this patch?




[GitHub] [spark] peter-toth commented on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

Posted by GitBox <gi...@apache.org>.
peter-toth commented on pull request #28881:
URL: https://github.com/apache/spark/pull/28881#issuecomment-647102573


   @prakharjain09, it seems we both opened PRs (https://github.com/apache/spark/pull/28885 is mine) to fix the issue with exchange and subquery reuse. It looks like we came to the same conclusion that the separate reuse rules need to be unified. My PR does a bit more than that and does the combined reuse in a slightly different way than yours. I also see that you opened the ticket SPARK-32041 for the issue. If you don't mind, I would like to add that ticket to my PR as well.




[GitHub] [spark] HyukjinKwon closed pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #28881:
URL: https://github.com/apache/spark/pull/28881


   




[GitHub] [spark] prakharjain09 commented on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

Posted by GitBox <gi...@apache.org>.
prakharjain09 commented on pull request #28881:
URL: https://github.com/apache/spark/pull/28881#issuecomment-647361298


   @peter-toth sure. Let's collaborate on #28885 to fix SPARK-32041/SPARK-28940.




[GitHub] [spark] AmplabJenkins commented on pull request #28881: [SPARK-32041][SQL] Fix Exchange reuse issues when subqueries are involved

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28881:
URL: https://github.com/apache/spark/pull/28881#issuecomment-647082517


   Can one of the admins verify this patch?

