Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/09 12:15:01 UTC

[GitHub] [spark] mcdull-zhang opened a new pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

mcdull-zhang opened a new pull request #34908:
URL: https://github.com/apache/spark/pull/34908


   ### What changes were proposed in this pull request?
   
   Handle data skew separately for each child of a `Union`.
   
   
   ### Why are the changes needed?
   The `OptimizeSkewedJoin` rule takes effect only when the plan contains two `ShuffleQueryStageExec` nodes.
   
   A `Union` can break this assumption. For example, consider the following plans:
   
   <b>Scenario 1</b>
   ```
   Union
       SMJ
           ShuffleQueryStage
           ShuffleQueryStage
       SMJ
           ShuffleQueryStage
           ShuffleQueryStage
   ```
   
   <b>Scenario 2</b>
   ```
   Union
       SMJ
           ShuffleQueryStage
           ShuffleQueryStage
       HashAggregate
   ```
   When the input of one or more SMJs (sort-merge joins) in the plans above is skewed, the skew currently cannot be handled.
   
   It would be better to support partial optimization under `Union`, so that each skewed join is optimized on its own.
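
A toy model may make the two scenarios concrete. The sketch below is not Spark code: `Plan`, `count_shuffle_stages`, `eligible_whole_plan`, and `eligible_per_union_child` are hypothetical names, the check is read as "exactly two shuffle stages", and the `HashAggregate` branch is assumed to sit on its own shuffle stage, which the plan above does not show.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """A toy physical-plan node: an operator name plus its children."""
    name: str
    children: List["Plan"] = field(default_factory=list)


def count_shuffle_stages(plan: Plan) -> int:
    """Count ShuffleQueryStage nodes in the subtree rooted at `plan`."""
    own = 1 if plan.name == "ShuffleQueryStage" else 0
    return own + sum(count_shuffle_stages(c) for c in plan.children)


def eligible_whole_plan(plan: Plan) -> bool:
    """Toy version of the current check: exactly two shuffle stages overall."""
    return count_shuffle_stages(plan) == 2


def eligible_per_union_child(plan: Plan) -> List[bool]:
    """Toy version of the proposal: run the same check on each Union child."""
    if plan.name != "Union":
        return [eligible_whole_plan(plan)]
    return [eligible_whole_plan(child) for child in plan.children]


def smj() -> Plan:
    """A sort-merge join over two shuffle stages, as in both scenarios."""
    return Plan("SMJ", [Plan("ShuffleQueryStage"), Plan("ShuffleQueryStage")])


scenario1 = Plan("Union", [smj(), smj()])

# Assumption: the aggregate reads from one shuffle stage of its own.
aggregate = Plan("HashAggregate", [Plan("ShuffleQueryStage")])
scenario2 = Plan("Union", [smj(), aggregate])

print(eligible_whole_plan(scenario1))       # False (4 shuffle stages)
print(eligible_per_union_child(scenario1))  # [True, True]
print(eligible_whole_plan(scenario2))       # False (3 shuffle stages)
print(eligible_per_union_child(scenario2))  # [True, False]
```

In this model the whole-plan check rejects both scenarios, while the per-child check still accepts every SMJ branch, which is the partial optimization asked for here.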
   
   ### Does this PR introduce any user-facing change?
   
   Probably yes: the partitioning of the result might change.
   
   ### How was this patch tested?
   
   Added a test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34908:
URL: https://github.com/apache/spark/pull/34908#issuecomment-1002609125


   I think https://github.com/apache/spark/pull/34974 can handle this case.




[GitHub] [spark] mcdull-zhang commented on pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
mcdull-zhang commented on pull request #34908:
URL: https://github.com/apache/spark/pull/34908#issuecomment-995478908


   cc @cloud-fan 




[GitHub] [spark] mcdull-zhang closed pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
mcdull-zhang closed pull request #34908:
URL: https://github.com/apache/spark/pull/34908


   




[GitHub] [spark] mcdull-zhang commented on pull request #34908: [SPARK-37652][SQL]Add test for optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
mcdull-zhang commented on pull request #34908:
URL: https://github.com/apache/spark/pull/34908#issuecomment-1033701050


   @ulysses-you @cloud-fan Please take a look: does the test code make sense? If not, I'll close the PR.




[GitHub] [spark] cloud-fan commented on pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34908:
URL: https://github.com/apache/spark/pull/34908#issuecomment-1002612432


   Yea, I think this is already supported.




[GitHub] [spark] cloud-fan closed pull request #34908: [SPARK-37652][SQL]Add test for optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #34908:
URL: https://github.com/apache/spark/pull/34908


   




[GitHub] [spark] AmplabJenkins commented on pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34908:
URL: https://github.com/apache/spark/pull/34908#issuecomment-994784857


   Can one of the admins verify this patch?




[GitHub] [spark] cloud-fan commented on pull request #34908: [SPARK-37652][SQL]Add test for optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34908:
URL: https://github.com/apache/spark/pull/34908#issuecomment-1033785068


   thanks, merging to master!




[GitHub] [spark] ulysses-you commented on pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on pull request #34908:
URL: https://github.com/apache/spark/pull/34908#issuecomment-1002835758


   Although this is already supported, I think it's still good to add some tests. @mcdull-zhang can you rebase this PR so that it contains only the test?

