You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/26 03:33:14 UTC

[GitHub] [spark] windpiger opened a new pull request #24215: [SPARK-27229][SQL] GroupBy Placement in Intersect Distinct

windpiger opened a new pull request #24215: [SPARK-27229][SQL] GroupBy Placement in Intersect Distinct
URL: https://github.com/apache/spark/pull/24215
 
 
   ## What changes were proposed in this pull request?
   
   Intersect  operator will be replace by Left Semi Join in Optimizer.
   
   for example:
   ```
   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
   ```
   
   if Tabe1 and Tab2 are too large, the join will be very slow, we can reduce the table data before
   Join by place groupby operator under join, that is 
   ```
   ==>  
   SELECT a1, a2 FROM 
      (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X
      LEFT SEMI JOIN 
      (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y
   ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
   ```
   
   then we can have smaller table data when execute join, because  group by has cut lots of 
    data.
    
   ## How was this patch tested?
   uinit test added
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org