You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/09 16:29:00 UTC

[GitHub] [arrow-datafusion] mingmwang commented on issue #4140: Support more expressions in equality join

mingmwang commented on issue #4140:
URL: https://github.com/apache/arrow-datafusion/issues/4140#issuecomment-1309020499

   I can take a look at the issue. 
   My current preference is to keep the equal join condition as columns, but add another projection to project the expression to normal columns and push down the projection.
   
   The major reason is 
   1) Complex equals join conditions will cause duplicate calculation:
   `A inner join B on (udf1(A.a) = udf2(B.b))`
   
   The first time calculation happens when during the `Repartition`, `Repartition expr `will take the exprs now and need to evaluate.
   The second time calculation happens when building the hashtable or comparing.
   
   2) Some guys might write wrong expressions as join conditions
   `A inner join B on (A.a = B.b = B.c)`
   This expression is actually a valid expression and can be evaluated, but could cause wrong join result, similar to cross join.
   
   ```
   A.a = B.b -- result of evaluation is bool type
   (A.a = B.b) = cast(B.c as bool)
   ```
   
   Maybe we can also take a complex approach and check the expression's complexity 
   1) if the expressions are trival cast, support them as equal join conditions
   
   logical plan
   
   ```
   A inner join B on (Cast(A.a) = Cast(B.b))
      TableScanA
      TableScanB
   ```
   
   physical plan
   
   ```
   A inner join B on (Cast(A.a) = Cast(B.b))
      Repartition(Cast(A.a))
          TableScanA
      Repartition(Cast(B.b))
          TableScanB
   ```
   2)  if the expressions are complex, add projections
   
   logical plan
   ```
   A inner join B on (udf1(A.a) = udf2(B.b + xxxx))
      TableScanA
      TableScanB
   ```
   
   physical plan
   ```
   A inner join B on (col1 = col2)
      Repartition(col1)
      Projection(udf1(A.a) as col1, A.*)
          TableScanA
      Repartition(col2)
      Projection(udf2(B.b + xxxx) as col2, B.*)
          TableScanB
   ```
   
   3)  if the expressions are suspicious, return error in logical plan
   return Error.
   `A inner join B on (A.a = B.b = B.c)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org