You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/26 10:35:34 UTC

[GitHub] [arrow-datafusion] yjshen opened a new issue, #2509: Support optional filter in SortMergeJoin

yjshen opened a new issue, #2509:
URL: https://github.com/apache/arrow-datafusion/issues/2509

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   It would be necessary to support filters in the join operator, instead of a join operator followed by a filter, the necessity comes from two folds:
   
   1. semi-join that columns from one table would be removed after the join, which would make a filter that references both sides impossible. 
   2. filter out records earlier to reduce the need for constructing more records(batches).
   
   **Describe the solution you'd like**
   1. `pub type FilterOn = (Vec<Column>, Vec<Column>, Arc<dyn PhysicalExpr>)`; and `Option<FilterOn>` for JoinExec.
   2. generate record batch using arrays in the filter expr, but rebind the expr to point to the right columns of the newly generated batch.
   
   **Describe alternatives you've considered**
   1. `pub type FilterOn = Vec<(Column, Column, datafusion_expr::Operator)>;`
   2. Normalize each filter into two sides of a binary op.  like: `t1.a + t2.b > 100` to `t1.a > 100 - t2.b`. evaluates `a` , `100-b` separately as two columns and apply binary expr calculation logic.
   
   But the approach would be quite limited since it greatly limits the expressions that could be used in a join filter.
   
   **Additional context**
   
   Consider Part of TPC-DS query-95's SparkSQL plan as an example:
   ```
   +- SortMergeJoin [ws_order_number#251], [ws_order_number#285], Inner, NOT (ws_warehouse_sk#249 = ws_warehouse_sk#283)
      :- Sort [ws_order_number#251 ASC NULLS FIRST], false, 0
      :  +- CustomShuffleReader coalesced
      :     +- ShuffleQueryStage 4
      :        +- ReusedExchange [ws_warehouse_sk#249, ws_order_number#251], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]
      +- Sort [ws_order_number#285 ASC NULLS FIRST], false, 0
         +- CustomShuffleReader coalesced
            +- ShuffleQueryStage 5
               +- ReusedExchange [ws_warehouse_sk#283, ws_order_number#285], ArrowShuffleExchange hashpartitioning(ws_order_number#125, 200), true, [id=#226]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2509: Support optional filter in SortMergeJoin

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2509:
URL: https://github.com/apache/arrow-datafusion/issues/2509#issuecomment-1139077728

   @yjshen  do you mind if I close this ticket and reopen another describing the support needed for Sort-merge? I think it might be clearer to a future reader that we just needed to extend the support in HashJoin to MergeJoin whereas the description of this ticket now may confuse people as it talks about differing implementation possibilities


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on issue #2509: Support optional filter in Join

Posted by GitBox <gi...@apache.org>.
yjshen commented on issue #2509:
URL: https://github.com/apache/arrow-datafusion/issues/2509#issuecomment-1123417465

   A related issue #2496.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on issue #2509: Support optional filter in SortMergeJoin

Posted by GitBox <gi...@apache.org>.
yjshen commented on issue #2509:
URL: https://github.com/apache/arrow-datafusion/issues/2509#issuecomment-1138394615

   Reopened and renamed to track sort-merge join filter as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen closed issue #2509: Support optional filter in SortMergeJoin

Posted by GitBox <gi...@apache.org>.
yjshen closed issue #2509: Support optional filter in SortMergeJoin
URL: https://github.com/apache/arrow-datafusion/issues/2509


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #2509: Support optional filter in Join

Posted by GitBox <gi...@apache.org>.
alamb closed issue #2509: Support optional filter in Join
URL: https://github.com/apache/arrow-datafusion/issues/2509


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on issue #2509: Support optional filter in SortMergeJoin

Posted by GitBox <gi...@apache.org>.
yjshen commented on issue #2509:
URL: https://github.com/apache/arrow-datafusion/issues/2509#issuecomment-1139168150

   Get it; I will open a new issue instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org