Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/18 20:53:21 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue, #2573: Improve UX for `UNION` vs `UNION ALL`

andygrove opened a new issue, #2573:
URL: https://github.com/apache/arrow-datafusion/issues/2573

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   The current support for `UNION` vs `UNION ALL` is confusing to me. 
   
   - The logical plan has only the `Union` operator, and this is currently assumed to represent `UNION ALL`.
   - The SQL planner has logic to wrap a union with an aggregate query. That seems like something we would want in the physical plan but not in the logical plan.
   - The DataFrame API does not have a function for performing a regular `UNION`. It is possible to manually wrap a union in a distinct to achieve this, but that might not be obvious to users. Also, the documentation for `distinct` is incorrect: it says that the method performs a `union`.
   
   Here is our logical plan for a UNION query:
   
   ```
   ❯ explain select * from foo union select * from foo;
   +---------------+----------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                           |
   +---------------+----------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #a, #b                                                                                             |
   |               |   Aggregate: groupBy=[[#a, #b]], aggr=[[]]                                                                     |
   |               |     Union                                                                                                      |
   |               |       Projection: #foo.a, #foo.b                                                                               |
   |               |         TableScan: foo projection=Some([0, 1])                                                                 |
   |               |       Projection: #foo.a, #foo.b                                                                               |
   |               |         TableScan: foo projection=Some([0, 1])                                                                 |
   +---------------+----------------------------------------------------------------------------------------------------------------+
   ```
   
   **Describe the solution you'd like**
   I think what we want is:
   - Logical plan should represent `UNION` vs `UNION ALL`
   - The DataFrame API should have functions for both `union` (existing method representing `UNION ALL`) and `union_distinct`
   - The physical planner (or maybe an optimization rule) should translate a `UNION` to an aggregate query
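   
   As a rough illustration of the three points above, here is a toy model of how a `Distinct` node could wrap `Union` and be lowered to a group-by later. Every type and function name in it (`LogicalPlan`, `union_distinct`, `lower_distinct`, `scan_foo`) is hypothetical and heavily simplified; none of this is DataFusion's real API.
   
   ```rust
   // Toy model only: hypothetical types, NOT DataFusion's real API.
   #[derive(Debug, Clone, PartialEq)]
   enum LogicalPlan {
       /// Scan of a named table producing the listed columns.
       TableScan { table: String, columns: Vec<String> },
       /// `UNION ALL`: concatenates its inputs and keeps duplicates.
       Union { inputs: Vec<LogicalPlan> },
       /// Removes duplicate rows from its input.
       Distinct { input: Box<LogicalPlan> },
       /// Grouping with no aggregate expressions, i.e. a de-duplication.
       Aggregate { group_by: Vec<String>, input: Box<LogicalPlan> },
   }
   
   /// Output column names of a plan node.
   fn schema(plan: &LogicalPlan) -> Vec<String> {
       match plan {
           LogicalPlan::TableScan { columns, .. } => columns.clone(),
           // A union exposes the column names of its first input.
           LogicalPlan::Union { inputs } => inputs.first().map(schema).unwrap_or_default(),
           LogicalPlan::Distinct { input } => schema(input),
           LogicalPlan::Aggregate { group_by, .. } => group_by.clone(),
       }
   }
   
   /// `UNION` expressed as `Distinct(Union)`, so the intent stays visible in the plan.
   fn union_distinct(left: LogicalPlan, right: LogicalPlan) -> LogicalPlan {
       LogicalPlan::Distinct {
           input: Box::new(LogicalPlan::Union { inputs: vec![left, right] }),
       }
   }
   
   /// Late rewrite: replace every `Distinct` with a group-by over all input columns.
   fn lower_distinct(plan: LogicalPlan) -> LogicalPlan {
       match plan {
           LogicalPlan::Distinct { input } => {
               let input = lower_distinct(*input);
               LogicalPlan::Aggregate { group_by: schema(&input), input: Box::new(input) }
           }
           LogicalPlan::Union { inputs } => LogicalPlan::Union {
               inputs: inputs.into_iter().map(lower_distinct).collect(),
           },
           LogicalPlan::Aggregate { group_by, input } => LogicalPlan::Aggregate {
               group_by,
               input: Box::new(lower_distinct(*input)),
           },
           scan @ LogicalPlan::TableScan { .. } => scan,
       }
   }
   
   /// Builds a scan of table `foo` with columns `a` and `b`, as in the query above.
   fn scan_foo() -> LogicalPlan {
       LogicalPlan::TableScan {
           table: "foo".to_string(),
           columns: vec!["a".to_string(), "b".to_string()],
       }
   }
   ```
   
   In this sketch, `union_distinct(scan_foo(), scan_foo())` keeps the `UNION` intent readable as `Distinct(Union(...))`, and `lower_distinct` only introduces the group-by on `a, b` at a later stage, which is roughly the shape of the aggregate the SQL planner emits today.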
   
   **Describe alternatives you've considered**
   Leave things as they are and improve the documentation.
   
   **Additional context**
   For users of DataFusion for SQL query planning, it would be easier to map union/union all to other engines rather than trying to reverse engineer the aggregate query wrapping the union.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #2573: Improve UX for `UNION` vs `UNION ALL`

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2573:
URL: https://github.com/apache/arrow-datafusion/issues/2573#issuecomment-1131546072

   Given the use case is trying to map Union plans to what Spark does, using an explicit `Distinct` in the logical plan (that is converted into a `GroupBy` when doing physical planning) makes sense to me 👍
   
   Doing the UNION --> GroupBy transformation during SQL planning was probably a shortcut.



[GitHub] [arrow-datafusion] andygrove commented on issue #2573: Improve UX for `UNION` vs `UNION ALL`

Posted by GitBox <gi...@apache.org>.

andygrove commented on issue #2573:
URL: https://github.com/apache/arrow-datafusion/issues/2573#issuecomment-1130885358

   @alamb I would appreciate your insights here when you have time



[GitHub] [arrow-datafusion] andygrove closed issue #2573: Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct)

Posted by GitBox <gi...@apache.org>.

andygrove closed issue #2573: Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct)
URL: https://github.com/apache/arrow-datafusion/issues/2573



[GitHub] [arrow-datafusion] mrob95 commented on issue #2573: Improve UX for `UNION` vs `UNION ALL` (introduce a LogicalPlan::Distinct)

Posted by GitBox <gi...@apache.org>.

mrob95 commented on issue #2573:
URL: https://github.com/apache/arrow-datafusion/issues/2573#issuecomment-1165920205

   Will take a look at this

