You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "crepererum (via GitHub)" <gi...@apache.org> on 2023/04/14 09:41:19 UTC

[GitHub] [arrow-datafusion] crepererum opened a new pull request, #6009: fix: remove partition aware union logic

crepererum opened a new pull request, #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009

   # Which issue does this PR close?
   Fixes #5970.
   
   # Rationale for this change
   `UnionExec` shall be a simple node and not perform some optimizations on its own. If repartition / partition concatenation is desired, then there should be an optimizer pass for that. Being a simple node avoids confusion and optimizer bugs like #5970.
   
   # What changes are included in this PR?
   Remove the entire "partition aware" logic from `UnionExec`
   
   # Are these changes tested?
   1. Existing tests pass
   2. Test for #5970 passes.
   
   # Are there any user-facing changes?
   Less bugs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6009: fix: remove partition aware union logic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1509128673

   Update: 🤔  it seems from my experiment on  https://github.com/apache/arrow-datafusion/pull/6013 that we have no explain coverage (at all) for UnionExec with preserve partitioning.
   
   @mingmwang  do you know of tests / examples of it being used so that I can add coverage?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #6009: fix: remove partition aware union logic

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1508280788

   > > And I do not think it is bug.
   > 
   > I mean #5970 clearly shows that there is SOME bug in the DF code base and as illustrated there different parts of the code could be hold responsible for it. So while the technical bug is probably in the optimizer, the fact that `UnionExec` has its own internal optimizer and that people clearly don't know about it (also because it's not really documented) made me conclude that we should probably have a cleaner design instead of fixing more edge cases that are a consequence of the current design.
   
   I get your points.  Agree that this internal optimization logic inside the UnionExec is not visible to others, this is not good.  
   
   Basically, we need to consider the following requirements
   1.   Sometimes we would like the UnionExec to keep ordering
   2.  Sometimes we would like the UnionExec to keep partition
   3.  The physical plan should show or display the `UnionExec` is `partition-aware` or `ordering-aware` clearly.
   4. The physical optimizers rules(Enforce sort/enforce distribution) need to handle all the different cases correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #6009: fix: remove partition aware union logic

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1512796387

   @alamb 
   Originally, the UT `partition_aware_union` in dataframe.rs should cover the basic correctness and partition count test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #6009: fix: remove partition aware union logic

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1512838437

   Sorry that @5970 is my bug,  I wrote the first version of the Top-Down sort enforcement and forgot to handle the `partition-aware` union cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on pull request #6009: fix: remove partition aware union logic

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1508287316

   Good points. I esp. like the "keep ordering VS keep partition" semantics. How about we split `UnionExec` into two node types then? I think this would avoid a lot of confusion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #6009: fix: remove partition aware union logic

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1508251440

   Could you please add a flag to turn on/off the partition aware logic ?
   Originally I add the logic is because there is an open ticket for this.
   https://github.com/apache/arrow-datafusion/issues/189
   
   And in SparkSQL, it has similar optimization.
   
   For `UnionExec`, keep the ordering or keep the partitioning are conflict requirements. In some case we would like to keep
   the ordering info but in some other cases/systems(distributed systems), we want to keep the partitioning to avoid data shuffle.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #6009: fix: remove partition aware union logic

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1508296731

   > Good points. I esp. like the "keep ordering VS keep partition" semantics. How about we split `UnionExec` into two node types then? I think this would avoid a lot of confusion.
   
   At the very begin, I thought we could have two types of Union, a normal `UnionExec` and `PartitionAwareUnionExec`. We could have a physical rule to replace `UnionExec` to `PartitionAwareUnionExec` if capable, and a configure option to turn on/off the rule.  But because the partition and ordering info are not static, they are changed dynamically.  I not sure how this rule will interact with the enforcer rules. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #6009: fix: remove partition aware union logic

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1512834451

   @crepererum 
   Regarding the bug fix, if we still want to keep the functionality of `partition-aware` Union, you can fix the rule `SortPushDown` and do not push down the sort requirements for partition-aware Union.
   
   In the method `pushdown_requirement_to_children()`
   
   ```rust
   else if is_union(plan) {
           // UnionExec does not have real sort requirements for its input. Here we change the adjusted_request_ordering to UnionExec's output ordering and
           // propagate the sort requirements down to correct the unnecessary descendant SortExec under the UnionExec
           Ok(Some(vec![
               parent_required.map(|elem| elem.to_vec());
               plan.children().len()
           ]))
   }
   ```
   
   change to
   
   ```rust
   else if let Some(UnionExec {
           partition_aware: false, ..
       }) = plan.as_any().downcast_ref::<UnionExec>()
       {
           // UnionExec does not have real sort requirements for its input. Here we change the adjusted_request_ordering to UnionExec's output ordering and
           // propagate the sort requirements down to correct the unnecessary descendant SortExec under the UnionExec
           Ok(Some(vec![
               parent_required.map(|elem| elem.to_vec());
               plan.children().len()
           ]))
       }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on pull request #6009: fix: remove partition aware union logic

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1513272987

   Filed #6045 as a more advanced solution. I think this is the best of both worlds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on pull request #6009: fix: remove partition aware union logic

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1508263550

   > And I do not think it is bug.
   
   I mean #5970 clearly shows that there is SOME bug in the DF code base and as illustrated there different parts of the code could be hold responsible for it. So while the technical bug is probably in the optimizer, the fact that `UnionExec` has its own internal optimizer and that people clearly don't know about it (also because it's not really documented) made me conclude that we should probably have a cleaner design instead of fixing more edge cases that are a consequence of the current design.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on pull request #6009: fix: remove partition aware union logic

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1511046388

   This week I plan to extend this PR by the new node type (partition aware Union / interleave) and an optimizer pass that inserts it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6009: fix: remove partition aware union logic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1509099461

   I spent a non trivial time thinking about what a "PartitionAware" Union even means 
   
   It is entirely undocumented in 
    https://docs.rs/datafusion/22.0.0/datafusion/physical_plan/union/struct.UnionExec.html
   
   Thanks to @mingmwang and @crepererum 's comment on  https://github.com/apache/arrow-datafusion/issues/5970#issuecomment-1508165252
   
   Given the operation seems so different, if we are going to keep the partition aware union, I think we should use a different structure name. Maybe we could call what is currently named "UnionExec with preserve partitioning"  as `Interleave` --  that  would imply the data from the different partitions was kept segregated in their own partitions but interleaved in the output partition streams. 
   
   
   > The physical plan should show or display the UnionExec is partition-aware or ordering-aware clearly.
   
   I will try and make a PR to do this to make the current state of affairs easier to understand
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #6009: fix: remove partition aware union logic

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1508254108

   And I do not think it is bug.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on pull request #6009: fix: remove partition aware union logic

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on PR #6009:
URL: https://github.com/apache/arrow-datafusion/pull/6009#issuecomment-1508259182

   @mingmwang do you think we could change the optimizer to add a `RepartitionExec` or some "combine exec" for that special case w/o overloading `UnionExec`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed pull request #6009: fix: remove partition aware union logic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #6009: fix: remove partition aware union logic
URL: https://github.com/apache/arrow-datafusion/pull/6009


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org