Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/08 17:07:49 UTC

[GitHub] [arrow-datafusion] alamb commented on pull request #1776: Update `ExecutionPlan` to know about sortedness and repartitioning optimizer pass respect the invariants

alamb commented on pull request #1776:
URL: https://github.com/apache/arrow-datafusion/pull/1776#issuecomment-1032853053


   > However, I am a little bit uncertain about `output_ordering`. My understanding is it is present to allow repartitioning of branches with order-sensitive operators, such as limit, but no explicit order.
   
   I think that is correct. The specific case where `output_ordering` is currently needed for correctness is distinguishing between
   
   ```
   Limit
   Filter
   Scan
   ```
   
   And 
   ```
   Limit
   Sort
   Scan
   ```
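
   To make the distinction concrete, here is a minimal sketch of how a reported ordering could separate the two plans. It is a toy model, not DataFusion's actual `ExecutionPlan` API: the `Plan` enum, its `output_ordering` method, and `limit_input_can_be_repartitioned` are hypothetical names, and the rule that a filter keeps its input's order is an assumption made for illustration.

   ```rust
   // Toy model of a plan tree. Every name here is hypothetical and only
   // illustrates the idea; it is not DataFusion's real API.
   #[derive(Debug, Clone, PartialEq)]
   enum Plan {
       Scan,
       Filter(Box<Plan>),
       Sort { by: Vec<String>, input: Box<Plan> },
       Limit { n: usize, input: Box<Plan> },
   }

   impl Plan {
       /// Sort columns this node guarantees for its output, or `None`
       /// when it makes no ordering guarantee.
       fn output_ordering(&self) -> Option<&[String]> {
           match self {
               Plan::Scan => None,
               // Assumption: filtering and limiting keep the input's order.
               Plan::Filter(input) => input.output_ordering(),
               Plan::Limit { input, .. } => input.output_ordering(),
               Plan::Sort { by, .. } => Some(by.as_slice()),
           }
       }
   }

   /// The input of a limit can be repartitioned only when it declares no
   /// ordering, because then any `n` rows are an acceptable answer.
   fn limit_input_can_be_repartitioned(limit: &Plan) -> bool {
       match limit {
           Plan::Limit { input, .. } => input.output_ordering().is_none(),
           _ => false,
       }
   }
   ```

   Under these assumptions, `Limit` over `Filter` over `Scan` reports no ordering below the limit and could be repartitioned, while `Limit` over `Sort` over `Scan` reports one and must be left alone.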
   
   
   > I worry that this will lead to two classes of hard to track down bugs:
   > 
   > 1. ExecutionPlan that incorrectly report `None` for `output_ordering`
   > 2. Plans that make assumptions about ordering without encoding this into Datafusion
   
   Yes, I think these are indeed two classes of hard-to-track-down bugs that can and will occur if DataFusion starts optimizing based on sort orders (cc @NGA-TRAN). One might argue that we already have one example of such a bug in https://github.com/apache/arrow-datafusion/issues/423 😆. I will add some more comments to try to make this harder to forget.
   
   > I guess I just wonder if this is really worth the potential headaches :sweat_smile: 
   
   Well, the real question is what the alternative is 🤔 Some thoughts:
   1. Be conservative for operators like `Limit` and simply don't repartition or otherwise change their inputs
   2. Special-case `Limit` (for example, look for a `SortExec` or `SortPreservingMergeExec` anywhere below it)
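
   For illustration, the second alternative could be sketched roughly as below. This is a self-contained toy, not code from the PR: `Node`, `contains_sort`, and `may_repartition_below_limit` are hypothetical names, and matching operators by a name string is a simplification chosen only to keep the example short.

   ```rust
   // Hypothetical stand-in for a physical plan node; not DataFusion's API.
   struct Node {
       name: &'static str,
       children: Vec<Node>,
   }

   /// True if `node` or any of its descendants is a sort-like operator.
   fn contains_sort(node: &Node) -> bool {
       matches!(node.name, "SortExec" | "SortPreservingMergeExec")
           || node.children.iter().any(contains_sort)
   }

   /// Special case for a limit: allow repartitioning below it only when
   /// no sort-like operator appears anywhere in its subtree.
   fn may_repartition_below_limit(limit: &Node) -> bool {
       !limit.children.iter().any(contains_sort)
   }
   ```

   A check like this only recognizes the operators it lists by name, so an operator that produces sorted output without being named here would be missed, which is the second class of bug described above.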
   

