You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/25 09:09:20 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue #423: Repartition Optimisation Pass Breaks Sorting

tustvold opened a new issue #423:
URL: https://github.com/apache/arrow-datafusion/issues/423


   **Describe the bug**
   
   The Repartition optimisation pass fails to account for order sensitive operations, which causes it to repartition a sorted partition. As there isn't (yet) an order preserving merge operator (#362) this results in an incorrect final physical plan. 
   
   **To Reproduce**
   
   For example,
   
   ```
   LogicalPlanBuilder::scan_csv(&path, options, None)?
               .sort(vec![col("c4").sort(true, true)])?
               .filter(col("c2").eq(lit(1)))?
               .limit(10)?
               .build()?;
   ```
   
   Results in
   
   ```
   GlobalLimitExec {
       input: MergeExec {
           input: CoalesceBatchesExec {
               input: FilterExec {
                   predicate: BinaryExpr {
                       left: Column {
                           name: "c2",
                       },
                       op: Eq,
                       right: TryCastExpr {
                           expr: Literal {
                               value: Int32(1),
                           },
                           cast_type: Int64,
                       },
                   },
                   input: RepartitionExec {
                       input: SortExec {
                           input: CsvExec {
                               source: PartitionedFiles {
                                   path: "/home/raphael/repos/external/arrow-datafusion/testing/data/csv/aggregate_test_100.csv",
                                   filenames: [
                                       "/home/raphael/repos/external/arrow-datafusion/testing/data/csv/aggregate_test_100.csv",
                                   ],
                               },
                               ...
                           },
                           expr: [
                               PhysicalSortExpr {
                                   expr: Column {
                                       name: "c4",
                                   },
                                   options: SortOptions {
                                       descending: false,
                                       nulls_first: true,
                                   },
                               },
                           ],
                           ...
                       },
                       ...
                   },
               },
               target_batch_size: 4096,
           },
       },
       limit: 10,
   }
   ```
   
   This is incorrect, as the MergeExec does not preserve ordering and therefore the limit will return an arbitrary set of rows
   
   **Expected behavior**
   
   I would expect the Repartition pass to not introduce partitioning on paths with operators that are order sensitive. Potentially once there is an order preserving merge, this restriction could be relaxed provided the inserted merge is order preserving, but until then this optimisation is incorrect. 
   
   I will shortly be creating a ticket with some proposals on how we could make DataFusion sort order aware
   
   **Additional context**
   
   This issue was spawned out of investigation on https://github.com/apache/arrow-datafusion/pull/378#issuecomment-847158217
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #423: Repartition Optimisation Pass Breaks Sorting (wrong answer with limit)

Posted by GitBox <gi...@apache.org>.
alamb closed issue #423:
URL: https://github.com/apache/arrow-datafusion/issues/423


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on issue #423: Repartition Optimisation Pass Breaks Sorting (wrong answer with limit)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #423:
URL: https://github.com/apache/arrow-datafusion/issues/423#issuecomment-848027200


   @tustvold thanks for the detailed write up!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rdettai commented on issue #423: Repartition Optimisation Pass Breaks Sorting (wrong answer with limit)

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #423:
URL: https://github.com/apache/arrow-datafusion/issues/423#issuecomment-954584895


   Didn't look into this in details, but it looks very much like something that would require to dig up https://github.com/apache/arrow-datafusion/issues/92 to be solved 😃 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #423: Repartition Optimisation Pass Breaks Sorting (wrong answer with limit)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #423:
URL: https://github.com/apache/arrow-datafusion/issues/423#issuecomment-847982417


   This is a good find 🕵️  👍  @tustvold 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #423: Repartition Optimisation Pass Breaks Sorting (wrong answer with limit)

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #423:
URL: https://github.com/apache/arrow-datafusion/issues/423#issuecomment-1030133500


   I just hit a variation of this in IOx -- see https://github.com/influxdata/influxdb_iox/pull/3633#issuecomment-1030126757
   
   Conveniently @tustvold added some infrastructure to support this in https://github.com/apache/arrow-datafusion/pull/1732


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org