You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/24 16:23:32 UTC

[GitHub] [arrow] andygrove opened a new pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization

andygrove opened a new pull request #9007:
URL: https://github.com/apache/arrow/pull/9007


   Recurse down the plan when looking for number of rows when optimizing join order, so that we get the optimization even when the table scan is wrapped in a filter.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
   > Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (cfd65cf) into [master](https://codecov.io/gh/apache/arrow/commit/1ecef42bb9fb9e91f0fb04c7d5a1c3be58390025?el=desc) (1ecef42) will **increase** coverage by `0.00%`.
   > The diff coverage is `75.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9007   +/-   ##
   =======================================
     Coverage   82.65%   82.65%           
   =======================================
     Files         200      200           
     Lines       49795    49796    +1     
   =======================================
   + Hits        41159    41160    +1     
     Misses       8636     8636           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.55% <75.00%> (+0.45%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [ca685a0...cfd65cf](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750923525


   I think we need better estimates for how many rows are filtered in a `WHERE` clause (for which we need the column level statistics). 
   
   Currently, the reordering on TPC-H q12 is not applied, because the (very selective) `FILTER` which returns the unknown estimate.
   
   Did you run it with the order reversed / the optimization applied? I think it might actually be slower that way, because of the selective filter that can be pushed down (not 100% sure though)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Add documentation for code that determines number of rows per operator

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
   > Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (2bf200a) into [master](https://codecov.io/gh/apache/arrow/commit/cd22be6efedbf9832b5ea875ca59bb42de7b6c28?el=desc) (cd22be6) will **decrease** coverage by `0.00%`.
   > The diff coverage is `22.22%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9007      +/-   ##
   ==========================================
   - Coverage   82.55%   82.54%   -0.01%     
   ==========================================
     Files         203      203              
     Lines       50043    50049       +6     
   ==========================================
     Hits        41315    41315              
   - Misses       8728     8734       +6     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `54.73% <22.22%> (-3.70%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [cd22be6...2bf200a](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-751466654


   I think that definitely makes sense @alamb to test it like that, I think the second case (bogus stats) makes sense to start with, but to have a better look based on the TPCH queries makes sense too at some moment (I think we would need to support more queries first though that benefit from reordering).
   
   Currently we also only swap order `locally` between two nodes, at some point we would like to do have a more global optimization, making it even more important to have proper tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924715


   Running locally:
   
   Original order:
   
   ```
   Query 12 iteration 0 took 245.9 ms
   Query 12 iteration 1 took 250.5 ms
   Query 12 iteration 2 took 248.8 ms
   Query 12 iteration 3 took 244.2 ms
   Query 12 iteration 4 took 247.0 ms
   Query 12 iteration 5 took 246.1 ms
   Query 12 iteration 6 took 262.9 ms
   Query 12 iteration 7 took 284.0 ms
   Query 12 iteration 8 took 285.4 ms
   Query 12 iteration 9 took 287.5 ms
   Query 12 avg time: 260.24 ms
   ```
   
   
   Join order reversed:
   
   ```
   Query 12 iteration 0 took 1055.7 ms
   Query 12 iteration 1 took 1289.4 ms
   Query 12 iteration 2 took 1237.3 ms
   Query 12 iteration 3 took 1201.9 ms
   Query 12 iteration 4 took 1231.0 ms
   Query 12 iteration 5 took 1200.1 ms
   Query 12 iteration 6 took 1215.0 ms
   Query 12 iteration 7 took 1220.0 ms
   Query 12 iteration 8 took 1220.0 ms
   Query 12 iteration 9 took 1188.5 ms
   Query 12 avg time: 1205.90 ms
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #9007: ARROW-11029: [Rust] [DataFusion] Add documentation for code that determines number of rows per operator

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #9007:
URL: https://github.com/apache/arrow/pull/9007


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750921589


   @Dandandan Does this look ok to you?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750928580


   https://issues.apache.org/jira/browse/ARROW-11029


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
   > Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (516bf56) into [master](https://codecov.io/gh/apache/arrow/commit/2f5d59206f674ed1a1f5be6dbf809ba4d31b50da?el=desc) (2f5d592) will **decrease** coverage by `0.00%`.
   > The diff coverage is `66.66%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9007      +/-   ##
   ==========================================
   - Coverage   82.54%   82.53%   -0.01%     
   ==========================================
     Files         201      201              
     Lines       49983    49986       +3     
   ==========================================
   + Hits        41256    41257       +1     
   - Misses       8727     8729       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `58.69% <66.66%> (+0.26%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [2f5d592...516bf56](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#discussion_r549018865



##########
File path: rust/datafusion/src/optimizer/hash_build_probe_order.rs
##########
@@ -55,7 +53,25 @@ fn get_num_rows(logical_plan: &LogicalPlan) -> Option<usize> {
             let num_rows_input = get_num_rows(input);
             num_rows_input.map(|rows| std::cmp::min(*limit, rows))
         }
-        _ => None,
+        LogicalPlan::Aggregate { .. } => {
+            // we cannot yet predict how many rows will be produced by an aggregate because
+            // we do not know the cardinality of the grouping keys
+            None
+        }
+        LogicalPlan::Filter { .. } => {
+            // we cannot yet predict how many rows will be produced by a filter because
+            // we don't know how selective it is (how many rows it will filter out)
+            None
+        }
+        _ => {

Review comment:
       Yes, I think that would be better.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924386


   I originally looked at this because q12 never finishes for me now (at SF=100), but it still doesn't even with this so I think something else is going on here. I am pretty much out of time until Monday now though so will change this to a DRAFT.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-751466654


   I think that definitely makes sense @alamb to test it like that, I think the second case makes sense to start with, but to have a better look based on the TPCH queries makes sense too at some moment (I think we would need to support more queries first though that benefit from reordering).
   
   Currently we also only swap order `locally` between two nodes, at some point we would like to do have a more global optimization, making it even more important to have proper tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#discussion_r548630284



##########
File path: rust/datafusion/src/optimizer/hash_build_probe_order.rs
##########
@@ -55,7 +53,25 @@ fn get_num_rows(logical_plan: &LogicalPlan) -> Option<usize> {
             let num_rows_input = get_num_rows(input);
             num_rows_input.map(|rows| std::cmp::min(*limit, rows))
         }
-        _ => None,
+        LogicalPlan::Aggregate { .. } => {
+            // we cannot yet predict how many rows will be produced by an aggregate because
+            // we do not know the cardinality of the grouping keys
+            None
+        }
+        LogicalPlan::Filter { .. } => {
+            // we cannot yet predict how many rows will be produced by a filter because
+            // we don't know how selective it is (how many rows it will filter out)
+            None
+        }
+        _ => {

Review comment:
       Looks good. I wonder if it is better to explicitly match on all arms, so if we add a new LP alternative we have to think about the implications here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Add documentation for code that determines number of rows per operator

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
   > Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (516bf56) into [master](https://codecov.io/gh/apache/arrow/commit/2f6874158e3d94bd5eb31765c2550dce8d015c19?el=desc) (2f68741) will **decrease** coverage by `0.33%`.
   > The diff coverage is `66.66%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9007      +/-   ##
   ==========================================
   - Coverage   82.87%   82.53%   -0.34%     
   ==========================================
     Files         201      201              
     Lines       49737    49986     +249     
   ==========================================
   + Hits        41219    41257      +38     
   - Misses       8518     8729     +211     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `58.69% <66.66%> (+0.26%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2ZpbHRlci5ycw==) | `61.59% <0.00%> (-36.89%)` | :arrow_down: |
   | [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `78.94% <0.00%> (-5.27%)` | :arrow_down: |
   | [rust/arrow/src/array/transform/list.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2xpc3QucnM=) | `81.81% <0.00%> (-2.06%)` | :arrow_down: |
   | [rust/arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL21vZC5ycw==) | `87.18% <0.00%> (-1.02%)` | :arrow_down: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   | [rust/arrow/src/util/bit\_chunk\_iterator.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9iaXRfY2h1bmtfaXRlcmF0b3IucnM=) | `96.47% <0.00%> (-0.09%)` | :arrow_down: |
   | [rust/arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3V0aWxzLnJz) | `100.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/transform/primitive.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3ByaW1pdGl2ZS5ycw==) | `100.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/transform/variable\_size.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3ZhcmlhYmxlX3NpemUucnM=) | `100.00% <0.00%> (ø)` | |
   | ... and [2 more](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [1ddf721...7685990](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924957


   I do think we should add special handling for filter though ( to return None for now)  and recurse by default. For example, there is now a repartition operator that is not handled and this does not change the number of rows. I will update the PR with these changes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924064


   I think the rules lined out here should be reasonably easy to add here if we have the statistics:
   https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750923525


   I think we need better estimates for how many rows are filtered in a `WHERE` clause (for which we need the column level statistics). 
   
   Currently, the reordering on TPC-H q12 is not applied, because the (very selective) `Filter` which returns the unknown estimate.
   
   Did you run it with the order reversed / the optimization applied? I think it might actually be slower that way, because of the selective filter that can be pushed down (not 100% sure though)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org