You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/24 16:23:32 UTC
[GitHub] [arrow] andygrove opened a new pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization
andygrove opened a new pull request #9007:
URL: https://github.com/apache/arrow/pull/9007
Recurse down the plan when looking for number of rows when optimizing join order, so that we get the optimization even when the table scan is wrapped in a filter.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
> Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (cfd65cf) into [master](https://codecov.io/gh/apache/arrow/commit/1ecef42bb9fb9e91f0fb04c7d5a1c3be58390025?el=desc) (1ecef42) will **increase** coverage by `0.00%`.
> The diff coverage is `75.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9007 +/- ##
=======================================
Coverage 82.65% 82.65%
=======================================
Files 200 200
Lines 49795 49796 +1
=======================================
+ Hits 41159 41160 +1
Misses 8636 8636
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.55% <75.00%> (+0.45%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [ca685a0...cfd65cf](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750923525
I think we need better estimates for how many rows are filtered in a `WHERE` clause (for which we need the column level statistics).
Currently, the reordering on TPC-H q12 is not applied, because the (very selective) `FILTER` which returns the unknown estimate.
Did you run it with the order reversed / the optimization applied? I think it might actually be slower that way, because of the selective filter that can be pushed down (not 100% sure though)?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Add documentation for code that determines number of rows per operator
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
> Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (2bf200a) into [master](https://codecov.io/gh/apache/arrow/commit/cd22be6efedbf9832b5ea875ca59bb42de7b6c28?el=desc) (cd22be6) will **decrease** coverage by `0.00%`.
> The diff coverage is `22.22%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9007 +/- ##
==========================================
- Coverage 82.55% 82.54% -0.01%
==========================================
Files 203 203
Lines 50043 50049 +6
==========================================
Hits 41315 41315
- Misses 8728 8734 +6
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `54.73% <22.22%> (-3.70%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [cd22be6...2bf200a](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-751466654
I think that definitely makes sense @alamb to test it like that, I think the second case (bogus stats) makes sense to start with, but to have a better look based on the TPCH queries makes sense too at some moment (I think we would need to support more queries first though that benefit from reordering).
Currently we also only swap order `locally` between two nodes, at some point we would like to do have a more global optimization, making it even more important to have proper tests.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924715
Running locally:
Original order:
```
Query 12 iteration 0 took 245.9 ms
Query 12 iteration 1 took 250.5 ms
Query 12 iteration 2 took 248.8 ms
Query 12 iteration 3 took 244.2 ms
Query 12 iteration 4 took 247.0 ms
Query 12 iteration 5 took 246.1 ms
Query 12 iteration 6 took 262.9 ms
Query 12 iteration 7 took 284.0 ms
Query 12 iteration 8 took 285.4 ms
Query 12 iteration 9 took 287.5 ms
Query 12 avg time: 260.24 ms
```
Join order reversed:
```
Query 12 iteration 0 took 1055.7 ms
Query 12 iteration 1 took 1289.4 ms
Query 12 iteration 2 took 1237.3 ms
Query 12 iteration 3 took 1201.9 ms
Query 12 iteration 4 took 1231.0 ms
Query 12 iteration 5 took 1200.1 ms
Query 12 iteration 6 took 1215.0 ms
Query 12 iteration 7 took 1220.0 ms
Query 12 iteration 8 took 1220.0 ms
Query 12 iteration 9 took 1188.5 ms
Query 12 avg time: 1205.90 ms
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao closed pull request #9007: ARROW-11029: [Rust] [DataFusion] Add documentation for code that determines number of rows per operator
Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #9007:
URL: https://github.com/apache/arrow/pull/9007
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750921589
@Dandandan Does this look ok to you?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750928580
https://issues.apache.org/jira/browse/ARROW-11029
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
> Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (516bf56) into [master](https://codecov.io/gh/apache/arrow/commit/2f5d59206f674ed1a1f5be6dbf809ba4d31b50da?el=desc) (2f5d592) will **decrease** coverage by `0.00%`.
> The diff coverage is `66.66%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9007 +/- ##
==========================================
- Coverage 82.54% 82.53% -0.01%
==========================================
Files 201 201
Lines 49983 49986 +3
==========================================
+ Hits 41256 41257 +1
- Misses 8727 8729 +2
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `58.69% <66.66%> (+0.26%)` | :arrow_up: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [2f5d592...516bf56](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on a change in pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#discussion_r549018865
##########
File path: rust/datafusion/src/optimizer/hash_build_probe_order.rs
##########
@@ -55,7 +53,25 @@ fn get_num_rows(logical_plan: &LogicalPlan) -> Option<usize> {
let num_rows_input = get_num_rows(input);
num_rows_input.map(|rows| std::cmp::min(*limit, rows))
}
- _ => None,
+ LogicalPlan::Aggregate { .. } => {
+ // we cannot yet predict how many rows will be produced by an aggregate because
+ // we do not know the cardinality of the grouping keys
+ None
+ }
+ LogicalPlan::Filter { .. } => {
+ // we cannot yet predict how many rows will be produced by a filter because
+ // we don't know how selective it is (how many rows it will filter out)
+ None
+ }
+ _ => {
Review comment:
Yes, I think that would be better.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924386
I originally looked at this because q12 never finishes for me now (at SF=100), but it still doesn't even with this so I think something else is going on here. I am pretty much out of time until Monday now though so will change this to a DRAFT.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-751466654
I think that definitely makes sense @alamb to test it like that, I think the second case makes sense to start with, but to have a better look based on the TPCH queries makes sense too at some moment (I think we would need to support more queries first though that benefit from reordering).
Currently we also only swap order `locally` between two nodes, at some point we would like to do have a more global optimization, making it even more important to have proper tests.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#discussion_r548630284
##########
File path: rust/datafusion/src/optimizer/hash_build_probe_order.rs
##########
@@ -55,7 +53,25 @@ fn get_num_rows(logical_plan: &LogicalPlan) -> Option<usize> {
let num_rows_input = get_num_rows(input);
num_rows_input.map(|rows| std::cmp::min(*limit, rows))
}
- _ => None,
+ LogicalPlan::Aggregate { .. } => {
+ // we cannot yet predict how many rows will be produced by an aggregate because
+ // we do not know the cardinality of the grouping keys
+ None
+ }
+ LogicalPlan::Filter { .. } => {
+ // we cannot yet predict how many rows will be produced by a filter because
+ // we don't know how selective it is (how many rows it will filter out)
+ None
+ }
+ _ => {
Review comment:
Looks good. I wonder if it is better to explicitly match on all arms, so if we add a new LP alternative we have to think about the implications here.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Add documentation for code that determines number of rows per operator
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924264
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=h1) Report
> Merging [#9007](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=desc) (516bf56) into [master](https://codecov.io/gh/apache/arrow/commit/2f6874158e3d94bd5eb31765c2550dce8d015c19?el=desc) (2f68741) will **decrease** coverage by `0.33%`.
> The diff coverage is `66.66%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9007/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9007 +/- ##
==========================================
- Coverage 82.87% 82.53% -0.34%
==========================================
Files 201 201
Lines 49737 49986 +249
==========================================
+ Hits 41219 41257 +38
- Misses 8518 8729 +211
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `58.69% <66.66%> (+0.26%)` | :arrow_up: |
| [rust/arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2ZpbHRlci5ycw==) | `61.59% <0.00%> (-36.89%)` | :arrow_down: |
| [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `78.94% <0.00%> (-5.27%)` | :arrow_down: |
| [rust/arrow/src/array/transform/list.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2xpc3QucnM=) | `81.81% <0.00%> (-2.06%)` | :arrow_down: |
| [rust/arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL21vZC5ycw==) | `87.18% <0.00%> (-1.02%)` | :arrow_down: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
| [rust/arrow/src/util/bit\_chunk\_iterator.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9iaXRfY2h1bmtfaXRlcmF0b3IucnM=) | `96.47% <0.00%> (-0.09%)` | :arrow_down: |
| [rust/arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3V0aWxzLnJz) | `100.00% <0.00%> (ø)` | |
| [rust/arrow/src/array/transform/primitive.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3ByaW1pdGl2ZS5ycw==) | `100.00% <0.00%> (ø)` | |
| [rust/arrow/src/array/transform/variable\_size.rs](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3ZhcmlhYmxlX3NpemUucnM=) | `100.00% <0.00%> (ø)` | |
| ... and [2 more](https://codecov.io/gh/apache/arrow/pull/9007/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=footer). Last update [1ddf721...7685990](https://codecov.io/gh/apache/arrow/pull/9007?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization [WIP]
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924957
I do think we should add special handling for filter though ( to return None for now) and recurse by default. For example, there is now a repartition operator that is not handled and this does not change the number of rows. I will update the PR with these changes.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750924064
I think the rules lined out here should be reasonably easy to add here if we have the statistics:
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan edited a comment on pull request #9007: ARROW-11029: [Rust] [DataFusion] Improve join order optimization
Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9007:
URL: https://github.com/apache/arrow/pull/9007#issuecomment-750923525
I think we need better estimates for how many rows are filtered in a `WHERE` clause (for which we need the column level statistics).
Currently, the reordering on TPC-H q12 is not applied, because the (very selective) `Filter` which returns the unknown estimate.
Did you run it with the order reversed / the optimization applied? I think it might actually be slower that way, because of the selective filter that can be pushed down (not 100% sure though)?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org