Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/23 16:33:35 UTC
[GitHub] [arrow] Dandandan opened a new pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Dandandan opened a new pull request #8998:
URL: https://github.com/apache/arrow/pull/8998
This adds extra statistics on the number of nulls per column.
This is a step towards supporting more cost-based optimizations.
The second step is to add the number of distinct values and the min and max values. With those we can get a good estimate of filter selectivity, which supports more cases where optimizations such as join reordering could be applied.
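To illustrate how null count, distinct count, min and max could feed a selectivity estimate, here is a minimal sketch. It is not code from this PR: the `ColumnStats` struct, its field names, and both functions are hypothetical, and the formulas assume values are uniformly distributed and that NULLs never match a predicate.

```rust
/// Hypothetical per-column statistics. The field names are assumptions made
/// for this sketch, not DataFusion's actual definitions.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
    pub min: Option<f64>,
    pub max: Option<f64>,
}

/// Fraction of rows that are not NULL, or None if the null count is unknown.
fn non_null_fraction(num_rows: usize, stats: &ColumnStats) -> Option<f64> {
    let nulls = stats.null_count?;
    Some(1.0 - nulls as f64 / num_rows as f64)
}

/// Estimated fraction of rows matching `col = <constant>`, assuming the
/// non-null rows are spread evenly over the distinct values.
pub fn eq_selectivity(num_rows: usize, stats: &ColumnStats) -> Option<f64> {
    if num_rows == 0 {
        return Some(0.0);
    }
    let distinct = stats.distinct_count?;
    if distinct == 0 {
        return Some(0.0);
    }
    Some(non_null_fraction(num_rows, stats)? / distinct as f64)
}

/// Estimated fraction of rows matching `col < x`, assuming the non-null
/// values are spread evenly between `min` and `max`.
pub fn lt_selectivity(num_rows: usize, stats: &ColumnStats, x: f64) -> Option<f64> {
    if num_rows == 0 {
        return Some(0.0);
    }
    let min = stats.min?;
    let max = stats.max?;
    let fraction = if max <= min {
        // Degenerate range: every non-null row holds the same value.
        if x > min { 1.0 } else { 0.0 }
    } else {
        ((x - min) / (max - min)).clamp(0.0, 1.0)
    };
    Some(non_null_fraction(num_rows, stats)? * fraction)
}
```

Both functions return `None` when a required statistic is missing, so a caller could fall back to a default guess instead of treating an unknown as zero.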
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
> Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (eb08b86) into [master](https://codecov.io/gh/apache/arrow/commit/0519c4c0ecccd7d84ce44bd3a3e7bcb4fef8f4d6?el=desc) (0519c4c) will **increase** coverage by `0.02%`.
> The diff coverage is `92.59%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8998 +/- ##
==========================================
+ Coverage 82.64% 82.67% +0.02%
==========================================
Files 200 200
Lines 49730 49795 +65
==========================================
+ Hits 41098 41166 +68
+ Misses 8632 8629 -3
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
| [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `84.76% <92.30%> (+1.17%)` | :arrow_up: |
| [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.65% <100.00%> (+0.03%)` | :arrow_up: |
| [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `84.49% <0.00%> (+0.31%)` | :arrow_up: |
| [rust/arrow/src/util/test\_util.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90ZXN0X3V0aWwucnM=) | `90.90% <0.00%> (+15.90%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [0519c4c...9043d2e](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] codecov-io edited a comment on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
> Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (cff58b7) into [master](https://codecov.io/gh/apache/arrow/commit/0519c4c0ecccd7d84ce44bd3a3e7bcb4fef8f4d6?el=desc) (0519c4c) will **increase** coverage by `0.03%`.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8998 +/- ##
==========================================
+ Coverage 82.64% 82.67% +0.03%
==========================================
Files 200 200
Lines 49730 49795 +65
==========================================
+ Hits 41098 41169 +71
+ Misses 8632 8626 -6
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
| [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `86.09% <100.00%> (+2.49%)` | :arrow_up: |
| [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.65% <100.00%> (+0.03%)` | :arrow_up: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
| [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `84.49% <0.00%> (+0.31%)` | :arrow_up: |
| [rust/arrow/src/util/test\_util.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90ZXN0X3V0aWwucnM=) | `90.90% <0.00%> (+15.90%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [0519c4c...cff58b7](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] codecov-io edited a comment on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
> Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (9043d2e) into [master](https://codecov.io/gh/apache/arrow/commit/0519c4c0ecccd7d84ce44bd3a3e7bcb4fef8f4d6?el=desc) (0519c4c) will **increase** coverage by `0.03%`.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8998 +/- ##
==========================================
+ Coverage 82.64% 82.67% +0.03%
==========================================
Files 200 200
Lines 49730 49798 +68
==========================================
+ Hits 41098 41172 +74
+ Misses 8632 8626 -6
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
| [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `86.36% <100.00%> (+2.76%)` | :arrow_up: |
| [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.65% <100.00%> (+0.03%)` | :arrow_up: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
| [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `84.49% <0.00%> (+0.31%)` | :arrow_up: |
| [rust/arrow/src/util/test\_util.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90ZXN0X3V0aWwucnM=) | `90.90% <0.00%> (+15.90%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [0519c4c...9043d2e](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] codecov-io edited a comment on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
> Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (294a8b5) into [master](https://codecov.io/gh/apache/arrow/commit/1ecef42bb9fb9e91f0fb04c7d5a1c3be58390025?el=desc) (1ecef42) will **increase** coverage by `0.00%`.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8998 +/- ##
=======================================
Coverage 82.65% 82.66%
=======================================
Files 200 200
Lines 49795 49818 +23
=======================================
+ Hits 41159 41181 +22
- Misses 8636 8637 +1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
| [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
| [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
| [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `80.31% <ø> (ø)` | |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `86.09% <100.00%> (+2.49%)` | :arrow_up: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [ca685a0...294a8b5](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] github-actions[bot] commented on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750390545
https://issues.apache.org/jira/browse/ARROW-11018
[GitHub] [arrow] alamb closed pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8998:
URL: https://github.com/apache/arrow/pull/8998
[GitHub] [arrow] alamb commented on a change in pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#discussion_r549106965
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -33,6 +33,14 @@ pub struct Statistics {
pub num_rows: Option<usize>,
/// total byte of the table rows
pub total_byte_size: Option<usize>,
+ /// Statistics on a column level
+ pub column_statistics: Option<Vec<ColumnStatistics>>,
+}
+/// This table statistics are estimates about column
Review comment:
Eventually these statistics will probably be useful beyond just datasources (e.g. for a cost-based optimizer we would probably want the estimates attached to the output of every LogicalPlan node).
But this is a good start for now!
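As a rough illustration of the shape under review, the sketch below mirrors the three `Statistics` fields from the diff and shows one way a per-column null count could be accumulated for an in-memory table. Several parts are assumptions: the `null_count` field on `ColumnStatistics` is inferred from the PR title, the `Batch` alias of `Vec<Option<i64>>` columns is a stand-in for Arrow record batches, and `compute_statistics` is a hypothetical helper rather than the function added in this PR.

```rust
/// Column-level statistics. The `null_count` field is an assumption based on
/// the PR title; the actual struct body is not shown in the quoted diff.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
}

/// Table-level statistics, mirroring the fields visible in the diff above.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    pub column_statistics: Option<Vec<ColumnStatistics>>,
}

/// Stand-in for a record batch: one vector per column, all of equal length,
/// with `None` representing NULL.
pub type Batch = Vec<Vec<Option<i64>>>;

/// Hypothetical helper that sums row counts and per-column null counts
/// across all batches of an in-memory table.
pub fn compute_statistics(num_columns: usize, batches: &[Batch]) -> Statistics {
    let mut num_rows = 0;
    let mut null_counts = vec![0usize; num_columns];
    for batch in batches {
        // Every column in a batch has the same length, so the first suffices.
        num_rows += batch.first().map_or(0, |column| column.len());
        for (count, column) in null_counts.iter_mut().zip(batch) {
            *count += column.iter().filter(|value| value.is_none()).count();
        }
    }
    Statistics {
        num_rows: Some(num_rows),
        // Byte size is left unknown in this sketch.
        total_byte_size: None,
        column_statistics: Some(
            null_counts
                .into_iter()
                .map(|n| ColumnStatistics { null_count: Some(n) })
                .collect(),
        ),
    }
}
```

Keeping every field an `Option` lets a source that cannot cheaply compute a statistic report it as unknown rather than as zero.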
[GitHub] [arrow] andygrove commented on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750695985
Thanks @dandandan, this seems to make sense. I will have a closer look next week.