Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/23 16:33:35 UTC

[GitHub] [arrow] Dandandan opened a new pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Dandandan opened a new pull request #8998:
URL: https://github.com/apache/arrow/pull/8998


   This adds extra statistics on the number of nulls per column.
   
   This is a step towards supporting more cost-based optimizations.
   
   The second step is adding the number of distinct values and the min and max values. With those we can get a good estimate of the selectivity of filters, which supports more cases in which we could apply optimizations such as reordering joins.
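
To illustrate how a per-column null count can feed a selectivity estimate, here is a minimal sketch. The `Statistics` fields mirror the ones visible in the diff quoted later in this thread; the `null_count` field name and the `not_null_selectivity` helper are assumptions made for illustration and are not taken from the PR itself.

```rust
/// Per-column statistics. Only a null count is modelled here; the field
/// name `null_count` is an assumption for illustration.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
}

/// Table-level statistics, mirroring the fields visible in the PR diff.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    pub column_statistics: Option<Vec<ColumnStatistics>>,
}

/// Hypothetical helper: estimated fraction of rows that pass
/// `column IS NOT NULL`. Returns `None` when a required statistic is
/// missing, the column index is out of range, or the table is empty.
pub fn not_null_selectivity(stats: &Statistics, column: usize) -> Option<f64> {
    let rows = stats.num_rows?;
    if rows == 0 {
        return None;
    }
    let nulls = stats.column_statistics.as_ref()?.get(column)?.null_count?;
    // Clamp in case the null count is larger than the row estimate.
    Some(1.0 - (nulls.min(rows) as f64) / (rows as f64))
}
```

Every field is an `Option`, so a data source that cannot provide a statistic reports `None` and the estimate is simply unavailable rather than wrong.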


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
   > Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (eb08b86) into [master](https://codecov.io/gh/apache/arrow/commit/0519c4c0ecccd7d84ce44bd3a3e7bcb4fef8f4d6?el=desc) (0519c4c) will **increase** coverage by `0.02%`.
   > The diff coverage is `92.59%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8998      +/-   ##
   ==========================================
   + Coverage   82.64%   82.67%   +0.02%     
   ==========================================
     Files         200      200              
     Lines       49730    49795      +65     
   ==========================================
   + Hits        41098    41166      +68     
   + Misses       8632     8629       -3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `84.76% <92.30%> (+1.17%)` | :arrow_up: |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.65% <100.00%> (+0.03%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `84.49% <0.00%> (+0.31%)` | :arrow_up: |
   | [rust/arrow/src/util/test\_util.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90ZXN0X3V0aWwucnM=) | `90.90% <0.00%> (+15.90%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [0519c4c...9043d2e](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] codecov-io edited a comment on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
   > Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (cff58b7) into [master](https://codecov.io/gh/apache/arrow/commit/0519c4c0ecccd7d84ce44bd3a3e7bcb4fef8f4d6?el=desc) (0519c4c) will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8998      +/-   ##
   ==========================================
   + Coverage   82.64%   82.67%   +0.03%     
   ==========================================
     Files         200      200              
     Lines       49730    49795      +65     
   ==========================================
   + Hits        41098    41169      +71     
   + Misses       8632     8626       -6     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `86.09% <100.00%> (+2.49%)` | :arrow_up: |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.65% <100.00%> (+0.03%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `84.49% <0.00%> (+0.31%)` | :arrow_up: |
   | [rust/arrow/src/util/test\_util.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90ZXN0X3V0aWwucnM=) | `90.90% <0.00%> (+15.90%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [0519c4c...cff58b7](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] codecov-io edited a comment on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
   > Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (9043d2e) into [master](https://codecov.io/gh/apache/arrow/commit/0519c4c0ecccd7d84ce44bd3a3e7bcb4fef8f4d6?el=desc) (0519c4c) will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8998      +/-   ##
   ==========================================
   + Coverage   82.64%   82.67%   +0.03%     
   ==========================================
     Files         200      200              
     Lines       49730    49798      +68     
   ==========================================
   + Hits        41098    41172      +74     
   + Misses       8632     8626       -6     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `86.36% <100.00%> (+2.76%)` | :arrow_up: |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.65% <100.00%> (+0.03%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `84.49% <0.00%> (+0.31%)` | :arrow_up: |
   | [rust/arrow/src/util/test\_util.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90ZXN0X3V0aWwucnM=) | `90.90% <0.00%> (+15.90%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [0519c4c...9043d2e](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] codecov-io edited a comment on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750421499


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=h1) Report
   > Merging [#8998](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=desc) (294a8b5) into [master](https://codecov.io/gh/apache/arrow/commit/1ecef42bb9fb9e91f0fb04c7d5a1c3be58390025?el=desc) (1ecef42) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8998/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #8998   +/-   ##
   =======================================
     Coverage   82.65%   82.66%           
   =======================================
     Files         200      200           
     Lines       49795    49818   +23     
   =======================================
   + Hits        41159    41181   +22     
   - Misses       8636     8637    +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `70.58% <ø> (ø)` | |
   | [...datafusion/src/optimizer/hash\_build\_probe\_order.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvaGFzaF9idWlsZF9wcm9iZV9vcmRlci5ycw==) | `59.09% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `80.31% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `86.09% <100.00%> (+2.49%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/8998/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=footer). Last update [ca685a0...294a8b5](https://codecov.io/gh/apache/arrow/pull/8998?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] github-actions[bot] commented on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750390545


   https://issues.apache.org/jira/browse/ARROW-11018





[GitHub] [arrow] alamb closed pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8998:
URL: https://github.com/apache/arrow/pull/8998


   





[GitHub] [arrow] alamb commented on a change in pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#discussion_r549106965



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -33,6 +33,14 @@ pub struct Statistics {
     pub num_rows: Option<usize>,
     /// total byte of the table rows
     pub total_byte_size: Option<usize>,
+    /// Statistics on a column level
+    pub column_statistics: Option<Vec<ColumnStatistics>>,
+}
+/// This table statistics are estimates about column

Review comment:
       Eventually these statistics will probably be useful beyond just datasources (i.e. for a cost-based optimizer we would probably want the estimates to be attached to the output of all LogicalPlan nodes).
   
   But this is a good start for now!
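
As a rough sketch of what attaching estimates to the output of every plan node could look like, the following uses a toy `Plan` enum. It is not DataFusion's `LogicalPlan`; the variants, the `selectivity` field, and the `estimated_rows` function are all hypothetical and only show the idea of propagating a row estimate upward from the scan.

```rust
/// Toy plan type for illustration only; NOT DataFusion's LogicalPlan.
pub enum Plan {
    /// Leaf node: the row count comes from the data source, if known.
    Scan { num_rows: Option<usize> },
    /// Keeps an estimated fraction `selectivity` of the input rows.
    Filter { input: Box<Plan>, selectivity: f64 },
    /// Keeps at most `n` rows of the input.
    Limit { input: Box<Plan>, n: usize },
}

/// Estimated number of output rows for a node, derived from its inputs.
pub fn estimated_rows(plan: &Plan) -> Option<usize> {
    match plan {
        Plan::Scan { num_rows } => *num_rows,
        Plan::Filter { input, selectivity } => {
            // Without an input estimate there is nothing to scale.
            let rows = estimated_rows(input)? as f64;
            Some((rows * (*selectivity).clamp(0.0, 1.0)).round() as usize)
        }
        Plan::Limit { input, n } => Some(match estimated_rows(input) {
            Some(rows) => rows.min(*n),
            // Unknown input: `n` is still a valid upper bound.
            None => *n,
        }),
    }
}
```

With this shape, an optimizer rule can ask any node for its estimate without knowing whether the number originated in a data source or was derived from a child node.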







[GitHub] [arrow] andygrove commented on pull request #8998: ARROW-11018: [Rust][DataFusion] Add support for column-level statistics, null count.

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8998:
URL: https://github.com/apache/arrow/pull/8998#issuecomment-750695985


   Thanks @dandandan, this seems to make sense, and I will have a closer look next week.

