You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/16 20:47:33 UTC

[GitHub] [arrow] seddonm1 opened a new pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

seddonm1 opened a new pull request #8944:
URL: https://github.com/apache/arrow/pull/8944


   This is an implementation of Statistics for the Parquet datasource as this PR seems to have gone quiet: https://github.com/apache/arrow/pull/8860
   
   This PR exposes the ParquetMetaData produced by the file reader as even though initially we only consume `row_count` and `total_byte_size` there is column level metadata that may be very useful in future for potentially large performance gains.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#discussion_r544618924



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -41,10 +41,25 @@ impl ParquetTable {
     pub fn try_new(path: &str) -> Result<Self> {
         let parquet_exec = ParquetExec::try_new(path, None, 0)?;
         let schema = parquet_exec.schema();
+
+        let metadata = parquet_exec.metadata();
+        let (num_rows, total_byte_size) = metadata.row_groups().iter().fold(

Review comment:
       updated to two iterators




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#discussion_r544614359



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -41,10 +41,25 @@ impl ParquetTable {
     pub fn try_new(path: &str) -> Result<Self> {
         let parquet_exec = ParquetExec::try_new(path, None, 0)?;
         let schema = parquet_exec.schema();
+
+        let metadata = parquet_exec.metadata();
+        let (num_rows, total_byte_size) = metadata.row_groups().iter().fold(

Review comment:
       I think this would be better as two "loops", than it can use `.sum()` as well instead of fold?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#discussion_r544622788



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -41,10 +41,26 @@ impl ParquetTable {
     pub fn try_new(path: &str) -> Result<Self> {
         let parquet_exec = ParquetExec::try_new(path, None, 0)?;
         let schema = parquet_exec.schema();
+
+        let metadata = parquet_exec.metadata();
+        let num_rows = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.num_rows() as usize)
+            .sum();
+        let total_byte_size = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.total_byte_size() as usize)

Review comment:
       I think the as usize could move to below (`Some(num_rows as usize)`).
   
   Than it could use `.map(RowGroupMetaData::total_byte_size)` as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#discussion_r544626451



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -41,10 +41,26 @@ impl ParquetTable {
     pub fn try_new(path: &str) -> Result<Self> {
         let parquet_exec = ParquetExec::try_new(path, None, 0)?;
         let schema = parquet_exec.schema();
+
+        let metadata = parquet_exec.metadata();
+        let num_rows = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.num_rows() as usize)
+            .sum();
+        let total_byte_size = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.total_byte_size() as usize)

Review comment:
       I am not familiar with this `.map(RowGroupMetaData::total_byte_size)` convention. do you know what this is called so I can look up the Rust docs?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#discussion_r544622788



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -41,10 +41,26 @@ impl ParquetTable {
     pub fn try_new(path: &str) -> Result<Self> {
         let parquet_exec = ParquetExec::try_new(path, None, 0)?;
         let schema = parquet_exec.schema();
+
+        let metadata = parquet_exec.metadata();
+        let num_rows = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.num_rows() as usize)
+            .sum();
+        let total_byte_size = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.total_byte_size() as usize)

Review comment:
       I think the as usize could move to below (`Some(num_rows as usize)`).
   
   Then it could use `.map(RowGroupMetaData::total_byte_size)` as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#discussion_r544637333



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -41,10 +41,26 @@ impl ParquetTable {
     pub fn try_new(path: &str) -> Result<Self> {
         let parquet_exec = ParquetExec::try_new(path, None, 0)?;
         let schema = parquet_exec.schema();
+
+        let metadata = parquet_exec.metadata();
+        let num_rows = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.num_rows() as usize)
+            .sum();
+        let total_byte_size = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.total_byte_size() as usize)

Review comment:
       The way it is now is called "closure" and it could be replaced by a the name of the function instead. There is something on the clippy docs:
   https://rust-lang.github.io/rust-clippy/master/#redundant_closure




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#issuecomment-746994271


   https://issues.apache.org/jira/browse/ARROW-10783


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#discussion_r544646610



##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -41,10 +41,26 @@ impl ParquetTable {
     pub fn try_new(path: &str) -> Result<Self> {
         let parquet_exec = ParquetExec::try_new(path, None, 0)?;
         let schema = parquet_exec.schema();
+
+        let metadata = parquet_exec.metadata();
+        let num_rows = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.num_rows() as usize)
+            .sum();
+        let total_byte_size = metadata
+            .row_groups()
+            .iter()
+            .map(|rg| rg.total_byte_size() as usize)

Review comment:
       Thanks. I have updated the PR. I should have recognised this from Scala.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#issuecomment-747071235


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8944?src=pr&el=h1) Report
   > Merging [#8944](https://codecov.io/gh/apache/arrow/pull/8944?src=pr&el=desc) (a3bc7d1) into [master](https://codecov.io/gh/apache/arrow/commit/71e37e23e5a3aad396585df484625c2d0641840d?el=desc) (71e37e2) will **decrease** coverage by `0.00%`.
   > The diff coverage is `88.23%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8944/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8944?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8944      +/-   ##
   ==========================================
   - Coverage   83.26%   83.26%   -0.01%     
   ==========================================
     Files         195      196       +1     
     Lines       48066    48131      +65     
   ==========================================
   + Hits        40024    40078      +54     
   - Misses       8042     8053      +11     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8944?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/parquet/src/file/metadata.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9maWxlL21ldGFkYXRhLnJz) | `91.82% <ø> (+0.77%)` | :arrow_up: |
   | [rust/parquet/src/file/statistics.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9maWxlL3N0YXRpc3RpY3MucnM=) | `93.80% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.62% <80.00%> (+0.30%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `79.56% <100.00%> (+0.91%)` | :arrow_up: |
   | [rust/parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJvd19yZWFkZXIucnM=) | `90.69% <100.00%> (+0.10%)` | :arrow_up: |
   | [rust/datafusion/src/execution/context.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vY29udGV4dC5ycw==) | `89.95% <0.00%> (-1.93%)` | :arrow_down: |
   | [rust/datafusion/tests/custom\_sources.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL2N1c3RvbV9zb3VyY2VzLnJz) | `75.00% <0.00%> (ø)` | |
   | [rust/parquet/src/schema/types.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9zY2hlbWEvdHlwZXMucnM=) | `90.19% <0.00%> (+0.26%)` | :arrow_up: |
   | [rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=) | `93.28% <0.00%> (+0.85%)` | :arrow_up: |
   | ... and [2 more](https://codecov.io/gh/apache/arrow/pull/8944/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8944?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8944?src=pr&el=footer). Last update [6cedab0...a3bc7d1](https://codecov.io/gh/apache/arrow/pull/8944?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8944:
URL: https://github.com/apache/arrow/pull/8944


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8944: ARROW-10783: [Rust][DataFusion] Implement Statistics for Parquet TableProvider

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8944:
URL: https://github.com/apache/arrow/pull/8944#issuecomment-747773484


   FYI @XiaokunDing 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org