You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/07 12:55:05 UTC

[GitHub] [arrow] XiaokunDing opened a new pull request #8860: Implement row count statistics for Parquet TableProviderParquet statistics

XiaokunDing opened a new pull request #8860:
URL: https://github.com/apache/arrow/pull/8860


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-739908246


   https://issues.apache.org/jira/browse/ARROW-10783


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] XiaokunDing commented on a change in pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
XiaokunDing commented on a change in pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#discussion_r537972123



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -24,6 +24,15 @@ use crate::arrow::datatypes::SchemaRef;
 use crate::error::Result;
 use crate::physical_plan::ExecutionPlan;
 
+/// The table statistics
+#[derive(Clone, Debug)]
+pub struct Statistics {

Review comment:
       Thanks Mike, this is just a draft code for WIP,  when #8831 have been merged I'll recommit it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#discussion_r539374447



##########
File path: rust/datafusion/src/physical_plan/parquet.rs
##########
@@ -99,6 +100,38 @@ impl ParquetExec {
             batch_size,
         }
     }
+
+    /// Get the statistics from parquet file format
+    pub fn statistics(&self) -> Option<Statistics> {
+        let mut num_rows = 0;
+        let mut total_byte_size = 0;
+        for i in 0..self.filenames.len() {
+            let file = File::open(&self.filenames[i]).ok()?;
+            let file_reader = Arc::new(SerializedFileReader::new(file).ok()?);
+            num_rows += file_reader.metadata().file_metadata().num_rows() as i64;
+            for g in file_reader.metadata().row_groups().iter() {
+                total_byte_size += g.total_byte_size() as i64;
+            }
+        }
+
+        if num_rows >= i64::MAX {

Review comment:
       nit: you could use `std::cmp::min` and `std::cmp::max` instead of these `if` statements to find min/max values




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-742651345


   @XiaokunDing what about changing the statistics to be optional per statistic, but not the `statistics` itself?
   I was looking at adding it e.g. for a MemTable where it is easy to add a `num_rows` statistic, but (currently) not so trivial to add a `total_byte_size` implementation. Having the `num_rows` available could e.g. allow more efficient joins. 
   I guess this would be the same for other statistics that could be added later on.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-747774453


   FYI this PR was partially superceded by https://github.com/apache/arrow/pull/8944 -- unless @XiaokunDing  objects, I think we should close this PR. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-743992428


   @XiaokunDing do you plan updating this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] removed a comment on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] removed a comment on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-739903375


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-742653929


   So the proposal would be to change the statistics struct to 
   
   ```rust
   pub struct Statistics {
       /// The number of table rows
       pub num_rows: Option<usize>,
       /// total byte of the table rows
       pub total_byte_size: Option<usize>,
   }
   ```
   
   and the Statistics doesn't need to be wrapped as `Option<Statistics>` anymore.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-742665691


   Added a PR for that here: https://github.com/apache/arrow/pull/8889


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-748797127


   Closing as duplicate. Thanks @XiaokunDing for this initial implementation and for your interest in the Arrow project!
   
   Let us know if there was anything on our side that led you to not continue this work, so that we can also improve our process. If not, I will assume that you did not have the time, which is perfectly understandable :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] XiaokunDing closed pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
XiaokunDing closed pull request #8860:
URL: https://github.com/apache/arrow/pull/8860


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8860: Implement row count statistics for Parquet TableProviderParquet statistics

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-739903375


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#issuecomment-740452777


   Here is a link to the WIP i was doing: https://github.com/seddonm1/arrow/compare/master...seddonm1:parquet-statistics?expand=1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#discussion_r539672718



##########
File path: rust/datafusion/src/physical_plan/parquet.rs
##########
@@ -99,6 +100,38 @@ impl ParquetExec {
             batch_size,
         }
     }
+
+    /// Get the statistics from parquet file format
+    pub fn statistics(&self) -> Option<Statistics> {

Review comment:
       Sorry, for clarity, the ParquetReader already performs this code. You could expose the metadata produced in the first pass and then you don't need to reimplement this logic (which has to read all the files again).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8860:
URL: https://github.com/apache/arrow/pull/8860


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #8860: ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #8860:
URL: https://github.com/apache/arrow/pull/8860#discussion_r537793320



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -24,6 +24,15 @@ use crate::arrow::datatypes::SchemaRef;
 use crate::error::Result;
 use crate::physical_plan::ExecutionPlan;
 
+/// The table statistics
+#[derive(Clone, Debug)]
+pub struct Statistics {

Review comment:
       This is repeating what you have done in #8831

##########
File path: rust/datafusion/src/physical_plan/parquet.rs
##########
@@ -99,6 +100,38 @@ impl ParquetExec {
             batch_size,
         }
     }
+
+    /// Get the statistics from parquet file format
+    pub fn statistics(&self) -> Option<Statistics> {

Review comment:
       By doing this you are parsing the parquet file headers a second time when they have already been parsed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org