You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/02 09:16:31 UTC

[GitHub] [arrow] zeevm opened a new pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

zeevm opened a new pull request #7586:
URL: https://github.com/apache/arrow/pull/7586


   1. Calculate page and column statistics
   2. Use pre-calculated statistics when available to speed-up when writing data from other formats like ORC.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448757699



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       Are you specifically referring to 'page_distinct_count', or all 4 page level vars?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448814477



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       They're used here:
   
   flush_data_pages()
   make_typed_statistics()
   update_page_min_max()
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm closed pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm closed pull request #7586:
URL: https://github.com/apache/arrow/pull/7586


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #7586: Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-651810251


   Hi @zeevm, may you please kindly rebase (to fix the Rust failures) and open a JIRA for this PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm closed pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm closed pull request #7586:
URL: https://github.com/apache/arrow/pull/7586


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm removed a comment on pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm removed a comment on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-653040440


   How do I complete/merge the PR, do I need to request write access or will it be completed by a maintainer?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448745197



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -276,12 +372,60 @@ impl<T: DataType> ColumnWriterImpl<T> {
             &values[values_offset..],
             def_levels.map(|lv| &lv[levels_offset..]),
             rep_levels.map(|lv| &lv[levels_offset..]),
+            calculate_page_stats,
         )?;
 
         // Return total number of values processed.
         Ok(values_offset)
     }
 
+    /// Writes batch of values, definition levels and repetition levels.
+    /// Returns number of values processed (written).
+    ///
+    /// If definition and repetition levels are provided, we write fully those levels and
+    /// select how many values to write (this number will be returned), since number of
+    /// actual written values may be smaller than provided values.
+    ///
+    /// If only values are provided, then all values are written and the length of
+    /// of the values buffer is returned.
+    ///
+    /// Definition and/or repetition levels can be omitted, if values are
+    /// non-nullable and/or non-repeated.
+    pub fn write_batch(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+    ) -> Result<usize> {
+        self.write_batch_internal(
+            values, def_levels, rep_levels, &None, &None, None, None,
+        )
+    }
+
+    /// Writer may optionally provide pre-calculated statistics for this batch, in which case we do
+    /// not calculate page level statistics as this will defeat the purpose of speeding up the write
+    /// process with pre-calculated statistics.
+    pub fn write_batch_with_statistics(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+        min: &Option<T::T>,

Review comment:
       IIUC the format specifies that the various stats are optional so it seems reasonable to allow the caller to specify only some of the values isn't it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448764844



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       All of them are used, page_distinct_count isn't being calculated in this PR though, probably a following PR




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-653122116


   @zeevm once approved, a committer will help merge this. Seems the PR now is a little messed up, can you clean it up so I can merge it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448014374



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -387,15 +538,28 @@ impl<T: DataType> ColumnWriterImpl<T> {
             ));
         }
 
-        // TODO: update page statistics
+        if calculate_page_stats {
+            for val in &values[0..values_to_write] {
+                if self.min_page_value.is_none()

Review comment:
       maybe we should extract this into a util function `update_page_min_max` and similarly for column min/max value.

##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -276,12 +372,60 @@ impl<T: DataType> ColumnWriterImpl<T> {
             &values[values_offset..],
             def_levels.map(|lv| &lv[levels_offset..]),
             rep_levels.map(|lv| &lv[levels_offset..]),
+            calculate_page_stats,
         )?;
 
         // Return total number of values processed.
         Ok(values_offset)
     }
 
+    /// Writes batch of values, definition levels and repetition levels.
+    /// Returns number of values processed (written).
+    ///
+    /// If definition and repetition levels are provided, we write fully those levels and
+    /// select how many values to write (this number will be returned), since number of
+    /// actual written values may be smaller than provided values.
+    ///
+    /// If only values are provided, then all values are written and the length of
+    /// of the values buffer is returned.
+    ///
+    /// Definition and/or repetition levels can be omitted, if values are
+    /// non-nullable and/or non-repeated.
+    pub fn write_batch(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+    ) -> Result<usize> {
+        self.write_batch_internal(
+            values, def_levels, rep_levels, &None, &None, None, None,
+        )
+    }
+
+    /// Writer may optionally provide pre-calculated statistics for this batch, in which case we do
+    /// not calculate page level statistics as this will defeat the purpose of speeding up the write
+    /// process with pre-calculated statistics.
+    pub fn write_batch_with_statistics(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+        min: &Option<T::T>,

Review comment:
       Is it possible that among these 4 parameters, ppl only provide, say, `nulls_count` but leave the rest as `None`? will this result to partial stats and yield to issues when compute engines want to rely on them? If so do we want to enforce that either all of these 4 are `None` OR all of these are `Some`?

##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       Seems this is not used at all?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448870420



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       How do I merge? Do I need to ask for merge/write permissions to complete the PR?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448870420



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       How do I merge? Do I need to ask for merge/write permissions to complete the PR?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448842308



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       Yes I meant `page_distinct_count`. It is fine to do this in a follow upo.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448745612



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       These are used to track page level stats and write those stats when writing a page and to update the column level stats when writing the page.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-651852823


   https://issues.apache.org/jira/browse/ARROW-9280


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7586: Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-651495743


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448755889



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       Yes. But it is not updated nor used at all. Can you double check?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeevm commented on pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

Posted by GitBox <gi...@apache.org>.
zeevm commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-653040440


   How do I complete/merge the PR, do I need to request write access or will it be completed by a maintainer?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org