You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/30 22:29:10 UTC

[GitHub] [arrow] sunchao commented on a change in pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

sunchao commented on a change in pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#discussion_r448014374



##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -387,15 +538,28 @@ impl<T: DataType> ColumnWriterImpl<T> {
             ));
         }
 
-        // TODO: update page statistics
+        if calculate_page_stats {
+            for val in &values[0..values_to_write] {
+                if self.min_page_value.is_none()

Review comment:
       maybe we should extract this into a util function `update_page_min_max` and similarly for column min/max value.

##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -276,12 +372,60 @@ impl<T: DataType> ColumnWriterImpl<T> {
             &values[values_offset..],
             def_levels.map(|lv| &lv[levels_offset..]),
             rep_levels.map(|lv| &lv[levels_offset..]),
+            calculate_page_stats,
         )?;
 
         // Return total number of values processed.
         Ok(values_offset)
     }
 
+    /// Writes batch of values, definition levels and repetition levels.
+    /// Returns number of values processed (written).
+    ///
+    /// If definition and repetition levels are provided, we write fully those levels and
+    /// select how many values to write (this number will be returned), since number of
+    /// actual written values may be smaller than provided values.
+    ///
+    /// If only values are provided, then all values are written and the length of
+    /// of the values buffer is returned.
+    ///
+    /// Definition and/or repetition levels can be omitted, if values are
+    /// non-nullable and/or non-repeated.
+    pub fn write_batch(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+    ) -> Result<usize> {
+        self.write_batch_internal(
+            values, def_levels, rep_levels, &None, &None, None, None,
+        )
+    }
+
+    /// Writer may optionally provide pre-calculated statistics for this batch, in which case we do
+    /// not calculate page level statistics as this will defeat the purpose of speeding up the write
+    /// process with pre-calculated statistics.
+    pub fn write_batch_with_statistics(
+        &mut self,
+        values: &[T::T],
+        def_levels: Option<&[i16]>,
+        rep_levels: Option<&[i16]>,
+        min: &Option<T::T>,

Review comment:
       Is it possible that among these 4 parameters, ppl only provide, say, `nulls_count` but leave the rest as `None`? will this result to partial stats and yield to issues when compute engines want to rely on them? If so do we want to enforce that either all of these 4 are `None` OR all of these are `Some`?

##########
File path: rust/parquet/src/column/writer.rs
##########
@@ -216,26 +278,26 @@ impl<T: DataType> ColumnWriterImpl<T> {
             def_levels_sink: vec![],
             rep_levels_sink: vec![],
             data_pages: VecDeque::new(),
+            min_page_value: None,
+            max_page_value: None,
+            num_page_nulls: 0,
+            page_distinct_count: None,

Review comment:
       Seems this is not used at all?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org