You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/08 07:33:26 UTC

[GitHub] [arrow-rs] crepererum commented on a diff in pull request #2022: Fix several bugs in parquet writer statistics generation, add `EnabledStatistics` to control level of statistics generated

crepererum commented on code in PR #2022:
URL: https://github.com/apache/arrow-rs/pull/2022#discussion_r916542590


##########
parquet/src/column/writer.rs:
##########
@@ -1029,59 +1025,58 @@ impl<'a, T: DataType> ColumnWriterImpl<'a, T> {
         }
     }
 
-    #[allow(clippy::eq_op)]
     fn update_page_min_max(&mut self, val: &T::T) {
+        Self::update_min(&self.descr, val, &mut self.min_page_value);
+        Self::update_max(&self.descr, val, &mut self.max_page_value);
+    }
+
+    fn update_column_min_max(&mut self) {
+        let min = self.min_page_value.as_ref().unwrap();
+        Self::update_min(&self.descr, min, &mut self.min_column_value);
+
+        let max = self.max_page_value.as_ref().unwrap();
+        Self::update_max(&self.descr, max, &mut self.max_column_value);
+    }
+
+    fn update_min(descr: &ColumnDescriptor, val: &T::T, min: &mut Option<T::T>) {
+        Self::update_stat(val, min, |cur| Self::compare_greater(descr, cur, val))
+    }
+
+    fn update_max(descr: &ColumnDescriptor, val: &T::T, max: &mut Option<T::T>) {
+        Self::update_stat(val, max, |cur| Self::compare_greater(descr, val, cur))
+    }
+
+    /// Perform a conditional update of `cur`, skipping any NaN values
+    ///
+    /// If `cur` is `None`, sets `cur` to `Some(val)`, otherwise calls `should_update` with
+    /// the value of `cur`, and updates `cur` to `Some(val)` if it returns `true`

Review Comment:
   If I read this correctly the NaN handling here is to avoid "poisoning" which is OK.
   
   IMHO the stats should include a flag if there were any NaN values, but this is far outside the scope of this PR since this would affect the parquet format as a whole.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org