Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/04 08:11:48 UTC

[GitHub] [arrow-rs] crepererum opened a new issue #255: NaNs can break parquet statistics

crepererum opened a new issue #255:
URL: https://github.com/apache/arrow-rs/issues/255


   **Describe the bug**
   NaN can occur in parquet statistics and override all other possible values. This is very similar to [PARQUET-1225](https://issues.apache.org/jira/browse/PARQUET-1225) which was filed for the C++ implementation.
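   
   As background, here is a minimal sketch of why NaN tends to upset comparison-based min/max tracking. The `naive_min` function is hypothetical and is not the arrow-rs implementation; it only illustrates that every ordered comparison involving NaN is false, so whether a NaN sticks depends on where it appears and on how the comparison is written.
   
   ```rust
   /// Hypothetical naive minimum: keeps the current value unless `v < min`.
   /// This is an illustration only, not the code used by the parquet crate.
   fn naive_min(values: &[f32]) -> f32 {
       let mut min = values[0];
       for &v in &values[1..] {
           if v < min {
               min = v;
           }
       }
       min
   }
   
   fn main() {
       // Every ordered comparison involving NaN evaluates to false.
       assert!(!(f32::NAN < 1.0));
       assert!(!(f32::NAN > 1.0));
       assert!(f32::NAN.partial_cmp(&1.0).is_none());
   
       // NaN first: no later value compares as smaller, so NaN is never replaced.
       assert!(naive_min(&[f32::NAN, 1.0, 2.0]).is_nan());
   
       // NaN in the middle: this particular loop happens to skip it.
       assert_eq!(naive_min(&[1.0, f32::NAN, 2.0]), 1.0);
   }
   ```
   
   A loop written with the comparison negated or reversed would behave differently for the same inputs, which is why NaN needs explicit handling rather than relying on `<` and `>`.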
   
   **To Reproduce**
   Add the following tests:
   
   ```rust
   #[test]
   fn test_float_statistics_nan_middle() {
       let stats = statistics_roundtrip::<FloatType>(&[1.0, f32::NAN, 2.0]);
       assert!(stats.has_min_max_set());
       if let Statistics::Float(stats) = stats {
           assert_eq!(stats.min(), &1.0);
           assert_eq!(stats.max(), &2.0);
       } else {
           panic!("expecting Statistics::Float");
       }
   }
   
   #[test]
   fn test_float_statistics_nan_start() {
       let stats = statistics_roundtrip::<FloatType>(&[f32::NAN, 1.0, 2.0]);
       assert!(stats.has_min_max_set());
       if let Statistics::Float(stats) = stats {
           assert_eq!(stats.min(), &1.0);
           assert_eq!(stats.max(), &2.0);
       } else {
           panic!("expecting Statistics::Float");
       }
   }
   
   #[test]
   fn test_float_statistics_nan_only() {
       let stats = statistics_roundtrip::<FloatType>(&[f32::NAN, f32::NAN]);
       assert!(!stats.has_min_max_set());
       assert!(matches!(stats, Statistics::Float(_)));
   }
   
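   // Writes `values` through a test column writer and returns the statistics
   // recorded in the resulting column metadata.
   fn statistics_roundtrip<T: DataType>(values: &[<T as DataType>::T]) -> Statistics {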
       let page_writer = get_test_page_writer();
       let props = Arc::new(WriterProperties::builder().build());
       let mut writer = get_test_column_writer::<T>(page_writer, 0, 0, props);
       writer.write_batch(values, None, None).unwrap();
   
       let (_bytes_written, _rows_written, metadata) = writer.close().unwrap();
       if let Some(stats) = metadata.statistics() {
           stats.clone()
       } else {
           panic!("metadata missing statistics");
       }
   }
   ```
   
   **Note that while the tests are written for `f32`/float, this also applies to `f64`/double.**
   
   **Expected behavior**
   NaNs should be ignored during statistics calculation. If only NaNs are present, the min and max values should be unset.
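   
   The expected behavior can be sketched as a standalone function. This is only an illustration of the semantics under the assumption that "unset" maps to `None`; `min_max_ignoring_nan` is a hypothetical name and not part of the parquet crate's API.
   
   ```rust
   /// Returns `(min, max)` over the non-NaN values, or `None` if the input
   /// is empty or contains only NaNs. Hypothetical helper for illustration.
   fn min_max_ignoring_nan(values: &[f32]) -> Option<(f32, f32)> {
       values
           .iter()
           .copied()
           // Drop NaNs before they can take part in any comparison.
           .filter(|v| !v.is_nan())
           .fold(None, |acc, v| match acc {
               None => Some((v, v)),
               Some((min, max)) => Some((min.min(v), max.max(v))),
           })
   }
   
   fn main() {
       // Mirrors the three test cases from the reproduction.
       assert_eq!(min_max_ignoring_nan(&[1.0, f32::NAN, 2.0]), Some((1.0, 2.0)));
       assert_eq!(min_max_ignoring_nan(&[f32::NAN, 1.0, 2.0]), Some((1.0, 2.0)));
       assert_eq!(min_max_ignoring_nan(&[f32::NAN, f32::NAN]), None);
   }
   ```
   
   Returning an `Option` keeps the "only NaNs" case distinct from any real value, which corresponds to `has_min_max_set()` being false in the tests.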
   
   **Additional context**
   Tested commit was `8f030db53d9eda901c82db9daf94339fc447d0db`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me closed issue #255: NaNs can break parquet statistics

Posted by GitBox <gi...@apache.org>.
nevi-me closed issue #255:
URL: https://github.com/apache/arrow-rs/issues/255


   

