You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/04 08:32:32 UTC

[GitHub] [arrow-rs] crepererum opened a new pull request #256: fix NaN handling in parquet statistics

crepererum opened a new pull request #256:
URL: https://github.com/apache/arrow-rs/pull/256


   # Which issue does this PR close?
   Closes #225.
   
    # Rationale for this change
   Fixes NaN handling in parquet statistics. This is in line with the C++ stack:
   
   https://github.com/apache/arrow/blob/b3e43987c47b2f01b204a2d954f882f7161616ef/cpp/src/parquet/statistics_test.cc#L1000-L1043
   
   # What changes are included in this PR?
   Filter out NaN values from statistics + tests.
   
   # Are there any user-facing changes?
   Yes: formally NaN were included in the stats but at "random" (i.e. when the data started with an NaN than the min/max values are NaN, otherwise min/max are non-NaN). Now the behavior is: NaN are excluded always. If the only NaN values are present, then min/max are unset.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #256: fix NaN handling in parquet statistics

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #256:
URL: https://github.com/apache/arrow-rs/pull/256#issuecomment-831780039


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/256?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#256](https://codecov.io/gh/apache/arrow-rs/pull/256?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (acd4f52) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8f030db53d9eda901c82db9daf94339fc447d0db?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8f030db) will **increase** coverage by `0.01%`.
   > The diff coverage is `90.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/256/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/256?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #256      +/-   ##
   ==========================================
   + Coverage   82.52%   82.53%   +0.01%     
   ==========================================
     Files         162      162              
     Lines       43672    43786     +114     
   ==========================================
   + Hits        36039    36139     +100     
   - Misses       7633     7647      +14     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/256?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/column/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/256/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3dyaXRlci5ycw==) | `93.76% <90.00%> (-0.27%)` | :arrow_down: |
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/256/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `81.25% <0.00%> (-0.44%)` | :arrow_down: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/256/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `98.42% <0.00%> (+0.09%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/256?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/256?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8f030db...acd4f52](https://codecov.io/gh/apache/arrow-rs/pull/256?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me merged pull request #256: fix NaN handling in parquet statistics

Posted by GitBox <gi...@apache.org>.
nevi-me merged pull request #256:
URL: https://github.com/apache/arrow-rs/pull/256


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #256: fix NaN handling in parquet statistics

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #256:
URL: https://github.com/apache/arrow-rs/pull/256#issuecomment-831829667


   Thanks @crepererum !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] crepererum commented on a change in pull request #256: fix NaN handling in parquet statistics

Posted by GitBox <gi...@apache.org>.
crepererum commented on a change in pull request #256:
URL: https://github.com/apache/arrow-rs/pull/256#discussion_r625605221



##########
File path: parquet/src/column/writer.rs
##########
@@ -922,11 +922,14 @@ impl<T: DataType> ColumnWriterImpl<T> {
     }
 
     fn update_page_min_max(&mut self, val: &T::T) {
-        if self.min_page_value.as_ref().map_or(true, |min| min > val) {
-            self.min_page_value = Some(val.clone());
-        }
-        if self.max_page_value.as_ref().map_or(true, |max| max < val) {
-            self.max_page_value = Some(val.clone());
+        // simple "isNaN" check that works for all types
+        if val == val {

Review comment:
       Should we introduce some generic `T::is_nan` function into the data type abstraction layer (that basically returns `False` for every non-float/double type and calls `.is_nan()` for float/double) or is this local workaround good enough?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on a change in pull request #256: fix NaN handling in parquet statistics

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #256:
URL: https://github.com/apache/arrow-rs/pull/256#discussion_r626248018



##########
File path: parquet/src/column/writer.rs
##########
@@ -922,11 +922,14 @@ impl<T: DataType> ColumnWriterImpl<T> {
     }
 
     fn update_page_min_max(&mut self, val: &T::T) {
-        if self.min_page_value.as_ref().map_or(true, |min| min > val) {
-            self.min_page_value = Some(val.clone());
-        }
-        if self.max_page_value.as_ref().map_or(true, |max| max < val) {
-            self.max_page_value = Some(val.clone());
+        // simple "isNaN" check that works for all types
+        if val == val {

Review comment:
       I don't think that's necessary. It'll just be adding overhead for us.
   
   The compiler can figure out what we're trying to do, and will optimise out the branch for anything that's not a floating point number.
   
   Look at this example: https://godbolt.org/z/z7M3Mfqzz
   
   The `is_nan::<i32>()` always returns `false`, and the `ucomisd` does the floating point comparison




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #256: fix NaN handling in parquet statistics

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #256:
URL: https://github.com/apache/arrow-rs/pull/256#issuecomment-831827636


   I believe this fixes #255 rather than #225 so I updated the description accordingly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org