Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/02/16 20:41:22 UTC

[GitHub] [arrow] westonpace commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

westonpace commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1433688267

   > Yes, it does mean we will. Do you foresee that as an issue? It sounds like Java implementation takes the same approach.
   
   In datasets, for row group statistics, we [recently added a check](https://github.com/apache/arrow/pull/15125) that was roughly...
   
   ```
   if (is_nan(min) && is_nan(max)) {
     // Ignore statistics
   } else if (is_nan(min)) {
     // Assume x <= max
   } else if (is_nan(max)) {
     // Assume x >= min
   } else {
     // Assume min <= x <= max
   }
   ```
   
   In other words, if only one of min or max is NaN, we still use the other side of the inequality. My primary concern is to validate that this is a safe assumption: I want to make sure we aren't using garbage data in our handling of row groups.
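
To make the concern concrete, here is a minimal sketch of how a check like the one above could turn row group statistics into bounds. `Bounds`, `BoundsFromStats` and `MayContainGreaterThan` are hypothetical names used only for illustration; this is not the actual datasets code, and it assumes double-valued statistics.

```cpp
#include <cmath>

// Hypothetical type for illustration only; not an Arrow or Parquet API.
struct Bounds {
  bool has_lower = false;  // true when x >= lower may be assumed
  bool has_upper = false;  // true when x <= upper may be assumed
  double lower = 0.0;
  double upper = 0.0;
};

// Convert raw min/max statistics into bounds, dropping any side that is NaN.
// If both sides are NaN, the result carries no bounds at all, which
// corresponds to ignoring the statistics.
Bounds BoundsFromStats(double min, double max) {
  Bounds b;
  if (!std::isnan(min)) {
    b.has_lower = true;
    b.lower = min;
  }
  if (!std::isnan(max)) {
    b.has_upper = true;
    b.upper = max;
  }
  return b;
}

// Returns false only when the bounds prove that no value in the row group
// can satisfy x > value, so the row group could be skipped.
bool MayContainGreaterThan(const Bounds& b, double value) {
  // Without an upper bound nothing can be ruled out.
  return !b.has_upper || b.upper > value;
}
```

With this shape, a row group with a NaN min and a max of 5 would still be skipped for a filter such as `x > 10`. That is only correct if the non-NaN side can be trusted, which is the assumption I would like to validate.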
   

