You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/02/10 03:57:26 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

wgtmac opened a new pull request, #34112:
URL: https://github.com/apache/arrow/pull/34112

   ### Rationale for this change
   
   The code below does not read from stats.min_value/max_value at all.
   ```cpp
   // Extracts encoded statistics from V1 and V2 data page headers
   template <typename H>
   EncodedStatistics ExtractStatsFromHeader(const H& header) {
     EncodedStatistics page_statistics;
     if (!header.__isset.statistics) {
       return page_statistics;
     }
     const format::Statistics& stats = header.statistics;
     if (stats.__isset.max) {
       page_statistics.set_max(stats.max);
     }
     if (stats.__isset.min) {
       page_statistics.set_min(stats.min);
     }
     if (stats.__isset.null_count) {
       page_statistics.set_null_count(stats.null_count);
     }
     if (stats.__isset.distinct_count) {
       page_statistics.set_distinct_count(stats.distinct_count);
     }
     return page_statistics;
   }
   ```
   
   ### What changes are included in this PR?
   
   Do similar thing from parquet-mr to check and read min_value/max_value from thrift stats.
   
   ### Are these changes tested?
   
   Some test cases fail after the fix. Fixed them to make sure it is covered.
   
   ### Are there any user-facing changes?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1435283899

   Benchmark runs are scheduled for baseline = 1264e40918ba95f7c0b1725e296ada25b1385b3d and contender = 8e5e438d29868396580ef00c706cf02b1b57acb1. 8e5e438d29868396580ef00c706cf02b1b57acb1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/51e2bc82e6854fbaa2410b3b388f516f...147614dab363468db23bbb9f9bc52d2e/)
   [Failed :arrow_down:0.34% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/3a87b15df78e4314be33046eeb484393...89bddae129874825b04c79ca42977f39/)
   [Finished :arrow_down:1.02% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7a86f65e48b84b899e207635e93897ad...30135b65c5254dd48845976c47c7d9c9/)
   [Finished :arrow_down:0.54% :arrow_up:0.03%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6ec40e5b818842b58da675cd9617a078...0358ad92212b4a2fb4b710bd562875d3/)
   Buildkite builds:
   [Finished] [`8e5e438d` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2393)
   [Failed] [`8e5e438d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2423)
   [Finished] [`8e5e438d` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2391)
   [Finished] [`8e5e438d` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2415)
   [Finished] [`1264e409` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2392)
   [Finished] [`1264e409` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2422)
   [Finished] [`1264e409` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2390)
   [Finished] [`1264e409` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2414)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1428058070

   Gentle ping @wjones127 @pitrou 
   
   Once this gets merged, I will rebase https://github.com/apache/arrow/pull/34107 which is blocked by it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1433984603

   > It seems like we do have handling for these two cases. 
   
   Just for clarity, the PR I linked (and my thoughts) were about how we currently handle row group statistics.  I'm not sure if the rules are identical for page statistics.  I mainly wanted to make sure my understanding of the row group statistics wasn't invalid.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34112:
URL: https://github.com/apache/arrow/pull/34112#discussion_r1109065285


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled
+  if (stats.__isset.max_value && stats.__isset.min_value) {
+    // TODO: check if the column_order is TYPE_DEFINED_ORDER.
+    page_statistics.set_max(stats.max_value);
+    page_statistics.set_min(stats.min_value);

Review Comment:
   ```suggestion
     if (stats.__isset.max_value || stats.__isset.min_value) {
       // TODO: check if the column_order is TYPE_DEFINED_ORDER.
       if (stats.__isset.max_value) {
         page_statistics.set_max(stats.max_value);
       }
       if (stats.__isset.min_value) {
         page_statistics.set_min(stats.min_value);
       }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1433773456

   @westonpace that makes sense.
   
   > When only one of min and max exists, it usually happens when a binary value has an extreme length or a floating value has NaN. In this case, the stats provide little value and make it tricker to use.
   
   It seems like we do have handling for these two cases. See Weston's message for NaN handling and `max_statistics_size` on [WriterProperties](https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet16WriterPropertiesE). Based on that, I'd actually prefer we keep the ability to parse just the min or max if only one is available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1425139885

   :warning: GitHub issue #14870 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1425139850

   * Closes: #14870


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34112:
URL: https://github.com/apache/arrow/pull/34112#discussion_r1109238934


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled

Review Comment:
   So we revert back to previous mode that only has min or max is ok?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1434090084

   The CI build is failed due to a recent branch rename and will be fixed by https://github.com/apache/arrow/pull/34233


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34112:
URL: https://github.com/apache/arrow/pull/34112#discussion_r1102348619


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled

Review Comment:
   Yes, page stats without either min or max is corrupted and cannot be used. Check parquet-mr for detail: https://github.com/apache/parquet-mr/blob/5290bd5e0ee5dc30db0576e2bfc6eea335c465cf/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L797



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1426590211

   * Closes: #34138


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1426593092

   > This looks good. @wgtmac before we merge, could you create a new issue for this bug fix? Also, if you could create issues for the TODOs as well, that would be appreciated.
   
   I have created a new issue for this PR and updated the title.
   
   A separate issue has been created for the TODO items: https://github.com/apache/arrow/issues/34139
   
   Thanks for the review! @wjones127 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1425194955

   @pitrou @wjones127 @westonpace Could you please take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1426357816

   > Does this mean the following (testing my understanding)?
   > 
   > If min is set but max is not set then we should ignore both min and max
   
   Yes, it does mean we will. Do you foresee that as an issue? It sounds like Java implementation takes the same approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34112:
URL: https://github.com/apache/arrow/pull/34112#discussion_r1102326408


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled

Review Comment:
   Previously, `page_statistics` will handle min-max separately. This patch changes it to once have all min-max, otherwise, cannot use min-max



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1426590216

   :warning: GitHub issue #34138 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1433688267

   > Yes, it does mean we will. Do you foresee that as an issue? It sounds like Java implementation takes the same approach.
   
   In datasets, for row group statistics, we [recently added a check](https://github.com/apache/arrow/pull/15125) that was roughly...
   
   ```
   if (is_nan(min) && is_nan(max)) {
     // Ignore statistics
   } else if (is_nan(min)) {
     // Assume x <= max
   } else if(is_nan(max)) {
     // Assume x >= min
   } else {
     // Assume min <= x <= max
   }
   ```
   
   In other words, if one of min or max is NaN then we still use the other side of the equality.  I think my primary concern is to validate that is a safe assumption.  In other words, I want to make sure we aren't using garbage data in our handling of row groups.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1434090362

   @westonpace Could you please take another pass?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34112:
URL: https://github.com/apache/arrow/pull/34112#discussion_r1104122046


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled
+  if (stats.__isset.max_value && stats.__isset.min_value) {
+    // TODO: check if the column_order is TYPE_DEFINED_ORDER.
+    page_statistics.set_max(stats.max_value);
+    page_statistics.set_min(stats.min_value);
+  } else if (stats.__isset.max && stats.__isset.min) {
+    // TODO: check created_by to see if it is corrupted for some types.

Review Comment:
   I've no problem here, but just curious, does this code meaning the parquet-mr's `CorruptStatistics.shouldIgnoreStatistics`? (It's really trickey...)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34112:
URL: https://github.com/apache/arrow/pull/34112#discussion_r1102404651


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled

Review Comment:
   Although in parquet.thrift, min-max can exist only one. But I think handling it like this is ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1433996973

   > > It seems like we do have handling for these two cases.
   > 
   > Just for clarity, the PR I linked (and my thoughts) were about how we currently handle row group statistics. I'm not sure if the rules are identical for page statistics. I mainly wanted to make sure my understanding of the row group statistics wasn't invalid.
   
   IIUC, row group statistics are aggregated from page statistics so they should share the same rules. The parquet thrift message definition does allow only one side of min or max exist:
   ```thrift
   /**
    * Statistics per row group and per page
    * All fields are optional.
    */
   struct Statistics {
      /**
       * DEPRECATED: min and max value of the column. Use min_value and max_value.
       *
       * Values are encoded using PLAIN encoding, except that variable-length byte
       * arrays do not include a length prefix.
       *
       * These fields encode min and max values determined by signed comparison
       * only. New files should use the correct order for a column's logical type
       * and store the values in the min_value and max_value fields.
       *
       * To support older readers, these may be set when the column order is
       * signed.
       */
      1: optional binary max;
      2: optional binary min;
      /** count of null value in the column */
      3: optional i64 null_count;
      /** count of distinct values occurring */
      4: optional i64 distinct_count;
      /**
       * Min and max values for the column, determined by its ColumnOrder.
       *
       * Values are encoded using PLAIN encoding, except that variable-length byte
       * arrays do not include a length prefix.
       */
      5: optional binary max_value;
      6: optional binary min_value;
   }
   ```
   
   On the other side, the story of page index is different. The column index definition does require existence of both min and max values if it is not a null page:
   ```thrift
   /**
    * Description for ColumnIndex.
    * Each <array-field>[i] refers to the page at OffsetIndex.page_locations[i]
    */
   struct ColumnIndex {
     /**
      * A list of Boolean values to determine the validity of the corresponding
      * min and max values. If true, a page contains only null values, and writers
      * have to set the corresponding entries in min_values and max_values to
      * byte[0], so that all lists have the same length. If false, the
      * corresponding entries in min_values and max_values must be valid.
      */
     1: required list<bool> null_pages
   
     /**
      * Two lists containing lower and upper bounds for the values of each page
      * determined by the ColumnOrder of the column. These may be the actual
      * minimum and maximum values found on a page, but can also be (more compact)
      * values that do not exist on a page. For example, instead of storing ""Blart
      * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C".
      * Such more compact values must still be valid values within the column's
      * logical type. Readers must make sure that list entries are populated before
      * using them by inspecting null_pages.
      */
     2: required list<binary> min_values
     3: required list<binary> max_values
   
     /**
      * Stores whether both min_values and max_values are ordered and if so, in
      * which direction. This allows readers to perform binary searches in both
      * lists. Readers cannot assume that max_values[i] <= min_values[i+1], even
      * if the lists are ordered.
      */
     4: required BoundaryOrder boundary_order
   
     /** A list containing the number of null values for each page **/
     5: optional list<i64> null_counts
   }
   ```
   
   So I am fine with parsing only one side min or max values from page/row group statistics. @westonpace @wjones127 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 merged pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #34112:
URL: https://github.com/apache/arrow/pull/34112


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1432037721

   CC @fatemehp 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34112: GH-14870: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1426284226

   Does this mean the following (testing my understanding)?
   
   If min is set but max is not set then we should ignore both min and max
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1426592979

   > Does this mean the following (testing my understanding)?
   > 
   > If min is set but max is not set then we should ignore both min and max
   
   When only one of min and max exists, it usually happens when a binary value has an extreme length or a floating value has NaN. In this case, the stats provide little value and make it tricker to use.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34112: GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34112:
URL: https://github.com/apache/arrow/pull/34112#discussion_r1109211541


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled
+  if (stats.__isset.max_value && stats.__isset.min_value) {
+    // TODO: check if the column_order is TYPE_DEFINED_ORDER.
+    page_statistics.set_max(stats.max_value);
+    page_statistics.set_min(stats.min_value);

Review Comment:
   Fixed. Please take a look again. Thanks @wjones127 !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org