You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/02/24 08:28:47 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

wgtmac opened a new pull request, #34327:
URL: https://github.com/apache/arrow/pull/34327

   ### Rationale for this change
   
   Parquet ColumnWriter obtains null_count of a page from page stats as below ([link](https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L952))
   ```cpp
     EncodedStatistics page_stats = GetPageStatistics();
   
     int32_t null_count = static_cast<int32_t>(page_stats.null_count);
   
     DataPageV2 page(combined, num_values, null_count, num_rows, encoding_,
                       def_levels_byte_length, rep_levels_byte_length, uncompressed_size,
                       pager_->has_compressor(), page_stats);
   ```
   
   However, the null_count is uninitialized if page stat is not enabled:
   ```cpp
     EncodedStatistics GetPageStatistics() override {
       EncodedStatistics result;
       if (page_statistics_) result = page_statistics_->Encode();
       return result;
     }
   ```
   
   ### What changes are included in this PR?
   
   ColumnWriter collects null_count by itself. To be safe, it also checks that from page stats if available.
   
   ### Are these changes tested?
   
   Added a test case to cover null counts of optional and repeated fields are properly set.
   
   ### Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34327:
URL: https://github.com/apache/arrow/pull/34327#discussion_r1117109966


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -949,11 +954,17 @@ void ColumnWriterImpl::BuildDataPageV2(int64_t definition_levels_rle_size,
   ResetPageStatistics();
 
   int32_t num_values = static_cast<int32_t>(num_buffered_values_);
-  int32_t null_count = static_cast<int32_t>(page_stats.null_count);
+  int32_t null_count = static_cast<int32_t>(num_buffered_nulls_);
   int32_t num_rows = static_cast<int32_t>(num_buffered_rows_);
   int32_t def_levels_byte_length = static_cast<int32_t>(definition_levels_rle_size);
   int32_t rep_levels_byte_length = static_cast<int32_t>(repetition_levels_rle_size);
 
+  // page_stats.null_count is not set when page_statistics_ is nullptr. It is only used
+  // here for safety check.
+  if (page_stats.has_null_count && page_stats.null_count != num_buffered_nulls_) {

Review Comment:
   I agree. Will switch to `DCHECK` accordingly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34327:
URL: https://github.com/apache/arrow/pull/34327#issuecomment-1454114747

   Benchmark runs are scheduled for baseline = a3cd962b8bc5c49f33824202dff382af13cfa873 and contender = 7c764c3ef55db034239dd9244874168587232edd. 7c764c3ef55db034239dd9244874168587232edd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/fd721793ade24b45bd037871e7b83bce...89dc2f46b95448cbbf0f52d418ab06ef/)
   [Finished :arrow_down:0.4% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/a5c62d9306cd46a19c7887cb53ad0f15...7398d61729ec4120bd259de8ef591db8/)
   [Finished :arrow_down:0.51% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2b84465ea16a46fca89ae9e666434496...697470574a9a41c58cd4e13f8dd2ea67/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0a4fe2ccbabb42f98a2cc7653e735873...d444a00c0d964a8d946a9c31a042c34d/)
   Buildkite builds:
   [Finished] [`7c764c3e` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2465)
   [Finished] [`7c764c3e` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2495)
   [Finished] [`7c764c3e` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2463)
   [Finished] [`7c764c3e` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2486)
   [Finished] [`a3cd962b` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2464)
   [Finished] [`a3cd962b` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2494)
   [Finished] [`a3cd962b` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2462)
   [Finished] [`a3cd962b` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2485)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace merged PR #34327:
URL: https://github.com/apache/arrow/pull/34327


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34327:
URL: https://github.com/apache/arrow/pull/34327#issuecomment-1451135516

   > This looks good. I appreciate the thorough testing :)
   
   Thanks for your review! @wjones127 @westonpace 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34327:
URL: https://github.com/apache/arrow/pull/34327#issuecomment-1443121339

   :warning: GitHub issue #34326 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34327:
URL: https://github.com/apache/arrow/pull/34327#issuecomment-1443128284

   @emkornfield @wjones127 @westonpace Could you please take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34327:
URL: https://github.com/apache/arrow/pull/34327#discussion_r1117100117


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1362,7 +1375,7 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
           *out_spaced_values_to_write +=
               def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
         }
-        *null_count = *out_values_to_write - *out_spaced_values_to_write;
+        *null_count = batch_size - *out_values_to_write;

Review Comment:
   If this line does not change, the test cases `StatsTests::ParameterizedStatisticsTest` in the `arrow_statistics_test.cc` simply fail due to incorrect null_count. Here is why:
   - `out_values_to_write`: number of non-null values only
   - `out_spaced_values_to_write`: number of non-null values and null values
   - `batch_size`: number of all values including non-null values, null values, and empty list (which is null from ancestor)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34327:
URL: https://github.com/apache/arrow/pull/34327#issuecomment-1443121168

   * Closes: #34326


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34327:
URL: https://github.com/apache/arrow/pull/34327#discussion_r1117082245


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1362,7 +1375,7 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
           *out_spaced_values_to_write +=
               def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
         }
-        *null_count = *out_values_to_write - *out_spaced_values_to_write;
+        *null_count = batch_size - *out_values_to_write;

Review Comment:
   Why did this calculation need to change?



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -949,11 +954,17 @@ void ColumnWriterImpl::BuildDataPageV2(int64_t definition_levels_rle_size,
   ResetPageStatistics();
 
   int32_t num_values = static_cast<int32_t>(num_buffered_values_);
-  int32_t null_count = static_cast<int32_t>(page_stats.null_count);
+  int32_t null_count = static_cast<int32_t>(num_buffered_nulls_);
   int32_t num_rows = static_cast<int32_t>(num_buffered_rows_);
   int32_t def_levels_byte_length = static_cast<int32_t>(definition_levels_rle_size);
   int32_t rep_levels_byte_length = static_cast<int32_t>(repetition_levels_rle_size);
 
+  // page_stats.null_count is not set when page_statistics_ is nullptr. It is only used
+  // here for safety check.
+  if (page_stats.has_null_count && page_stats.null_count != num_buffered_nulls_) {

Review Comment:
   Should this maybe be a `DCHECK` since users shouldn't be able to influence these values and wouldn't understand the exception?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34327: GH-34326: [C++][Parquet] Page null_count is incorrect if stats is disabled

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34327:
URL: https://github.com/apache/arrow/pull/34327#issuecomment-1448460278

   @wjones127 @emkornfield @pitrou Could you take a look please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org