Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/02/09 20:33:03 UTC

[GitHub] [arrow] wgtmac opened a new pull request, #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

wgtmac opened a new pull request, #34107:
URL: https://github.com/apache/arrow/pull/34107

   ### Rationale for this change
   
   `ColumnWriter::WriteArrowDictionary` attempts to update the page statistics, but it does so incorrectly when a single write is split into batches and more than one page is written.
   
   ### What changes are included in this PR?
   
   Make sure the page statistics are updated for every batch that is written.
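As a rough illustration of the idea (not the actual `column_writer.cc` code), the sketch below writes a buffer in fixed-size batches and closes one page per batch. `PageStats` and `WriteInBatches` are hypothetical names introduced only for this example, and the one-page-per-batch rule is a simplifying assumption. The point is that the statistics update sits inside the batch loop, so each page records only the values it contains.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-page accumulator; not the Parquet Statistics class.
struct PageStats {
  int32_t min = std::numeric_limits<int32_t>::max();
  int32_t max = std::numeric_limits<int32_t>::min();
  int64_t num_values = 0;

  void Update(const int32_t* values, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      min = std::min(min, values[i]);
      max = std::max(max, values[i]);
    }
    num_values += static_cast<int64_t>(n);
  }
};

// Writes `values` in batches of `batch_size` and closes one page per batch
// (a simplifying assumption). The statistics are updated inside the loop,
// so every page sees exactly the values of its own batch.
std::vector<PageStats> WriteInBatches(const std::vector<int32_t>& values,
                                      std::size_t batch_size) {
  assert(batch_size > 0);
  std::vector<PageStats> pages;
  for (std::size_t offset = 0; offset < values.size(); offset += batch_size) {
    const std::size_t n = std::min(batch_size, values.size() - offset);
    PageStats page;                          // fresh statistics for this page
    page.Update(values.data() + offset, n);  // per-batch update
    pages.push_back(page);
  }
  return pages;
}
```

With the values `{1, 2, 3, 10, 20, 30}` and a batch size of 3, this yields two pages whose min/max are 1/3 and 10/30. Updating the statistics once for the whole write would instead attribute all six values to a single page.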
   
   ### Are these changes tested?
   
   Added a test case that fails without the fix.
   
   ### Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34107: GH-34106: [C++][Parquet] Fix updating page stats for WriteArrowDictionary

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34107:
URL: https://github.com/apache/arrow/pull/34107#discussion_r1110335776


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1484,6 +1484,39 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
   std::shared_ptr<::arrow::Array> dictionary = data.dictionary();
   std::shared_ptr<::arrow::Array> indices = data.indices();
 
+  auto update_stats = [&](int64_t num_chunk_levels,
+                          const std::shared_ptr<Array>& chunk_indices) {
+    // TODO(PARQUET-2068) This approach may make two copies.  First, a copy of the
+    // indices array to a (hopefully smaller) referenced indices array.  Second, a copy
+    // of the values array to a (probably not smaller) referenced values array.
+    //
+    // Once the MinMax kernel supports all data types we should use that kernel instead
+    // as it does not make any copies.
+    ::arrow::compute::ExecContext exec_ctx(ctx->memory_pool);
+    exec_ctx.set_use_threads(false);
+
+    std::shared_ptr<::arrow::Array> referenced_dictionary;
+    PARQUET_ASSIGN_OR_THROW(::arrow::Datum referenced_indices,
+                            ::arrow::compute::Unique(*chunk_indices, &exec_ctx));
+
+    // On first run, we might be able to re-use the existing dictionary
+    if (referenced_indices.length() == dictionary->length()) {
+      referenced_dictionary = dictionary;
+    } else {
+      PARQUET_ASSIGN_OR_THROW(
+          ::arrow::Datum referenced_dictionary_datum,
+          ::arrow::compute::Take(dictionary, referenced_indices,
+                                 ::arrow::compute::TakeOptions(/*boundscheck=*/false),
+                                 &exec_ctx));
+      referenced_dictionary = referenced_dictionary_datum.make_array();
+    }
+
+    int64_t non_null_count = chunk_indices->length() - chunk_indices->null_count();
+    page_statistics_->IncrementNullCount(num_chunk_levels - non_null_count);

Review Comment:
   Question: is the reason this isn't just set to `chunk_indices->null_count()` that the null count stats isn't meant to include cases where higher levels (parent fields) are null? Or something else?
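For readers unfamiliar with the min/max part of the quoted `update_stats` lambda, here is a minimal standalone sketch of the same idea without Arrow: collect the distinct indices used by the chunk, then look only at the dictionary entries they reference. `ReferencedMinMax` is a hypothetical helper, the dictionary is assumed to hold strings, and null indices are modelled as `std::nullopt`; the real code uses the `Unique` and `Take` compute kernels instead. Requires C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: min/max over only those dictionary entries that the
// chunk's indices actually reference. A null index is std::nullopt.
std::optional<std::pair<std::string, std::string>> ReferencedMinMax(
    const std::vector<std::string>& dictionary,
    const std::vector<std::optional<int32_t>>& chunk_indices) {
  // Step 1: distinct non-null indices (the role Unique plays in the diff).
  std::set<int32_t> referenced;
  for (const auto& idx : chunk_indices) {
    if (idx.has_value()) referenced.insert(*idx);
  }
  if (referenced.empty()) return std::nullopt;  // all-null chunk: no min/max

  // Step 2: visit the referenced values (the role Take plays) and reduce.
  std::string lo = dictionary.at(*referenced.begin());
  std::string hi = lo;
  for (int32_t idx : referenced) {
    const std::string& value = dictionary.at(idx);  // throws if out of range
    lo = std::min(lo, value);
    hi = std::max(hi, value);
  }
  return std::make_pair(lo, hi);
}
```

Dictionary entries that no index in the chunk points to do not influence the result, which is why the statistics must be derived from the referenced subset rather than from the whole dictionary.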





[GitHub] [arrow] github-actions[bot] commented on pull request #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34107:
URL: https://github.com/apache/arrow/pull/34107#issuecomment-1424776650

   * Closes: #34106




[GitHub] [arrow] wgtmac commented on pull request #34107: GH-34106: [C++][Parquet] Fix updating page stats for WriteArrowDictionary

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34107:
URL: https://github.com/apache/arrow/pull/34107#issuecomment-1438427678

   Gentle ping @wjones127 




[GitHub] [arrow] wgtmac commented on pull request #34107: GH-34106: [C++][Parquet] Fix updating page stats for WriteArrowDictionary

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34107:
URL: https://github.com/apache/arrow/pull/34107#issuecomment-1434887666

   @westonpace @wjones127 Please take a look. Thanks!




[GitHub] [arrow] wgtmac commented on a diff in pull request #34107: GH-34106: [C++][Parquet] Fix updating page stats for WriteArrowDictionary

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34107:
URL: https://github.com/apache/arrow/pull/34107#discussion_r1110469947


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1484,6 +1484,39 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
   std::shared_ptr<::arrow::Array> dictionary = data.dictionary();
   std::shared_ptr<::arrow::Array> indices = data.indices();
 
+  auto update_stats = [&](int64_t num_chunk_levels,
+                          const std::shared_ptr<Array>& chunk_indices) {
+    // TODO(PARQUET-2068) This approach may make two copies.  First, a copy of the
+    // indices array to a (hopefully smaller) referenced indices array.  Second, a copy
+    // of the values array to a (probably not smaller) referenced values array.
+    //
+    // Once the MinMax kernel supports all data types we should use that kernel instead
+    // as it does not make any copies.
+    ::arrow::compute::ExecContext exec_ctx(ctx->memory_pool);
+    exec_ctx.set_use_threads(false);
+
+    std::shared_ptr<::arrow::Array> referenced_dictionary;
+    PARQUET_ASSIGN_OR_THROW(::arrow::Datum referenced_indices,
+                            ::arrow::compute::Unique(*chunk_indices, &exec_ctx));
+
+    // On first run, we might be able to re-use the existing dictionary
+    if (referenced_indices.length() == dictionary->length()) {
+      referenced_dictionary = dictionary;
+    } else {
+      PARQUET_ASSIGN_OR_THROW(
+          ::arrow::Datum referenced_dictionary_datum,
+          ::arrow::compute::Take(dictionary, referenced_indices,
+                                 ::arrow::compute::TakeOptions(/*boundscheck=*/false),
+                                 &exec_ctx));
+      referenced_dictionary = referenced_dictionary_datum.make_array();
+    }
+
+    int64_t non_null_count = chunk_indices->length() - chunk_indices->null_count();
+    page_statistics_->IncrementNullCount(num_chunk_levels - non_null_count);

Review Comment:
   Yes. `chunk_indices->null_count()` is the null count for the current leaf only. `num_chunk_levels - non_null_count` also counts nulls that come from ancestors (e.g. an empty list is also counted as null, but it is not included in `chunk_indices->null_count()`).
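A small worked example of that arithmetic, under the assumption of a `list<int32>` column holding the rows `[[1, null], [], null, [2]]`. `LeafChunk` and `PageNullCount` are hypothetical stand-ins, not Arrow or Parquet types. The rows produce 5 level entries (two for the first row, one each for the others), while the leaf array has 3 slots, of which 2 are non-null.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the leaf (indices) array of one chunk.
struct LeafChunk {
  std::vector<bool> is_valid;  // one entry per leaf slot

  int64_t length() const { return static_cast<int64_t>(is_valid.size()); }
  int64_t null_count() const {
    int64_t n = 0;
    for (bool valid : is_valid) {
      if (!valid) ++n;
    }
    return n;
  }
};

// Same arithmetic as in the quoted diff: every level entry that does not
// correspond to a non-null leaf value is counted as a null.
int64_t PageNullCount(int64_t num_chunk_levels, const LeafChunk& chunk) {
  const int64_t non_null_count = chunk.length() - chunk.null_count();
  assert(non_null_count <= num_chunk_levels);
  return num_chunk_levels - non_null_count;
}

// Leaf slots for the rows [[1, null], [], null, [2]]: 1, null, 2.
LeafChunk ExampleChunk() { return LeafChunk{{true, false, true}}; }
```

Here `ExampleChunk().null_count()` is 1 (only the null leaf), whereas `PageNullCount(5, ExampleChunk())` is 3: the null leaf, the empty list, and the null list.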





[GitHub] [arrow] github-actions[bot] commented on pull request #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34107:
URL: https://github.com/apache/arrow/pull/34107#issuecomment-1424776734

   :warning: GitHub issue #34106 **has been automatically assigned in GitHub** to PR creator.




[GitHub] [arrow] ursabot commented on pull request #34107: GH-34106: [C++][Parquet] Fix updating page stats for WriteArrowDictionary

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34107:
URL: https://github.com/apache/arrow/pull/34107#issuecomment-1439069608

   Benchmark runs are scheduled for baseline = 6850923cc56c57dac28c85088d9c49789f9ecfdc and contender = 476eb2ec40fb1c71ddf004eb60450562480803cb. 476eb2ec40fb1c71ddf004eb60450562480803cb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1425e4a7d62e46daa44c664a574c5217...7e40a6250aad47d4849fd231a1c10b38/)
   [Failed :arrow_down:0.46% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b5db3ecc8ba942349fffaaae59e02a8c...375c88761a384bd5b21619202c92c2c9/)
   [Finished :arrow_down:0.0% :arrow_up:1.02%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/e3e718594eb64d1cb7856deab8f471c3...9e036f1ad2f1475bbf4a24124ef80654/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/aadc53b690bb48daade901b4adb3550b...395a5860fbca48f6b7ac5eb960aca072/)
   Buildkite builds:
   [Finished] [`476eb2ec` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2406)
   [Failed] [`476eb2ec` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2436)
   [Finished] [`476eb2ec` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2404)
   [Finished] [`476eb2ec` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2428)
   [Finished] [`6850923c` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2405)
   [Failed] [`6850923c` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2435)
   [Finished] [`6850923c` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2403)
   [Finished] [`6850923c` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2427)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] mapleFU commented on a diff in pull request #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34107:
URL: https://github.com/apache/arrow/pull/34107#discussion_r1104151837


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1471,6 +1504,7 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
                       AddIfNotNull(rep_levels, offset));
     std::shared_ptr<Array> writeable_indices =
         indices->Slice(value_offset, batch_num_spaced_values);
+    update_stats(/*num_chunk_levels=*/batch_size, writeable_indices);

Review Comment:
   Well, I think we should guard the call here:
   `if (page_statistics_) update_stats(/*num_chunk_levels=*/batch_size, writeable_indices);`
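A minimal sketch of the guard being discussed, assuming statistics can be disabled so that the pointer is null. `SketchStats` and `SketchWriter` are hypothetical types made up for this example; only the `page_statistics_` member name and the `IncrementNullCount` call come from the quoted diff.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical statistics object with the one call used in the diff.
struct SketchStats {
  int64_t null_count = 0;
  void IncrementNullCount(int64_t n) { null_count += n; }
};

// Hypothetical writer: page_statistics_ stays null when stats are disabled.
class SketchWriter {
 public:
  explicit SketchWriter(bool enable_stats)
      : page_statistics_(enable_stats ? std::make_shared<SketchStats>()
                                      : nullptr) {}

  void WriteBatch(int64_t num_levels, int64_t non_null_count) {
    // Guard: skip the update entirely when statistics are disabled,
    // instead of dereferencing a null pointer.
    if (page_statistics_ != nullptr) {
      page_statistics_->IncrementNullCount(num_levels - non_null_count);
    }
  }

  const SketchStats* stats() const { return page_statistics_.get(); }

 private:
  std::shared_ptr<SketchStats> page_statistics_;
};
```

With statistics disabled, `WriteBatch` is a no-op for the stats and `stats()` returns `nullptr`; with statistics enabled, `WriteBatch(5, 2)` leaves a null count of 3.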





[GitHub] [arrow] wgtmac commented on a diff in pull request #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34107:
URL: https://github.com/apache/arrow/pull/34107#discussion_r1104138363


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1471,6 +1504,7 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
                       AddIfNotNull(rep_levels, offset));
     std::shared_ptr<Array> writeable_indices =
         indices->Slice(value_offset, batch_num_spaced_values);
+    update_stats(/*num_chunk_levels=*/batch_size, writeable_indices);

Review Comment:
   Yes, please check: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1069





[GitHub] [arrow] wjones127 merged pull request #34107: GH-34106: [C++][Parquet] Fix updating page stats for WriteArrowDictionary

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #34107:
URL: https://github.com/apache/arrow/pull/34107




[GitHub] [arrow] wgtmac commented on pull request #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34107:
URL: https://github.com/apache/arrow/pull/34107#issuecomment-1424783602

   Converted to draft because I hit another issue: https://github.com/apache/arrow/issues/14870. The C++ parquet reader does not parse column statistics correctly here: https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L214
   ```cpp
   // Extracts encoded statistics from V1 and V2 data page headers
   template <typename H>
   EncodedStatistics ExtractStatsFromHeader(const H& header) {
     EncodedStatistics page_statistics;
     if (!header.__isset.statistics) {
       return page_statistics;
     }
     const format::Statistics& stats = header.statistics;
     if (stats.__isset.max) {
       page_statistics.set_max(stats.max);
     }
     if (stats.__isset.min) {
       page_statistics.set_min(stats.min);
     }
     if (stats.__isset.null_count) {
       page_statistics.set_null_count(stats.null_count);
     }
     if (stats.__isset.distinct_count) {
       page_statistics.set_distinct_count(stats.distinct_count);
     }
     return page_statistics;
   }
   
   ```
   
   




[GitHub] [arrow] mapleFU commented on a diff in pull request #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34107:
URL: https://github.com/apache/arrow/pull/34107#discussion_r1104129353


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1471,6 +1504,7 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
                       AddIfNotNull(rep_levels, offset));
     std::shared_ptr<Array> writeable_indices =
         indices->Slice(value_offset, batch_num_spaced_values);
+    update_stats(/*num_chunk_levels=*/batch_size, writeable_indices);

Review Comment:
   Would `page_statistics_` be a nullptr here?





[GitHub] [arrow] wgtmac commented on a diff in pull request #34107: GH-34106: Fix updating page stats for WriteArrowDictionary

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34107:
URL: https://github.com/apache/arrow/pull/34107#discussion_r1104207425


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1471,6 +1504,7 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
                       AddIfNotNull(rep_levels, offset));
     std::shared_ptr<Array> writeable_indices =
         indices->Slice(value_offset, batch_num_spaced_values);
+    update_stats(/*num_chunk_levels=*/batch_size, writeable_indices);

Review Comment:
   Good catch!


