Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/02/26 06:47:48 UTC

[GitHub] [arrow] mapleFU commented on a diff in pull request #34054: GH-34053: [C++][Parquet] Write parquet page index

mapleFU commented on code in PR #34054:
URL: https://github.com/apache/arrow/pull/34054#discussion_r1118029417


##########
cpp/src/parquet/statistics.cc:
##########
@@ -494,6 +494,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
                       int64_t null_count, int64_t distinct_count, bool has_min_max,
                       bool has_null_count, bool has_distinct_count, MemoryPool* pool)
       : TypedStatisticsImpl(descr, pool) {
+    has_null_count_ = has_null_count;
+    has_distinct_count_ = has_distinct_count;

Review Comment:
   I ran into the same problem here. I think the semantics of "has_xxx" should be as follows. For the writer:
   * The writer can guarantee a correct null_count (assuming it has no bugs).
   * As far as I can tell, ndv is currently never collected. If a user collects ndv for page 1 but not for page 2, the ndv should be discarded.
   
   For the reader:
   * When deserializing, the reader should assume that ndv and null_count may be unset (currently it does not work this way).
   * Deserialized statistics can be merged, but if either `null_count` or `ndv` is unset, the corresponding merged value should be discarded.
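
A minimal sketch of the merge rule described above. `CountStats` and `MergeCounts` are hypothetical names used only for illustration; they are not part of the parquet C++ API, and the struct is a simplified stand-in for the `has_null_count_` / `has_distinct_count_` state touched in the diff. The sketch also assumes that ndv is dropped on every merge, because an exact distinct count cannot be derived from two per-page counts.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the statistics state discussed above.
// This is NOT the parquet::Statistics API.
struct CountStats {
  int64_t null_count = 0;
  int64_t distinct_count = 0;
  bool has_null_count = false;
  bool has_distinct_count = false;
};

// Merge `other` into `self`.
// null_count stays valid only if both sides have it; otherwise it is discarded.
inline void MergeCounts(CountStats& self, const CountStats& other) {
  if (self.has_null_count && other.has_null_count) {
    self.null_count += other.null_count;
  } else {
    self.has_null_count = false;
    self.null_count = 0;
  }
  // Assumption of this sketch: distinct counts are not additive (the same
  // value may occur in both inputs), so ndv is always discarded on merge,
  // which also covers the case where only one side collected it.
  self.has_distinct_count = false;
  self.distinct_count = 0;
}
```

With this rule, merging two inputs that both carry a null count sums them, while merging with an input whose null count is unset clears the flag so that a reader never sees a partial total.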


