You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/06/08 04:06:20 UTC

[GitHub] [arrow] mapleFU opened a new pull request, #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

mapleFU opened a new pull request, #35989:
URL: https://github.com/apache/arrow/pull/35989

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   ### What changes are included in this PR?
   
   1. This patch does some tiny optimizations on Parquet C++ Statistics. It does:
   
   ```
   For min-max, using std::string. Because assume the case like that:
   EncodedStatistics c1;
   // do some operations
   EncodedStatistics c2 = c1;
   c2.set_max("dasdasdassd");
   After c2 set, c1 would be set too. So I use std::string here.
   ```
   
   2. Force setting ndv count during merging, and add some comments
   3. Add some specification in Statistics API
   
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Yes
   
   ### Are there any user-facing changes?
   
   No
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229914958


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   @pitrou As Gang mentioned here
   
   1. In theory, `has_null_count_` is not added able if another statistics has `has_null_count_ = true`
   2. However, in our implementions, builder will always set `has_null_count_ == true`
   
   Maybe I can use the same way to treat them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1242452062


##########
cpp/src/parquet/statistics.cc:
##########
@@ -610,8 +619,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    if (HasNullCount()) {
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;

Review Comment:
   Added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1623833092

   Gently ping @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1230503626


##########
cpp/src/parquet/statistics.cc:
##########
@@ -602,6 +617,7 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   }
 
   EncodedStatistics Encode() override {
+    // Currently, distinct_count will never be set in EncodedStatistics.

Review Comment:
   Yes we should, let me fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1231810987


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   CMIW, `null_count_` can be safely accumulated and merged, so `has_null_count_` is set to true and `null_count_` is set to 0 at its initial stage. During the building process, `has_null_count_` should always be true unless something wrong happens.
   
   On the contrary, `distinct_count_` cannot be merged and actually it is not set in the building process. It makes sense to set it to false at its initial stage.
   
   I think @mapleFU means `distinct_count_` cannot be merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1242373253


##########
cpp/src/parquet/statistics.cc:
##########
@@ -543,7 +557,7 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     ResetCounts();
     has_min_max_ = false;
     has_distinct_count_ = false;
-    has_null_count_ = false;
+    has_null_count_ = true;

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1620449603

   ```
   [ RUN      ] TwoLevelCacheWithExpirationTest.CleanupPeriodOk
   /arrow/cpp/src/parquet/encryption/two_level_cache_with_expiration_test.cc:118: Failure
   Expected equality of these values:
     lifetime2->size()
       Which is: 0
     2
   [  FAILED  ] TwoLevelCacheWithExpirationTest.CleanupPeriodOk (525 ms)
   ```
   
   No idea why this is failed...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229908532


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   Again, why the difference between `has_null_count_` and `has_distinct_count_`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229957716


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   Hmm... sorry, but I don't understand what you mean with "not added able".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229929330


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.
+    if (has_distinct_count_) {
+      has_distinct_count_ = false;
     }

Review Comment:
   Simply:
   ```suggestion
       has_distinct_count_ = false;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1598859724

   https://github.com/apache/arrow/pull/35989#issuecomment-1581985853 @pitrou @wgtmac Should we trying to fix this?
   
   when `!metadata.statistics.__isset.null_count`, the `num_values` would be wrong


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1581935377

   Some notes:
   1. Builder of `Statistics` will always have `has_null_count_`, and ignore `has_distinct_count_` (not build when building `EncodedStatistics`). `has_min_max_` will be `false` if page is all nulls or nulls and NaN
   2. Reader will able to set `distinct_count` and `has_distinct_count`. Which would be valid, but when it call `Encoded`, the status will not be included in `EncodedStatistics`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252560357


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics read from thrift, it might not have null_count,
+  // and page merge with it should also not have null_count.
+  void TestMergeNullCount() {
+    this->GenerateData(1000);

Review Comment:
   ```suggestion
       this->GenerateData(/*num_values=*/1000);
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics read from thrift, it might not have null_count,
+  // and page merge with it should also not have null_count.
+  void TestMergeNullCount() {
+    this->GenerateData(1000);
+
+    // Page should have null-count even if no nulls
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/this->values_.size(),
+                          /*null_count=*/0);
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_TRUE(statistics1->HasNullCount());
+      EXPECT_EQ(0, statistics1->null_count());
+      EXPECT_TRUE(statistics1->Encode().has_null_count);
+    }
+    // Merge with null-count should also have null count
+    VerifyMergedStatistics(*statistics1, *statistics1,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasNullCount());
+                             EXPECT_EQ(0, merged_statistics->null_count());
+                             auto encoded = merged_statistics->Encode();
+                             EXPECT_TRUE(encoded.has_null_count);
+                             EXPECT_EQ(0, encoded.null_count);
+                           });
+
+    // When loaded from thrift, might not have null count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_null_count = false;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics2->Encode().has_null_count);
+      EXPECT_FALSE(statistics2->HasNullCount());
+    }
+
+    // Merge without null-count should not have null count
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasNullCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_null_count);
+                           });
+  }
+
+  // statistics.all_null_value is used to build the page index.
+  // If statistics doesn't have null count, all_null_value should be false.
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }

Review Comment:
   ```suggestion
     void TestNotHasNullValue() {
       EncodedStatistics encoded_statistics;
       encoded_statistics.has_null_count = false;
       auto statistics = Statistics::Make(this->schema_.Column(0), &encoded_statistics,
                                          /*num_values=*/1000);
       auto typed_stats = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics);
       EXPECT_FALSE(typed_stats->HasNullCount());
       auto encoded = typed_stats->Encode();
       EXPECT_FALSE(encoded.all_null_value);
     }
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics read from thrift, it might not have null_count,
+  // and page merge with it should also not have null_count.

Review Comment:
   ```suggestion
     // Default statistics should have null_count even if no nulls is written.
     // However, if statistics is created from thrift message, it might not
     // have null_count. Merging statistics from such page will result in an
     // invalid null_count as well.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +555,18 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   ```suggestion
       // null_count is always valid when merging page statistics into
       // column chunk statistics.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   OK, thanks for confirming!



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());

Review Comment:
   ```suggestion
                             /*null_count=*/ this->values_.size());
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());

Review Comment:
   Thanks for the info!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1581985853

   Another problem is that:
   
   ```c++
   template <typename DType>
   static std::shared_ptr<Statistics> MakeTypedColumnStats(
       const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
     // If ColumnOrder is defined, return max_value and min_value
     if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER) {
       return MakeStatistics<DType>(
           descr, metadata.statistics.min_value, metadata.statistics.max_value,
           metadata.num_values - metadata.statistics.null_count,
           metadata.statistics.null_count, metadata.statistics.distinct_count,
           metadata.statistics.__isset.max_value || metadata.statistics.__isset.min_value,
           metadata.statistics.__isset.null_count,
           metadata.statistics.__isset.distinct_count);
     }
     // Default behavior
     return MakeStatistics<DType>(
         descr, metadata.statistics.min, metadata.statistics.max,
         metadata.num_values - metadata.statistics.null_count,
         metadata.statistics.null_count, metadata.statistics.distinct_count,
         metadata.statistics.__isset.max || metadata.statistics.__isset.min,
         metadata.statistics.__isset.null_count, metadata.statistics.__isset.distinct_count);
   }
   ```
   
   For the Arrow write file, this is ok, however, when `!metadata.statistics.__isset.null_count`, the `num_values` would be wrong. Currently I don't know how to fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1237059751


##########
cpp/src/parquet/statistics.h:
##########
@@ -166,13 +167,13 @@ class PARQUET_EXPORT EncodedStatistics {
   void set_is_signed(bool is_signed) { is_signed_ = is_signed; }
 
   EncodedStatistics& set_max(const std::string& value) {
-    *max_ = value;
+    max_ = value;

Review Comment:
   If you want to further micro-optimize:
   ```c++
     EncodedStatistics& set_max(std::string value) {
       max_ = std::move(value);
       has_max = true;
       return *this;
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1602983417

   Emmm I've tried it, I think `metadata.__isset.statistics && !metadata.statistics.__isset.null_count` might means it has null but not set it's count. We use `metadata.num_values - metadata.statistics.null_count` to get `num_values` in `Statistics` (`num_values` is just used to build page index )
   
    If all writer can make sure that `metadata.__isset.statistics && !metadata.statistics.__isset.null_count` means no nulls, the currently implemention is right. If we regard `num_values` in `Statistics` as "wrongable", it's also ok.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1234058063


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1346,7 +1346,12 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
 
   EncodedStatistics GetChunkStatistics() override {
     EncodedStatistics result;
-    if (chunk_statistics_) result = chunk_statistics_->Encode();
+    if (chunk_statistics_) {
+      if (has_dictionary_ && !fallback_) {
+        chunk_statistics_->SetDistinctCount(current_dict_encoder_->num_entries());

Review Comment:
   Ok so let me just remove it. I can use `distinct_count` to inspect dictionary, but I guess it's really useless



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1601372945

   Emm why gandiva test failed...
   
   ```
   [----------] 2 tests from TestExpressionRegistry
   [ RUN      ] TestExpressionRegistry.VerifySupportedFunctions
   /Users/runner/work/arrow/arrow/cpp/src/gandiva/expression_registry_test.cc:48: Failure
   Expected: (element) != (functions.end()), actual: 8-byte object <40-9D 8C-24 E1-7F 00-00> vs 8-byte object <40-9D 8C-24 E1-7F 00-00>
   function signature date64[ms] date_diff(date64[ms], int64) missing in supported functions.
   
   /Users/runner/work/arrow/arrow/cpp/src/gandiva/expression_registry_test.cc:48: Failure
   Expected: (element) != (functions.end()), actual: 8-byte object <40-9D 8C-24 E1-7F 00-00> vs 8-byte object <40-9D 8C-24 E1-7F 00-00>
   function signature timestamp[ms] date_diff(timestamp[ms], int64) missing in supported functions.
   
   [  FAILED  ] TestExpressionRegistry.VerifySupportedFunctions (166 ms)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1237393575


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    // NOTE: currently, has_null_count_ in Statistics is always true.
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252484562


##########
cpp/src/parquet/statistics.cc:
##########
@@ -609,9 +618,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     }
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    // Currently, distinct count would not be written.
+    // So, not call set_distinct_count.

Review Comment:
   Sorry my bad! Probably change the comment into a `TODO` or `FIXME`? Something like:
   
   // TODO: distinct count is enabled and thus not encoded for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1232972393


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1346,7 +1346,12 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
 
   EncodedStatistics GetChunkStatistics() override {
     EncodedStatistics result;
-    if (chunk_statistics_) result = chunk_statistics_->Encode();
+    if (chunk_statistics_) {
+      if (has_dictionary_ && !fallback_) {
+        chunk_statistics_->SetDistinctCount(current_dict_encoder_->num_entries());

Review Comment:
   @pitrou I think this is the proper way to set distinct count. Another method is
   
   ```c++
   if (chunk_statistics_) { 
     result = chunk_statistics_->Encode();
     if (has_dictionary_ && !fallback_) {
       result.set_distinct_count(...);
     }
   }
   ```
   
   Do you think this is ok?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1597137181

   (and, yes, `distinct_count = 0` is obviously wrong here :-))


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1245458806


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +380,79 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+    EXPECT_FALSE(statistics1->HasNullCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    EXPECT_FALSE(s1->HasNullCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());
+    EXPECT_FALSE(total->HasNullCount());
+  }
+
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+
+    {
+      auto page_statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      std::vector<uint8_t> valid_bits(
+          bit_util::BytesForBits(static_cast<uint32_t>(this->values_.size())) + 1, 0);

Review Comment:
   `valid_bits` is not even used, is it?



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +380,79 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+    EXPECT_FALSE(statistics1->HasNullCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    EXPECT_FALSE(s1->HasNullCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());
+    EXPECT_FALSE(total->HasNullCount());
+  }
+
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+
+    {
+      auto page_statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      std::vector<uint8_t> valid_bits(
+          bit_util::BytesForBits(static_cast<uint32_t>(this->values_.size())) + 1, 0);
+      page_statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                               /*null_count*/ this->values_.size());
+      auto encoded_stats1 = page_statistics1->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+
+      chunk_statistics->Merge(*page_statistics1);
+      encoded_stats1 = chunk_statistics->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    {
+      auto page_statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      page_statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+
+      chunk_statistics->Merge(*page_statistics2);
+      auto encoded_stats2 = chunk_statistics->Encode();
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);

Review Comment:
   Hmm... why are those true? Also, why not also check the actual min/max values?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1607718489

   @pitrou @wgtmac I've fix the comments, mind take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1245504781


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +380,79 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+    EXPECT_FALSE(statistics1->HasNullCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    EXPECT_FALSE(s1->HasNullCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());
+    EXPECT_FALSE(total->HasNullCount());
+  }
+
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+
+    {
+      auto page_statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      std::vector<uint8_t> valid_bits(
+          bit_util::BytesForBits(static_cast<uint32_t>(this->values_.size())) + 1, 0);
+      page_statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                               /*null_count*/ this->values_.size());
+      auto encoded_stats1 = page_statistics1->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+
+      chunk_statistics->Merge(*page_statistics1);
+      encoded_stats1 = chunk_statistics->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    {
+      auto page_statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      page_statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+
+      chunk_statistics->Merge(*page_statistics2);
+      auto encoded_stats2 = chunk_statistics->Encode();
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);

Review Comment:
   Or we can distinct from `all is null or nan` and `not has min-max, cannot be merged`. I prefer to compatible with previous currently



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1582101050

   cc @wgtmac @pitrou would you mind take a look? Since `Statistics` is released, I'm afraid my change will break some invariant...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1241374931


##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // Currently, this constructor will tend to be called by ColumnWriter to create the

Review Comment:
   I still don't think comment with `currently` and `tend` is a good idea as it may easily be out of sync and confuses users. It would be good enough to simply describe its supposed use case.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -543,7 +557,7 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     ResetCounts();
     has_min_max_ = false;
     has_distinct_count_ = false;
-    has_null_count_ = false;
+    has_null_count_ = true;

Review Comment:
   What about adding something like `ResetHasFlags()` below?
   ```cpp
   void ResetHasFlags() {
     has_min_max_ = false;
     has_distinct_count_ = false;
     has_null_count_ = true;
   }
   ```
   We can call it in the constructors and Reset() to make sure the initial states are consistent.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +568,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.

Review Comment:
   ```suggestion
       // into column chunk statistics, so it tends to have null count.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1230364650


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.

Review Comment:
   I can discard it and add support for it using `set` later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1237057036


##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // Currently, this function will tend to be called by ColumnWriter to create the

Review Comment:
   Nit (and same below)
   ```suggestion
     // Currently, this constructor will tend to be called by ColumnWriter to create the
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1601378812

   > Emm why gandiva test failed...
   
   That happens from time to time on the macOS CI, it's unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1222625483


##########
cpp/src/parquet/statistics.h:
##########
@@ -149,10 +148,10 @@ class PARQUET_EXPORT EncodedStatistics {
   // the true minimum for aggregations and there is no way to mark that a
   // value has been truncated and is a lower bound and not in the page.
   void ApplyStatSizeLimits(size_t length) {
-    if (max_->length() > length) {
+    if (max_.length() > length) {

Review Comment:
   Should we clear `max_` as well?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // Currently, this function will call by ColumnWriter to create the

Review Comment:
   ```suggestion
     // Currently, this function will be called by ColumnWriter to create the
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   nit: has_distinct_count_ is set to false in `TypedStatisticsImpl::Reset();`



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {

Review Comment:
   Should we check this->has_null_count_ as well?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;
   }
 
+  // Currently, this function will call by testing.
   TypedStatisticsImpl(const T& min, const T& max, int64_t num_values, int64_t null_count,
                       int64_t distinct_count)
       : pool_(default_memory_pool()),
         min_buffer_(AllocateBuffer(pool_, 0)),
         max_buffer_(AllocateBuffer(pool_, 0)) {
     TypedStatisticsImpl::IncrementNumValues(num_values);
     TypedStatisticsImpl::IncrementNullCount(null_count);
-    IncrementDistinctCount(distinct_count);
+    SetDistinctCount(distinct_count);
 
     Copy(min, &min_, min_buffer_.get());
     Copy(max, &max_, max_buffer_.get());
     has_min_max_ = true;
   }
 
+  // Currently, this function will call by ColumnChunkMetaData during read.

Review Comment:
   ```suggestion
     // Currently, this function will only be called by ColumnChunkMetaData during read.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.
+    if (has_distinct_count_) {
+      has_distinct_count_ = false;
     }
+    // If !other.HasMinMax, might be all-nulls or nulls and nan,

Review Comment:
   We cannot differentiate at least two states:
   - (1) the stats has not seen any value except null or nan, so it does not have min/max
   - (2) the stats has seen some valid values but min/max are truncated due to size limit
   
   So if currently the stats is in the state (2), we should not merge min/max from the incoming stats



##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;
   }
 
+  // Currently, this function will call by testing.

Review Comment:
   ```suggestion
     // Currently, this function will only be called by testing.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1234041242


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1346,7 +1346,12 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
 
   EncodedStatistics GetChunkStatistics() override {
     EncodedStatistics result;
-    if (chunk_statistics_) result = chunk_statistics_->Encode();
+    if (chunk_statistics_) {
+      if (has_dictionary_ && !fallback_) {
+        chunk_statistics_->SetDistinctCount(current_dict_encoder_->num_entries());

Review Comment:
   I _think_ it's ok. But I don't think it's very important to write the distinct count, if we're only able to write it for dictionary-encoded chunks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1237552645


##########
cpp/src/parquet/statistics.cc:
##########
@@ -543,7 +557,7 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     ResetCounts();
     has_min_max_ = false;
     has_distinct_count_ = false;
-    has_null_count_ = false;
+    has_null_count_ = true;

Review Comment:
   page statistics will call `Reset`, this will clear `has_null_count_`, however, this time it should have `has_null_count_ = 0`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1238768282


##########
cpp/src/parquet/statistics.cc:
##########
@@ -471,35 +473,47 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     auto comp = Comparator::Make(descr);
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
+    // Currently, writer will always write `distinct_count` to EncodedStatistics.

Review Comment:
   Yes, my fault



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1602956858

   @mapleFU I would request that you add detailed comments to explain the non-trivial changes here. We don't want to misunderstand this again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1245508465


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +380,79 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+    EXPECT_FALSE(statistics1->HasNullCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    EXPECT_FALSE(s1->HasNullCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());
+    EXPECT_FALSE(total->HasNullCount());
+  }
+
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+
+    {
+      auto page_statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      std::vector<uint8_t> valid_bits(
+          bit_util::BytesForBits(static_cast<uint32_t>(this->values_.size())) + 1, 0);
+      page_statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                               /*null_count*/ this->values_.size());
+      auto encoded_stats1 = page_statistics1->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+
+      chunk_statistics->Merge(*page_statistics1);
+      encoded_stats1 = chunk_statistics->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    {
+      auto page_statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      page_statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+
+      chunk_statistics->Merge(*page_statistics2);
+      auto encoded_stats2 = chunk_statistics->Encode();
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);

Review Comment:
   ```c++
     void SetMinMaxPair(std::pair<T, T> min_max) {
       // CleanStatistic can return a nullopt in case of erroneous values, e.g. NaN
       auto maybe_min_max = CleanStatistic(min_max);
       if (!maybe_min_max) return;
       auto min = maybe_min_max.value().first;
       auto max = maybe_min_max.value().second;
       if (!has_min_max_) {
         has_min_max_ = true;
         Copy(min, &min_, min_buffer_.get());
         Copy(max, &max_, max_buffer_.get());
       } else {
         Copy(comparator_->Compare(min_, min) ? min_ : min, &min_, min_buffer_.get());
         Copy(comparator_->Compare(max_, max) ? max : max_, &max_, max_buffer_.get());
       }
     }
   ```
   
   If `other` has min-max, our code will always set `has_min_max_ = true` here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #35989: GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1634089457

   Conbench analyzed the 6 benchmark runs on commit `60fdc255`.
   
   There were 4 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-07-10 11:16:53Z](http://conbench.ursa.dev/compare/runs/438eab56ace443febcd2f8882004c20e...1ce551ff811b45d8b9fbb7758bf1c058/)
     - [source=cpp-micro, suite=parquet-level-conversion-benchmark](http://conbench.ursa.dev/compare/benchmarks/064abbfea418770d80007ab8f6737bd9...064abe95ceed71a38000e7b302f723a9)
     - [source=cpp-micro, suite=parquet-level-conversion-benchmark](http://conbench.ursa.dev/compare/benchmarks/064abbfea2f172318000753271cbad9b...064abe95cdce7d85800082f214c30f06)
   - and 2 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/15011499682) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252492049


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics read from thrift, it might not have null_count,
+  // and page merge with it should also not have null_count.
+  void TestMergeNullCount() {
+    this->GenerateData(1000);
+
+    // Page should have null-count even if no nulls
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/this->values_.size(),
+                          /*null_count=*/0);
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_TRUE(statistics1->HasNullCount());
+      EXPECT_EQ(0, statistics1->null_count());
+      EXPECT_TRUE(statistics1->Encode().has_null_count);
+    }
+    // Merge with null-count should also have null count
+    VerifyMergedStatistics(*statistics1, *statistics1,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasNullCount());
+                             EXPECT_EQ(0, merged_statistics->null_count());
+                             auto encoded = merged_statistics->Encode();
+                             EXPECT_TRUE(encoded.has_null_count);
+                             EXPECT_EQ(0, encoded.null_count);
+                           });
+
+    // When loaded from thrift, might not have null count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_null_count = false;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics2->Encode().has_null_count);
+      EXPECT_FALSE(statistics2->HasNullCount());
+    }
+
+    // Merge without null-count should not have null count
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasNullCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_null_count);
+                           });
+  }
+
+  // statistics.all_null_value is used to build the page index.
+  // If statistics doesn't have null count, all_null_value should be false.
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);

Review Comment:
   Check other attributes of `encoded` as well?



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics read from thrift, it might not have null_count,
+  // and page merge with it should also not have null_count.
+  void TestMergeNullCount() {
+    this->GenerateData(1000);
+
+    // Page should have null-count even if no nulls
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/this->values_.size(),
+                          /*null_count=*/0);
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_TRUE(statistics1->HasNullCount());
+      EXPECT_EQ(0, statistics1->null_count());
+      EXPECT_TRUE(statistics1->Encode().has_null_count);
+    }
+    // Merge with null-count should also have null count
+    VerifyMergedStatistics(*statistics1, *statistics1,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasNullCount());
+                             EXPECT_EQ(0, merged_statistics->null_count());
+                             auto encoded = merged_statistics->Encode();
+                             EXPECT_TRUE(encoded.has_null_count);
+                             EXPECT_EQ(0, encoded.null_count);
+                           });
+
+    // When loaded from thrift, might not have null count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_null_count = false;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics2->Encode().has_null_count);
+      EXPECT_FALSE(statistics2->HasNullCount());
+    }
+
+    // Merge without null-count should not have null count
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasNullCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_null_count);
+                           });
+  }
+
+  // statistics.all_null_value is used to build the page index.
+  // If statistics doesn't have null count, all_null_value should be false.
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+};
+
+TYPED_TEST_SUITE(TestStatisticsHasFlag, Types);
+
+TYPED_TEST(TestStatisticsHasFlag, MergeDistinctCount) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeDistinctCount());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, MergeNullCount) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeNullCount());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, TestMergeMinMax) {

Review Comment:
   ```suggestion
   TYPED_TEST(TestStatisticsHasFlag, MergeMinMax) {
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());

Review Comment:
   ```suggestion
                                 const std::function<void(TypedStatistics<TestType>*)>& test_fn) {
       ASSERT_NO_FATAL_FAILURE(test_fn(MergeStatistics(stats1, stats2).get()));
       ASSERT_NO_FATAL_FAILURE(test_fn(MergeStatistics(stats2, stats1).get()));
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {

Review Comment:
   It would be good if we can test all flags in these cases even when the main purpose of a specific case is not for those flags.



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(

Review Comment:
   ```suggestion
     std::shared_ptr<TypedStatistics<TestType>> MergeStatistics(
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics read from thrift, it might not have null_count,
+  // and page merge with it should also not have null_count.
+  void TestMergeNullCount() {
+    this->GenerateData(1000);
+
+    // Page should have null-count even if no nulls
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/this->values_.size(),
+                          /*null_count=*/0);
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_TRUE(statistics1->HasNullCount());
+      EXPECT_EQ(0, statistics1->null_count());
+      EXPECT_TRUE(statistics1->Encode().has_null_count);
+    }
+    // Merge with null-count should also have null count
+    VerifyMergedStatistics(*statistics1, *statistics1,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasNullCount());
+                             EXPECT_EQ(0, merged_statistics->null_count());
+                             auto encoded = merged_statistics->Encode();
+                             EXPECT_TRUE(encoded.has_null_count);
+                             EXPECT_EQ(0, encoded.null_count);
+                           });
+
+    // When loaded from thrift, might not have null count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_null_count = false;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics2->Encode().has_null_count);
+      EXPECT_FALSE(statistics2->HasNullCount());
+    }
+
+    // Merge without null-count should not have null count
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasNullCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_null_count);
+                           });
+  }
+
+  // statistics.all_null_value is used to build the page index.
+  // If statistics doesn't have null count, all_null_value should be false.
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+};
+
+TYPED_TEST_SUITE(TestStatisticsHasFlag, Types);
+
+TYPED_TEST(TestStatisticsHasFlag, MergeDistinctCount) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeDistinctCount());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, MergeNullCount) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeNullCount());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, TestMergeMinMax) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeMinMax());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, NotHasNullValues) {
+  ASSERT_NO_FATAL_FAILURE(this->TestNotHasNullValue());

Review Comment:
   ```suggestion
   TYPED_TEST(TestStatisticsHasFlag, MissingNullCount) {
     ASSERT_NO_FATAL_FAILURE(this->TestMissingNullCount());
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -609,9 +618,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     }
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    // Currently, distinct count would not be written.
+    // So, not call set_distinct_count.

Review Comment:
   Sorry my bad! Probably change the comment into a `TODO` or `FIXME`? Something like:
   
   // TODO: distinct count is not encoded for now.



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());

Review Comment:
   CMIW, directly calling `.get()` on the shared_ptr may introduce lifetime issue if the returned shared_ptr object is released before fn uses the raw pointer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252200622


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   I know what you mean. This is a bit confusing:
   
   1. `IncrementNullCount` will set `has_null_count_` to true ( but I think it's not neccessary ). The `page_statistics_` would call `IncrementNullCount` to set it( although at the first time `has_null_count_` would be true for it
   2. `chunk_statistics_` will only call `merge`, once it meets a `Statistics` without `has_null_count_`, it's `has_null_count_` would never be true.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   I know what you mean. This is a bit confusing:
   
   1. `IncrementNullCount` will set `has_null_count_` to true ( but I think it's not neccessary ). The `page_statistics_` would call `IncrementNullCount` to set it( although at the first time `has_null_count_` would be true for it)
   2. `chunk_statistics_` will only call `merge`, once it meets a `Statistics` without `has_null_count_`, it's `has_null_count_` would never be true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252197422


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   > If Merge is called twice with other.HasNullCount()=false and then other.HasNullCount()=true, here may have problem. 
   
   Why? If `other.HasNullCount()` called, the flag would never be true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1602834116

   @pitrou @wgtmac Mind take a look? I've update the logic for `num_values` within `Statistics`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1238749226


##########
cpp/src/parquet/metadata.cc:
##########
@@ -87,20 +87,29 @@ std::string ParquetVersionToString(ParquetVersion::type ver) {
 template <typename DType>
 static std::shared_ptr<Statistics> MakeTypedColumnStats(
     const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
+  DCHECK(metadata.__isset.statistics);
+  int64_t num_non_null_values = 0;
+  if (metadata.statistics.__isset.null_count) {
+    num_non_null_values = metadata.num_values - metadata.statistics.null_count;
+  } else {
+    // NOTE: statistics should `metadata.statistics.__isset.null_count`
+    // has a invalid non-null value number.
+    num_non_null_values = metadata.num_values;

Review Comment:
   Why change this? `has_null_count` will be false anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1238750566


##########
cpp/src/parquet/statistics.cc:
##########
@@ -471,35 +473,47 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     auto comp = Comparator::Make(descr);
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
+    // Currently, writer will always write `distinct_count` to EncodedStatistics.

Review Comment:
   Do you mean `null_count`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229914958


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   @pitrou
   
   1. In theory, `has_null_count_` is not added able if another statistics has `has_null_count_ = true`
   2. However, in our implementions, builder will always set `has_null_count_ == true`
   
   Maybe I can use the same way to treat them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1596204490

   > data = [1, 2, 2, None, 4], type = DataType(uint16), physical_type = 'INT32'
   min_value = 1, max_value = 4, null_count = 1, num_values = 4, distinct_count = 0
   
   After my reflect, the `distinct_count` need to be fixed. Can you confirm that @pitrou ?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1613450739

   @pitrou Sorry for writing ugly test yesterday, I didn't make clear the required testing. Now I tried to organize them and added some comments. Would you mind take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252195411


##########
cpp/src/parquet/statistics.cc:
##########
@@ -609,9 +618,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     }
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    // Currently, distinct count would not be written.
+    // So, not call set_distinct_count.

Review Comment:
   Hmmm that's merge. I think Merge is different from Reset



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252537447


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());

Review Comment:
   > CMIW, directly calling .get() on the shared_ptr may introduce lifetime issue if the returned shared_ptr object is released before fn uses the raw pointer.
   
   See https://en.cppreference.com/w/cpp/language/lifetime#Temporary_object_lifetime
   
   > All temporary objects are destroyed as the last step in evaluating the [full-expression](https://en.cppreference.com/w/cpp/language/expressions#Full-expressions) that (lexically) contains the point where they were created, and if multiple temporary objects were created, they are destroyed in the order opposite to the order of creation. This is true even if that evaluation ends in throwing an exception.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229897390


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.

Review Comment:
   https://github.com/apache/arrow/commit/43c7154894f819e617dbf3d669dbe1185f05893b#diff-f8a090a40e214aba5ebff534411a9febf213b725597afb750395d2a7483d9bc1
   
   I guess these code is written in 2016 and never changed the syntax for add distinct count. Maybe it's designed like dictionary in Arrow IPC Streaming?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229893119


##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // Currently, this function will be called by ColumnWriter to create the
+  // statistics collector during write.

Review Comment:
   I guess it's ok but it makes cleaning the statistics much more complex...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229842212


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_distinct_count = false;
+    // NOTE: currently, has_null_count_ in Statistics is always true.
+    auto statistics1 =
+        Statistics::Make(this->schema_.Column(0), &encoded_statistics1, 1000);

Review Comment:
   Can you add the parameter name for readability?
   ```suggestion
           Statistics::Make(this->schema_.Column(0), &encoded_statistics1, /*xxx=*/ 1000);
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_distinct_count = false;
+    // NOTE: currently, has_null_count_ in Statistics is always true.
+    auto statistics1 =
+        Statistics::Make(this->schema_.Column(0), &encoded_statistics1, 1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 =
+        Statistics::Make(this->schema_.Column(0), &encoded_statistics2, 1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());

Review Comment:
   Also perhaps
   ```suggestion
       EXPECT_FALSE(total->HasDistinctCount());
       EXPECT_FALSE(total->HasMinMax());
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // Currently, this function will be called by ColumnWriter to create the
+  // statistics collector during write.

Review Comment:
   These constructors can be invoked indirectly by third-party code through `MakeStatistics`, so the added comments seem misleading.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {

Review Comment:
   This is really a pointless micro-optimization, IMHO. We are not in a critical loop.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -602,6 +617,7 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   }
 
   EncodedStatistics Encode() override {
+    // Currently, distinct_count will never be set in EncodedStatistics.

Review Comment:
   Why? Should we fix this? It doesn't seem right.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   It doesn't seem consistent to set `has_null_count_` and `has_distinct_count_` to different values by default. Why this change?



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_distinct_count = false;

Review Comment:
   This doesn't test the semantic change you're making in this PR, is it?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.

Review Comment:
   I understand the reasoning, but this might break existing usage. Perhaps we should instead document that the distinct count can be an upper bound?
   
   Can you use `git log` or `git blame` to find out who touched this code previously, and ping them to get their opinion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1231910378


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   Ok, fair enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1232972212


##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // Currently, this function will be called by ColumnWriter to create the
+  // statistics collector during write.

Review Comment:
   I change *will be* to *will tend to be*, I think this is more proper here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1242452436


##########
cpp/src/parquet/statistics.cc:
##########
@@ -627,7 +640,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   T min_;
   T max_;
   ::arrow::MemoryPool* pool_;
-  int64_t num_values_ = 0;  // # of non-null values.
+  // # of non-null values.
+  // num_values_ would be reliable when has_null_count_,
+  // if it's created from a page thrift statistics, and statistics
+  // doesn't have null_count, but page has null data, `num_values_`
+  // would be equal to value including nulls.

Review Comment:
   Tried it but maybe I'm not good for docs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1238754602


##########
cpp/src/parquet/statistics.cc:
##########
@@ -610,8 +629,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    if (HasNullCount()) {

Review Comment:
   Why would `num_values` be wrong?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1245493622


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +380,79 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+    EXPECT_FALSE(statistics1->HasNullCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    EXPECT_FALSE(s1->HasNullCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());
+    EXPECT_FALSE(total->HasNullCount());
+  }
+
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+
+    {
+      auto page_statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      std::vector<uint8_t> valid_bits(
+          bit_util::BytesForBits(static_cast<uint32_t>(this->values_.size())) + 1, 0);

Review Comment:
   I'll remove it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1611764690

   Emmm what kind of test should I added here?
   1. I test the behavior when `ResetFlag` is called
   2. `num_distinct_value` and other being merged
   3. `min-max` beging merged
   4. `null-value`
   
   Should I then add:
   1. `null-count` and merge
   2. behavior of different constructor
   
   Is that enough?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1237061771


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    // NOTE: currently, has_null_count_ in Statistics is always true.
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());

Review Comment:
   Can you tell the null count as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1237561942


##########
cpp/src/parquet/statistics.cc:
##########
@@ -610,8 +629,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    if (HasNullCount()) {

Review Comment:
   This refactor is a bit tricky. When `ColumnChunkMetadata` read a statistics from thrift, and if the thrift not have `null_count`, the `num_values` would be wrong. And in C++ Parquet, we'll never meet this case, because writer always has `has_null_count_`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1238768978


##########
cpp/src/parquet/statistics.cc:
##########
@@ -610,8 +629,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    if (HasNullCount()) {

Review Comment:
   Can a `!metadata.statistics.__isset.null_count` might means:
   
   1. The underlying data has null
   2. But it doesn't write it into statistics
   
   If it can, when `!metadata.statistics.__isset.null_count`, `metadata.num_values - metadata.statistics.null_count` cannot represent the non-null count



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1607822392

   I've fixed the comments and added some tests @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1237394411


##########
cpp/src/parquet/metadata.cc:
##########
@@ -87,20 +87,28 @@ std::string ParquetVersionToString(ParquetVersion::type ver) {
 template <typename DType>
 static std::shared_ptr<Statistics> MakeTypedColumnStats(
     const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
+  DCHECK(metadata.__isset.statistics);
+  int64_t num_values = 0;

Review Comment:
   I've handle it here. Maybe it's not right. But I think it's compatible with previous code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1238745822


##########
cpp/src/parquet/metadata.cc:
##########
@@ -87,20 +87,29 @@ std::string ParquetVersionToString(ParquetVersion::type ver) {
 template <typename DType>
 static std::shared_ptr<Statistics> MakeTypedColumnStats(
     const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
+  DCHECK(metadata.__isset.statistics);
+  int64_t num_non_null_values = 0;
+  if (metadata.statistics.__isset.null_count) {
+    num_non_null_values = metadata.num_values - metadata.statistics.null_count;
+  } else {
+    // NOTE: statistics should `metadata.statistics.__isset.null_count`
+    // has a invalid non-null value number.

Review Comment:
   What is a "invalid non-null value number"? I don't understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229429805


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   Yes this change the syntax for `has_distinct_count_`, but currently `has_distinct_count_` is never used by writer...



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.
+    if (has_distinct_count_) {
+      has_distinct_count_ = false;
     }
+    // If !other.HasMinMax, might be all-nulls or nulls and nan,

Review Comment:
   The interesting thing is, currently, "merge" is only called by writer, and writer is always the case (1)...
   @pitrou Do you have any idea here? Since `Statistics` is already public.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {

Review Comment:
   if `!this->has_null_count_`, adding a number is non-sense, so I think it's ok to keep it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229925687


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.

Review Comment:
   Hmm... ok. I took a look at parquet-mr and the Rust Parquet implementation, neither of them seems to handle `distinct_count`. So I guess this is ok.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229929061


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +560,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
-    if (other.HasNullCount()) {
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it tent to have null count.
+    if (ARROW_PREDICT_TRUE(other.HasNullCount())) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    // Distinct count cannot be merged.
+    if (has_distinct_count_) {
+      has_distinct_count_ = false;
     }
+    // If !other.HasMinMax, might be all-nulls or nulls and nan,

Review Comment:
   I think we can keep the code as-is indeed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1254663242


##########
cpp/src/parquet/statistics.cc:
##########
@@ -609,9 +618,10 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     }
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    // FIXME(mwish): distinct count is not encoded for now.

Review Comment:
   ```suggestion
       // TODO (GH-36505): distinct count is not encoded for now.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1597136354

   @mapleFU How was this data produced?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1597157730

   From here: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47327471
   Let me draft the fix the testing here...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1229891643


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   Emmm currently writer **never** set `has_distinct_count_`, so I change it as disabled. In my opinion, user **should** explicit set `has_distinct_count_` when building, like:
   
   ```c++
   FinishColumnChunk() {
     if(!fallback_) {
        set_distinct_count()
     }
   }
   ```
   
   The current `has_distinct_count_` is absolutely confusing...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35989:
URL: https://github.com/apache/arrow/pull/35989#issuecomment-1611551727

   Is there any other things I need to fix?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1245495218


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +380,79 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {
+    EncodedStatistics encoded_statistics1;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+
+    EXPECT_FALSE(statistics1->HasMinMax());
+    EXPECT_FALSE(statistics1->HasDistinctCount());
+    EXPECT_FALSE(statistics1->HasNullCount());
+
+    EncodedStatistics encoded_statistics2;
+    encoded_statistics2.has_distinct_count = true;
+    encoded_statistics2.distinct_count = 500;
+    auto statistics2 = Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                                        /*num_values=*/1000);
+
+    EXPECT_FALSE(statistics2->HasMinMax());
+    EXPECT_TRUE(statistics2->HasDistinctCount());
+    EXPECT_FALSE(s1->HasNullCount());
+    auto s2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics2);
+
+    auto total = MakeStatistics<TestType>(this->schema_.Column(0));
+    total->Merge(*s1);
+    total->Merge(*s2);
+
+    EXPECT_FALSE(total->HasDistinctCount());
+    EXPECT_FALSE(total->HasMinMax());
+    EXPECT_EQ(2000, total->num_values());
+    EXPECT_FALSE(total->HasNullCount());
+  }
+
+  void TestNotHasNullValue() {
+    EncodedStatistics encoded_statistics1;
+    encoded_statistics1.has_null_count = false;
+    auto statistics1 = Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                                        /*num_values=*/1000);
+    auto s1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics1);
+    EXPECT_FALSE(s1->HasNullCount());
+    auto encoded = s1->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+  }
+
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+
+    {
+      auto page_statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      std::vector<uint8_t> valid_bits(
+          bit_util::BytesForBits(static_cast<uint32_t>(this->values_.size())) + 1, 0);
+      page_statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                               /*null_count*/ this->values_.size());
+      auto encoded_stats1 = page_statistics1->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+
+      chunk_statistics->Merge(*page_statistics1);
+      encoded_stats1 = chunk_statistics->Encode();
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    {
+      auto page_statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      page_statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+
+      chunk_statistics->Merge(*page_statistics2);
+      auto encoded_stats2 = chunk_statistics->Encode();
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);

Review Comment:
   Assume there a two pages, one is `[null, null, ..., null]`, so it doesn't has min-max
   The second one is `[1, 1, 1, ..., 1]`, so, finally the chunk statistics would has min-max.
   This is same as the behavior in `main` branch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1238765709


##########
cpp/src/parquet/metadata.cc:
##########
@@ -87,20 +87,29 @@ std::string ParquetVersionToString(ParquetVersion::type ver) {
 template <typename DType>
 static std::shared_ptr<Statistics> MakeTypedColumnStats(
     const format::ColumnMetaData& metadata, const ColumnDescriptor* descr) {
+  DCHECK(metadata.__isset.statistics);
+  int64_t num_non_null_values = 0;
+  if (metadata.statistics.__isset.null_count) {
+    num_non_null_values = metadata.num_values - metadata.statistics.null_count;
+  } else {
+    // NOTE: statistics should `metadata.statistics.__isset.null_count`
+    // has a invalid non-null value number.

Review Comment:
   Can a `!metadata.statistics.__isset.null_count` might means:
   1. The underlying data has null
   2. But it doesn't write it into `statistics`
   
   If it can, when `!metadata.statistics.__isset.null_count`, `metadata.num_values - metadata.statistics.null_count` cannot represent the non-null count



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1236228812


##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;

Review Comment:
   nit: add a comment here to explain why has_null_count_ and has_distinct_count_ have different defaults as we have discussed here



##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;
   }
 
+  // Currently, this function will tend to be called by testing.

Review Comment:
   Ditto, we can say it is used to create stats from provided values.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // Currently, this function will be called by ColumnWriter to create the
+  // statistics collector during write.

Review Comment:
   I'd rather not to say who will call this. Instead, we can simply say it is used to create an empty stats and update it from follow-up writes.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
     TypedStatisticsImpl::Reset();
     has_null_count_ = true;
-    has_distinct_count_ = true;
+    has_distinct_count_ = false;
   }
 
+  // Currently, this function will tend to be called by testing.
   TypedStatisticsImpl(const T& min, const T& max, int64_t num_values, int64_t null_count,
                       int64_t distinct_count)
       : pool_(default_memory_pool()),
         min_buffer_(AllocateBuffer(pool_, 0)),
         max_buffer_(AllocateBuffer(pool_, 0)) {
     TypedStatisticsImpl::IncrementNumValues(num_values);
     TypedStatisticsImpl::IncrementNullCount(null_count);
-    IncrementDistinctCount(distinct_count);
+    SetDistinctCount(distinct_count);
 
     Copy(min, &min_, min_buffer_.get());
     Copy(max, &max_, max_buffer_.get());
     has_min_max_ = true;
   }
 
+  // Currently, this function will tend to be called by ColumnChunkMetaData during read.

Review Comment:
   Ditto, we can say it is used to create stats from a thrift Statistics object.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1251803477


##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  public:
   using T = typename DType::c_type;
 
+  // This constructor would likely be called by ColumnWriter to create the
+  // statistics collector during write.

Review Comment:
   ```suggestion
     // Create an empty stats.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.
     if (other.HasNullCount()) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
-    }
+    // Distinct count cannot be merged.

Review Comment:
   ```suggestion
       // Clear has_distinct_count_ as distinct count cannot be merged.
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If a page is all nulls or nan, min and max should be set to null.
+  // And the all-null page being merged with a page having min-max,
+  // the result should have min-max.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count*/ this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics read from thrift, it might not have null_count,
+  // and page merge with it should also not have null_count.
+  void TestMergeNullCount() {
+    this->GenerateData(1000);
+
+    // Page should have null-count even if no nulls
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/this->values_.size(),
+                          /*null_count*/ 0);

Review Comment:
   ```suggestion
                             /*null_count=*/ 0);
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If a page is all nulls or nan, min and max should be set to null.
+  // And the all-null page being merged with a page having min-max,
+  // the result should have min-max.

Review Comment:
   ```suggestion
     // If all values in a page are null or nan, its stats should not set min-max.
     // Merging its stats with another page having good min-max stats should not
     // drop the valid min-max from the latter page.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -627,7 +638,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   T min_;
   T max_;
   ::arrow::MemoryPool* pool_;
-  int64_t num_values_ = 0;  // # of non-null values.
+  // Number of non-null values.
+  // num_values_ would be reliable when has_null_count_.
+  // If statistics is created from a page thrift statistics, and statistics
+  // doesn't have null_count, `num_values_` may include null values.

Review Comment:
   ```suggestion
     // Number of non-null values.
     // Please note that num_values_ is reliable when has_null_count_ is set.
     // When has_null_count_ is not set, e.g. a page statistics created from
     // a statistics thrift message which doesn't have the optional null_count, 
     // `num_values_` may include null values.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -609,9 +618,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     }
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    // Currently, distinct count would not be written.
+    // So, not call set_distinct_count.

Review Comment:
   Move these comment to line 567 above?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   Additionally, I don't think we need to add comment above. We don't know how external users call `Statistics::Merge()`. Rather we can comment that it tries to maintain every field at its best effort.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   If Merge is called twice with other.HasNullCount()=false and then other.HasNullCount()=true, here may have problem. Why not checking this->has_null_count_ first? Probably we need add a test case to cover the above case?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.
     if (other.HasNullCount()) {
       this->statistics_.null_count += other.null_count();
+    } else {
+      this->has_null_count_ = false;
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
-    }
+    // Distinct count cannot be merged.
+    has_distinct_count_ = false;
+    // If !other.HasMinMax, might be all-nulls or nulls and nan,
+    // so, not clear `this->has_min_max_` here.

Review Comment:
   ```suggestion
       // Do not clear min/max here if the other side does not provide
       // min/max which may happen when other is an empty stats or all
       // its values are null and/or NaN.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -648,6 +664,16 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     this->num_values_ = 0;
   }
 
+  void ResetHasFlags() {
+    this->has_min_max_ = false;
+    // writer will write `distinct_count` to EncodedStatistics.
+    // So, disable it when reset.
+    this->has_distinct_count_ = false;
+    // writer will always write `null_count` to EncodedStatistics.
+    // So, enable it when reset.
+    this->has_null_count_ = true;

Review Comment:
   ```suggestion
       // has_min_max_ will only be set when it meets any valid value.
       this->has_min_max_ = false;
       // has_distinct_count_ will only be set once SetDistinctCount()
       // is called because distinct count calculation is not cheap and
       // disabled by default.
       this->has_distinct_count_ = false;
       // Null count calculation is cheap and enabled by default.
       this->has_null_count_ = true;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252548387


##########
cpp/src/parquet/statistics.cc:
##########
@@ -552,12 +556,17 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
+    // Merge always runs when Merge builder's page statistics
+    // into column chunk statistics, so it usually has null count.

Review Comment:
   During test I've test that both `!has_null_count_ merge with has_null_count_` and `has_null_count_ merge with !has_null_count_`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1252551563


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,173 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> BuildMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(const TypedStatistics<TestType>& stats1,
+                              const TypedStatistics<TestType>& stats2,
+                              const std::function<void(TypedStatistics<TestType>*)>& fn) {
+    fn(BuildMergedStatistics(stats1, stats2).get());
+    fn(BuildMergedStatistics(stats2, stats1).get());
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {

Review Comment:
   Isn't here is already enough some test for other properties? I guess I can add it but it might make tests confusion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1254642273


##########
cpp/src/parquet/statistics.cc:
##########
@@ -609,9 +618,10 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     }
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    // FIXME(mwish): distinct count is not encoded for now.

Review Comment:
   Created. But actually I found a lot of issues be forgotten https://github.com/apache/arrow/issues/36505



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1254600944


##########
cpp/src/parquet/statistics.cc:
##########
@@ -609,9 +618,10 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     }
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    // FIXME(mwish): distinct count is not encoded for now.

Review Comment:
   Can you open an issue for this? FIXMEs in the code will be forgotten...



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -590,6 +593,178 @@ TYPED_TEST(TestNumericStatistics, Equals) {
   ASSERT_NO_FATAL_FAILURE(this->TestEquals());
 }
 
+template <typename TestType>
+class TestStatisticsHasFlag : public TestStatistics<TestType> {
+ public:
+  void SetUp() override {
+    TestStatistics<TestType>::SetUp();
+    this->SetUpSchema(Repetition::OPTIONAL);
+  }
+
+  std::shared_ptr<TypedStatistics<TestType>> MergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2) {
+    auto chunk_statistics = MakeStatistics<TestType>(this->schema_.Column(0));
+    chunk_statistics->Merge(stats1);
+    chunk_statistics->Merge(stats2);
+    return chunk_statistics;
+  }
+
+  void VerifyMergedStatistics(
+      const TypedStatistics<TestType>& stats1, const TypedStatistics<TestType>& stats2,
+      const std::function<void(TypedStatistics<TestType>*)>& test_fn) {
+    ASSERT_NO_FATAL_FAILURE(test_fn(MergedStatistics(stats1, stats2).get()));
+    ASSERT_NO_FATAL_FAILURE(test_fn(MergedStatistics(stats2, stats1).get()));
+  }
+
+  // Distinct count should set to false when Merge is called.
+  void TestMergeDistinctCount() {
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      EncodedStatistics encoded_statistics1;
+      statistics1 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics1,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics1->HasDistinctCount());
+    }
+
+    // Create a statistics object with distinct count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_distinct_count = true;
+      encoded_statistics2.distinct_count = 500;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_TRUE(statistics2->HasDistinctCount());
+    }
+
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasDistinctCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_distinct_count);
+                           });
+  }
+
+  // If all values in a page are null or nan, its stats should not set min-max.
+  // Merging its stats with another page having good min-max stats should not
+  // drop the valid min-max from the latter page.
+  void TestMergeMinMax() {
+    this->GenerateData(1000);
+    // Create a statistics object without min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/0,
+                          /*null_count=*/this->values_.size());
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_FALSE(statistics1->HasMinMax());
+      EXPECT_FALSE(encoded_stats1.has_min);
+      EXPECT_FALSE(encoded_stats1.has_max);
+    }
+    // Create a statistics object with min-max.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      statistics2 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics2->Update(this->values_ptr_, this->values_.size(), 0);
+      auto encoded_stats2 = statistics2->Encode();
+      EXPECT_TRUE(statistics2->HasMinMax());
+      EXPECT_TRUE(encoded_stats2.has_min);
+      EXPECT_TRUE(encoded_stats2.has_max);
+    }
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasMinMax());
+                             EXPECT_TRUE(merged_statistics->Encode().has_min);
+                             EXPECT_TRUE(merged_statistics->Encode().has_max);
+                           });
+  }
+
+  // Default statistics should have null_count even if no nulls is written.
+  // However, if statistics is created from thrift message, it might not
+  // have null_count. Merging statistics from such page will result in an
+  // invalid null_count as well.
+  void TestMergeNullCount() {
+    this->GenerateData(/*num_values=*/1000);
+
+    // Page should have null-count even if no nulls
+    std::shared_ptr<TypedStatistics<TestType>> statistics1;
+    {
+      statistics1 = MakeStatistics<TestType>(this->schema_.Column(0));
+      statistics1->Update(this->values_ptr_, /*num_values=*/this->values_.size(),
+                          /*null_count=*/0);
+      auto encoded_stats1 = statistics1->Encode();
+      EXPECT_TRUE(statistics1->HasNullCount());
+      EXPECT_EQ(0, statistics1->null_count());
+      EXPECT_TRUE(statistics1->Encode().has_null_count);
+    }
+    // Merge with null-count should also have null count
+    VerifyMergedStatistics(*statistics1, *statistics1,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_TRUE(merged_statistics->HasNullCount());
+                             EXPECT_EQ(0, merged_statistics->null_count());
+                             auto encoded = merged_statistics->Encode();
+                             EXPECT_TRUE(encoded.has_null_count);
+                             EXPECT_EQ(0, encoded.null_count);
+                           });
+
+    // When loaded from thrift, might not have null count.
+    std::shared_ptr<TypedStatistics<TestType>> statistics2;
+    {
+      EncodedStatistics encoded_statistics2;
+      encoded_statistics2.has_null_count = false;
+      statistics2 = std::dynamic_pointer_cast<TypedStatistics<TestType>>(
+          Statistics::Make(this->schema_.Column(0), &encoded_statistics2,
+                           /*num_values=*/1000));
+      EXPECT_FALSE(statistics2->Encode().has_null_count);
+      EXPECT_FALSE(statistics2->HasNullCount());
+    }
+
+    // Merge without null-count should not have null count
+    VerifyMergedStatistics(*statistics1, *statistics2,
+                           [](TypedStatistics<TestType>* merged_statistics) {
+                             EXPECT_FALSE(merged_statistics->HasNullCount());
+                             EXPECT_FALSE(merged_statistics->Encode().has_null_count);
+                           });
+  }
+
+  // statistics.all_null_value is used to build the page index.
+  // If statistics doesn't have null count, all_null_value should be false.
+  void TestMissingNullCount() {
+    EncodedStatistics encoded_statistics;
+    encoded_statistics.has_null_count = false;
+    auto statistics = Statistics::Make(this->schema_.Column(0), &encoded_statistics,
+                                       /*num_values=*/1000);
+    auto typed_stats = std::dynamic_pointer_cast<TypedStatistics<TestType>>(statistics);
+    EXPECT_FALSE(typed_stats->HasNullCount());
+    auto encoded = typed_stats->Encode();
+    EXPECT_FALSE(encoded.all_null_value);
+    EXPECT_FALSE(encoded.has_null_count);
+    EXPECT_FALSE(encoded.has_distinct_count);
+    EXPECT_FALSE(encoded.has_min);
+    EXPECT_FALSE(encoded.has_max);
+  }
+};
+
+TYPED_TEST_SUITE(TestStatisticsHasFlag, Types);
+
+TYPED_TEST(TestStatisticsHasFlag, MergeDistinctCount) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeDistinctCount());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, MergeNullCount) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeNullCount());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, MergeMinMax) {
+  ASSERT_NO_FATAL_FAILURE(this->TestMergeMinMax());
+}
+
+TYPED_TEST(TestStatisticsHasFlag, MissingNullValues) {

Review Comment:
   Naming consistency
   ```suggestion
   TYPED_TEST(TestStatisticsHasFlag, MissingNullCount) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35989: GH-34351: [C++][Parquet] Statistic: tiny optimization and specification

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1242395214


##########
cpp/src/parquet/statistics.cc:
##########
@@ -610,8 +619,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    if (HasNullCount()) {
+      // num_values_ is reliable and it means number of non-null values.
+      s.all_null_value = num_values_ == 0;

Review Comment:
   Can you add a statistics test for `all_null_value`? Including cases where `num_values_` is not reliable.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -610,8 +619,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     if (HasNullCount()) {
       s.set_null_count(this->null_count());
     }
-    // num_values_ is reliable and it means number of non-null values.
-    s.all_null_value = num_values_ == 0;
+    if (HasNullCount()) {

Review Comment:
   Can you merge this with the `if` before?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -627,7 +640,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   T min_;
   T max_;
   ::arrow::MemoryPool* pool_;
-  int64_t num_values_ = 0;  // # of non-null values.
+  // # of non-null values.
+  // num_values_ would be reliable when has_null_count_,
+  // if it's created from a page thrift statistics, and statistics
+  // doesn't have null_count, but page has null data, `num_values_`
+  // would be equal to value including nulls.

Review Comment:
   This isn't very readable. Do you want to split this into two sentences?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -637,8 +655,9 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void Copy(const T& src, T* dst, ResizableBuffer*) { *dst = src; }
 
-  void IncrementDistinctCount(int64_t n) {
-    statistics_.distinct_count += n;
+  void SetDistinctCount(int64_t n) {
+    // distinct count can only be "set", and cannot be increment.

Review Comment:
   ```suggestion
       // distinct count can only be "set", and cannot be incremented.
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -627,7 +640,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   T min_;
   T max_;
   ::arrow::MemoryPool* pool_;
-  int64_t num_values_ = 0;  // # of non-null values.
+  // # of non-null values.

Review Comment:
   ```suggestion
     // Number of non-null values.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #35989: GH-34351: [C++][Parquet] Statistics: add detail documentation and tiny optimization

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35989:
URL: https://github.com/apache/arrow/pull/35989


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org