You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/02/27 01:51:39 UTC

[GitHub] [arrow] wgtmac commented on a diff in pull request #34355: GH-34351: [C++][Parquet] Statistic: tiny optimization

wgtmac commented on code in PR #34355:
URL: https://github.com/apache/arrow/pull/34355#discussion_r1118201762


##########
cpp/src/parquet/statistics.cc:
##########
@@ -495,9 +495,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
                       bool has_null_count, bool has_distinct_count, MemoryPool* pool)
       : TypedStatisticsImpl(descr, pool) {
     TypedStatisticsImpl::IncrementNumValues(num_values);
+    // Currently, `has_null_count` argument is not used.
+    // Internal has_null_count_ would always be true.

Review Comment:
   IMO, if `has_null_count` is not used we should remove it to reduce confusion and future misuse. Otherwise, we should respect it from the input argument.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -620,7 +626,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  private:
   const ColumnDescriptor* descr_;
   bool has_min_max_ = false;
+  // Currently, has_null_count_ is always true, and would

Review Comment:
   This is confusing because the default value below is `false`



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType> {
     ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
   }
 
+  void TestMergeEmpty() {

Review Comment:
   It would be better to add a case where either side does not have a valid stats (min/max/null_count/distinct_count) meaning that the merged stats is dropped.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -620,7 +626,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
  private:
   const ColumnDescriptor* descr_;
   bool has_min_max_ = false;
+  // Currently, has_null_count_ is always true, and would
+  // be collected and encoded to `EncodedStatistics`.
   bool has_null_count_ = false;
+  // Currently, has_distinct_count_ would not be encoded

Review Comment:
   It seems that `statistics_` already has a set of `hasXXX` variables. Can we reuse those directly?
   
   ```cpp
   class PARQUET_EXPORT EncodedStatistics {
     std::shared_ptr<std::string> max_, min_;
     bool is_signed_ = false;
   
    public:
     EncodedStatistics()
         : max_(std::make_shared<std::string>()), min_(std::make_shared<std::string>()) {}
   
   
     int64_t null_count = 0;
     int64_t distinct_count = 0;
   
     bool has_min = false;
     bool has_max = false;
     bool has_null_count = false;
     bool has_distinct_count = false;
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -553,10 +556,13 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   void Merge(const TypedStatistics<DType>& other) override {
     this->num_values_ += other.num_values();
     if (other.HasNullCount()) {
-      this->statistics_.null_count += other.null_count();
+      this->IncrementNullCount(other.null_count());
     }
-    if (other.HasDistinctCount()) {
-      this->statistics_.distinct_count += other.distinct_count();
+    if (HasDistinctCount() && other.HasDistinctCount()) {
+      this->IncrementDistinctCount(other.distinct_count());
+    } else {
+      this->has_distinct_count_ = false;
+      this->has_distinct_count_ = 0;

Review Comment:
   ```suggestion
         statistics_.distinct_count = 0;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org