You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/03/10 08:24:10 UTC

[GitHub] [arrow] mapleFU opened a new pull request, #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

mapleFU opened a new pull request, #34526:
URL: https://github.com/apache/arrow/pull/34526

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   Create RLE Encoder for Boolean.
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Boolean current can only use Plain Encoder, here it support RLE.
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Create encoder
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Yes
   
   ### Are there any user-facing changes?
   
   Yes, user can use new kind of encoding.
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1463442576

   :warning: GitHub issue #15107 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1477990931

   ```
    55/113 Test  #56: arrow-dataset-scanner-test ................   Passed    4.65 sec
           Start  57: arrow-dataset-file-csv-test
    56/113 Test  #46: arrow-threading-utility-test ..............   Passed   31.67 sec
           Start  58: arrow-dataset-file-json-test
    57/113 Test  #57: arrow-dataset-file-csv-test ...............***Failed   27.37 sec
   Running main() from D:\bld\gtest_1679003944036\work\googletest\src\gtest_main.cc
   [==========] Running 176 tests from 12 test suites.
   [----------] Global test environment set-up.
   ```
   
   Seems that https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/46564157 is unrelated to my patch. Can someone help rerun that case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1143228128


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2883,7 +2989,26 @@ class RleBooleanDecoder : public DecoderImpl, virtual public BooleanDecoder {
   int DecodeArrow(int num_values, int null_count, const uint8_t* valid_bits,
                   int64_t valid_bits_offset,
                   typename EncodingTraits<BooleanType>::Accumulator* out) override {
-    ParquetException::NYI("DecodeArrow for RleBooleanDecoder");
+    if (null_count != 0) {
+      ParquetException::NYI("RleBoolean DecodeArrow with null slots");

Review Comment:
   Yes...Here implement it just for testing. And later `DecodeArrow with null slots` not have todo. I may do it later



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1467262381

   @pitrou @emkornfield Mind take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1463448482

   cc @rok @pitrou @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138947189


##########
cpp/src/parquet/encoding.cc:
##########
@@ -512,14 +520,7 @@ class DictEncoderImpl : public EncoderImpl, virtual public DictEncoder<DType> {
   /// Returns a conservative estimate of the number of bytes needed to encode the buffered
   /// indices. Used to size the buffer passed to WriteIndices().
   int64_t EstimatedDataEncodedSize() override {
-    // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
-    // reserve
-    // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
-    // but not reserving them would cause the encoder to fail.
-    return 1 +
-           ::arrow::util::RleEncoder::MaxBufferSize(
-               bit_width(), static_cast<int>(buffered_indices_.size())) +
-           ::arrow::util::RleEncoder::MinBufferSize(bit_width());
+    return RlePreserveBufferSize(static_cast<int>(buffered_indices_.size()), bit_width());

Review Comment:
   They should be same, but they're used in different encoders



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138753216


##########
cpp/src/arrow/util/rle_encoding.h:
##########
@@ -209,8 +209,7 @@ class RleEncoder {
     // bit-packed repeated value
     int min_repeated_run_size =
         1 + static_cast<int>(::arrow::bit_util::BytesForBits(bit_width));
-    int repeated_max_size = static_cast<int>(::arrow::bit_util::CeilDiv(num_values, 8)) *
-                            min_repeated_run_size;

Review Comment:
   Good catch. Removed code does seem wrong. Has this caused bugs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138753216


##########
cpp/src/arrow/util/rle_encoding.h:
##########
@@ -209,8 +209,7 @@ class RleEncoder {
     // bit-packed repeated value
     int min_repeated_run_size =
         1 + static_cast<int>(::arrow::bit_util::BytesForBits(bit_width));
-    int repeated_max_size = static_cast<int>(::arrow::bit_util::CeilDiv(num_values, 8)) *
-                            min_repeated_run_size;

Review Comment:
   Removed code does seem wrong. Has this caused bugs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139092769


##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +465,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve
+  // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+  // but not reserving them would cause the encoder to fail.
+  return 1 + ::arrow::util::RleEncoder::MaxBufferSize(bit_width, num_values) +
+         ::arrow::util::RleEncoder::MinBufferSize(bit_width);

Review Comment:
   I supposed you extracted a new function here for clarity, but you call in multiple places now, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rok commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1478325158

   > @rok does python tests have test for boolean encoding / decoding ? I didn't found it in `python/pyarrow/tests/parquet/test_basic.py`
   
   Sorry for the late reply. `test_column_encoding` indeed does't test boolean columns in RLE encoding. It would probably make sense to add it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139111508


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,109 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override {
+    return kRleLengthInBytes + MaxRleBufferSize();
+  }
+
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder expects BooleanArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    if (values.null_count() == 0) {
+      for (int i = 0; i < boolean_array.length(); ++i) {
+        // null_count == 0, so just call Value directly is ok.
+        buffered_append_values_.push_back(boolean_array.Value(i));
+      }
+    } else {
+      PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+          *boolean_array.data(),
+          [&](bool value) {
+            buffered_append_values_.push_back(value);
+            return Status::OK();
+          },
+          []() { return Status::OK(); }));

Review Comment:
   Not handling nulls?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1476950029

   @wjones127 I'm not familiar with this part of the codebase to really have an opinion. I was asking some questions to learn more about and trying to help with the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1478006661

   > Can someone help rerun that case?
   
   I've triggered a re-run
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1143238838


##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +466,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve

Review Comment:
   Pasted from `DictEncoder`, let me change it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1133365055


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }
+
+void RleBooleanEncoder::Put(const std::vector<bool>& src, int num_values) {
+  PutImpl(src, num_values);
+}
+
+template <typename SequenceType>
+void RleBooleanEncoder::PutImpl(const SequenceType& src, int num_values) {
+  for (int i = 0; i < num_values; ++i) {
+    // TODO(mwish): using iterator to batch insert?
+    buffered_append_values_.push_back(src[i]);
+  }
+}
+
+int64_t RleBooleanEncoder::EstimatedDataEncodedSize() {
+  return kRleLengthInBytes + MaxRleBufferSize();

Review Comment:
   IIUC, max size is estimated for the worse case of bit packing and min size is estimated for repeated rle. So we have a risk of over estimation for the real data here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1477909125

   I create an issue for DecodeArrow and nulls support: https://github.com/apache/arrow/issues/34660


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139114712


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,109 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override {
+    return kRleLengthInBytes + MaxRleBufferSize();
+  }
+
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder expects BooleanArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    if (values.null_count() == 0) {
+      for (int i = 0; i < boolean_array.length(); ++i) {
+        // null_count == 0, so just call Value directly is ok.
+        buffered_append_values_.push_back(boolean_array.Value(i));
+      }
+    } else {
+      PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+          *boolean_array.data(),
+          [&](bool value) {
+            buffered_append_values_.push_back(value);
+            return Status::OK();
+          },
+          []() { return Status::OK(); }));

Review Comment:
   encoder should **not** understanding num-of-nulls, they only handle non-null values.
   
   I guess I forget to test all-null pages. I'll add tests here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1476953218

   @felipecrv Just wanted to make sure you are satisfied with responses to your earlier comments before merging :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138798351


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,113 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override {
+    // FIXME(mwish): should we just use buffered_append_values_.size() / 8
+    //  or just use ::arrow::util::RleEncoder::MaxBufferSize?
+    return kRleLengthInBytes + MaxRleBufferSize();
+  }

Review Comment:
   Do one pass over the data to count the number of runs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1133111129


##########
cpp/src/parquet/encoding.cc:
##########
@@ -20,6 +20,7 @@
 #include <algorithm>
 #include <cstdint>
 #include <cstdlib>
+#include <iostream>

Review Comment:
   Remove it?



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",

Review Comment:
   ```suggestion
         throw ParquetException("RleBooleanEncoder expects BooleanArray, got ",
   ```



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;

Review Comment:
   Add one blank line below?



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }

Review Comment:
   Should we support a `const uint8_t*` variant? The decoder has that support.



##########
cpp/src/parquet/encoding_test.cc:
##########
@@ -1576,6 +1576,60 @@ TYPED_TEST(TestDeltaBitPackEncoding, NonZeroPaddedMiniblockBitWidth) {
   }
 }
 
+// ----------------------------------------------------------------------
+// Rle for Boolean encode/decode tests.
+
+class TestRleBooleanEncoding : public TestEncodingBase<BooleanType> {
+ public:
+  using c_type = bool;
+  static constexpr int TYPE = Type::BOOLEAN;
+
+  virtual void CheckRoundtrip() {
+    auto encoder = MakeTypedEncoder<BooleanType>(Encoding::RLE,
+                                                 /*use_dictionary=*/false, descr_.get());
+    auto decoder = MakeTypedDecoder<BooleanType>(Encoding::RLE, descr_.get());
+
+    encoder->Put(draws_, num_values_);
+    encode_buffer_ = encoder->FlushValues();
+
+    decoder->SetData(num_values_, encode_buffer_->data(),
+                     static_cast<int>(encode_buffer_->size()));
+    int values_decoded = decoder->Decode(decode_buf_, num_values_);
+    ASSERT_EQ(num_values_, values_decoded);
+    ASSERT_NO_FATAL_FAILURE(VerifyResults<c_type>(decode_buf_, draws_, num_values_));
+  }
+
+  void CheckRoundtripSpaced(const uint8_t* valid_bits, int64_t valid_bits_offset) {
+    auto encoder = MakeTypedEncoder<BooleanType>(Encoding::RLE,
+                                                 /*use_dictionary=*/false, descr_.get());
+    auto decoder = MakeTypedDecoder<BooleanType>(Encoding::RLE, descr_.get());
+    int null_count = 0;
+    for (auto i = 0; i < num_values_; i++) {
+      if (!bit_util::GetBit(valid_bits, valid_bits_offset + i)) {
+        null_count++;
+      }
+    }
+
+    encoder->PutSpaced(draws_, num_values_, valid_bits, valid_bits_offset);
+    encode_buffer_ = encoder->FlushValues();
+    decoder->SetData(num_values_ - null_count, encode_buffer_->data(),
+                     static_cast<int>(encode_buffer_->size()));
+    auto values_decoded = decoder->DecodeSpaced(decode_buf_, num_values_, null_count,
+                                                valid_bits, valid_bits_offset);
+    ASSERT_EQ(num_values_, values_decoded);
+    ASSERT_NO_FATAL_FAILURE(VerifyResultsSpaced<c_type>(decode_buf_, draws_, num_values_,
+                                                        valid_bits, valid_bits_offset));
+  }
+};
+
+TEST_F(TestRleBooleanEncoding, BasicRoundTrip) {
+  ASSERT_NO_FATAL_FAILURE(this->Execute(0, 0));
+  ASSERT_NO_FATAL_FAILURE(this->Execute(2000, 200));
+  ASSERT_NO_FATAL_FAILURE(this->ExecuteSpaced(

Review Comment:
   Add a case to cover arrow array as input?



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }
+
+void RleBooleanEncoder::Put(const std::vector<bool>& src, int num_values) {
+  PutImpl(src, num_values);
+}
+
+template <typename SequenceType>
+void RleBooleanEncoder::PutImpl(const SequenceType& src, int num_values) {
+  for (int i = 0; i < num_values; ++i) {
+    // TODO(mwish): using iterator to batch insert?
+    buffered_append_values_.push_back(src[i]);
+  }
+}
+
+int64_t RleBooleanEncoder::EstimatedDataEncodedSize() {
+  return kRleLengthInBytes + MaxRleBufferSize();

Review Comment:
   This may have a risk of over estimation. Could we encode it in a streaming way?



##########
cpp/src/parquet/encoding.cc:
##########
@@ -3265,6 +3380,13 @@ std::unique_ptr<Encoder> MakeEncoder(Type::type type_num, Encoding::type encodin
       default:
         throw ParquetException("DELTA_LENGTH_BYTE_ARRAY only supports BYTE_ARRAY");
     }
+  } else if (encoding == Encoding::RLE) {
+    switch (type_num) {
+      case Type::BOOLEAN:
+        return std::make_unique<RleBooleanEncoder>(descr, pool);
+      default:
+        throw ParquetException("RLE only supports BOOL within data page");

Review Comment:
   This is not correct. It also supports dictionary indices in the data page. But I think `RLE only supports BOOL` is enough here.



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),

Review Comment:
   Should we check `values.null_count()` and go to a fast path if there is no null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 merged pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #34526:
URL: https://github.com/apache/arrow/pull/34526


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139158319


##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +465,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve
+  // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+  // but not reserving them would cause the encoder to fail.
+  return 1 + ::arrow::util::RleEncoder::MaxBufferSize(bit_width, num_values) +
+         ::arrow::util::RleEncoder::MinBufferSize(bit_width);

Review Comment:
   Ok, thanks for your comments



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138915205


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,109 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override {
+    return kRleLengthInBytes + MaxRleBufferSize();
+  }
+
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder expects BooleanArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    if (values.null_count() == 0) {
+      for (int i = 0; i < boolean_array.length(); ++i) {
+        // null_count == 0, so just call Value directly is ok.
+        buffered_append_values_.push_back(boolean_array.Value(i));
+      }
+    } else {
+      PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+          *boolean_array.data(),
+          [&](bool value) {
+            buffered_append_values_.push_back(value);
+            return Status::OK();
+          },
+          []() { return Status::OK(); }));
+    }
+  }
+
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    return RlePreserveBufferSize(static_cast<int>(buffered_append_values_.size()),
+                                 kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;

Review Comment:
   Will this be large? Should we use `ArrowPoolVector<bool>` instead?



##########
cpp/src/parquet/encoding.cc:
##########
@@ -115,7 +115,6 @@ class PlainEncoder : public EncoderImpl, virtual public TypedEncoder<DType> {
   using TypedEncoder<DType>::Put;
 
   void Put(const T* buffer, int num_values) override;
-

Review Comment:
   Keep the blank line to be consistent.



##########
cpp/src/parquet/encoding.cc:
##########
@@ -512,14 +520,7 @@ class DictEncoderImpl : public EncoderImpl, virtual public DictEncoder<DType> {
   /// Returns a conservative estimate of the number of bytes needed to encode the buffered
   /// indices. Used to size the buffer passed to WriteIndices().
   int64_t EstimatedDataEncodedSize() override {
-    // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
-    // reserve
-    // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
-    // but not reserving them would cause the encoder to fail.
-    return 1 +
-           ::arrow::util::RleEncoder::MaxBufferSize(
-               bit_width(), static_cast<int>(buffered_indices_.size())) +
-           ::arrow::util::RleEncoder::MinBufferSize(bit_width());
+    return RlePreserveBufferSize(static_cast<int>(buffered_indices_.size()), bit_width());

Review Comment:
   Consolidate `EstimatedDataEncodedSize()` with `MaxRleBufferSize()`? It seems that they are the same.



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }

Review Comment:
   We have `int RleBooleanDecoder::Decode(uint8_t* buffer, int max_values)` so it would be nice to have the encoder parity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139094627


##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +465,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve
+  // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+  // but not reserving them would cause the encoder to fail.
+  return 1 + ::arrow::util::RleEncoder::MaxBufferSize(bit_width, num_values) +
+         ::arrow::util::RleEncoder::MinBufferSize(bit_width);

Review Comment:
   If the extract was just for the sake of clarify, creating a variable would be preferable. You can mark this comment thread as resolved. :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1133454491


##########
cpp/src/parquet/encoding_test.cc:
##########
@@ -1576,6 +1576,60 @@ TYPED_TEST(TestDeltaBitPackEncoding, NonZeroPaddedMiniblockBitWidth) {
   }
 }
 
+// ----------------------------------------------------------------------
+// Rle for Boolean encode/decode tests.
+
+class TestRleBooleanEncoding : public TestEncodingBase<BooleanType> {
+ public:
+  using c_type = bool;
+  static constexpr int TYPE = Type::BOOLEAN;
+
+  virtual void CheckRoundtrip() {
+    auto encoder = MakeTypedEncoder<BooleanType>(Encoding::RLE,
+                                                 /*use_dictionary=*/false, descr_.get());
+    auto decoder = MakeTypedDecoder<BooleanType>(Encoding::RLE, descr_.get());
+
+    encoder->Put(draws_, num_values_);
+    encode_buffer_ = encoder->FlushValues();
+
+    decoder->SetData(num_values_, encode_buffer_->data(),
+                     static_cast<int>(encode_buffer_->size()));
+    int values_decoded = decoder->Decode(decode_buf_, num_values_);
+    ASSERT_EQ(num_values_, values_decoded);
+    ASSERT_NO_FATAL_FAILURE(VerifyResults<c_type>(decode_buf_, draws_, num_values_));
+  }
+
+  void CheckRoundtripSpaced(const uint8_t* valid_bits, int64_t valid_bits_offset) {
+    auto encoder = MakeTypedEncoder<BooleanType>(Encoding::RLE,
+                                                 /*use_dictionary=*/false, descr_.get());
+    auto decoder = MakeTypedDecoder<BooleanType>(Encoding::RLE, descr_.get());
+    int null_count = 0;
+    for (auto i = 0; i < num_values_; i++) {
+      if (!bit_util::GetBit(valid_bits, valid_bits_offset + i)) {
+        null_count++;
+      }
+    }
+
+    encoder->PutSpaced(draws_, num_values_, valid_bits, valid_bits_offset);
+    encode_buffer_ = encoder->FlushValues();
+    decoder->SetData(num_values_ - null_count, encode_buffer_->data(),
+                     static_cast<int>(encode_buffer_->size()));
+    auto values_decoded = decoder->DecodeSpaced(decode_buf_, num_values_, null_count,
+                                                valid_bits, valid_bits_offset);
+    ASSERT_EQ(num_values_, values_decoded);
+    ASSERT_NO_FATAL_FAILURE(VerifyResultsSpaced<c_type>(decode_buf_, draws_, num_values_,
+                                                        valid_bits, valid_bits_offset));
+  }
+};
+
+TEST_F(TestRleBooleanEncoding, BasicRoundTrip) {
+  ASSERT_NO_FATAL_FAILURE(this->Execute(0, 0));
+  ASSERT_NO_FATAL_FAILURE(this->Execute(2000, 200));
+  ASSERT_NO_FATAL_FAILURE(this->ExecuteSpaced(

Review Comment:
   Arrow case cannot decode...Let me amend a Decode later



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1133410289


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }

Review Comment:
   Seems `PlainEncoder` doesn't support that? I think u8 is ok but just leave it here first.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1473293665

   @rok @wjones127 PTAL :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1478839214

   Benchmark runs are scheduled for baseline = f2c392871d0ec5469d8d85a3fa7a1cea2ddc5d81 and contender = aa8a118d4862332259aa81abab4bc335dbbc7f30. aa8a118d4862332259aa81abab4bc335dbbc7f30 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e64c76a8e98440cba2705a88f60ae049...c74a8faf662941f79f50a2805c08554a/)
   [Failed :arrow_down:0.85% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f7d6046a95474f92a1f5be0f120bce72...4693700093e9455ca235273d19dbdb79/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5108c2b7786042b28a21954679fe3f11...1f5340e26dcb4befb402ff52647be679/)
   [Finished :arrow_down:0.5% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0a7e8f055c994563bc9543955265d49c...8bee61818f7b49ff95f3ba9c400db1e1/)
   Buildkite builds:
   [Finished] [`aa8a118d` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2550)
   [Failed] [`aa8a118d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2580)
   [Finished] [`aa8a118d` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2548)
   [Finished] [`aa8a118d` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2571)
   [Finished] [`f2c39287` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2549)
   [Finished] [`f2c39287` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2579)
   [Finished] [`f2c39287` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2547)
   [Finished] [`f2c39287` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2570)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139820203


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,109 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override {
+    return kRleLengthInBytes + MaxRleBufferSize();
+  }
+
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder expects BooleanArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    if (values.null_count() == 0) {
+      for (int i = 0; i < boolean_array.length(); ++i) {
+        // null_count == 0, so just call Value directly is ok.
+        buffered_append_values_.push_back(boolean_array.Value(i));
+      }
+    } else {
+      PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+          *boolean_array.data(),
+          [&](bool value) {
+            buffered_append_values_.push_back(value);
+            return Status::OK();
+          },
+          []() { return Status::OK(); }));

Review Comment:
   Nulls, in parquet, in handled by multi-level-path writer. It write rep-levels and def-levels in upper layer.
   
   Encoder only handles non-null variables. ColumnWriter may collect some stats for null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1477668261

   @rok  does python tests have test for boolean encoding / decoding ? I didn't found it in `python/pyarrow/tests/parquet/test_basic.py`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1478065364

   Thanks @raulcd 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1143437778


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2855,7 +2963,7 @@ class RleBooleanDecoder : public DecoderImpl, virtual public BooleanDecoder {
                              " (corrupt data page?)");
     }
     // Load the first 4 bytes in little-endian, which indicates the length
-    num_bytes = ::arrow::bit_util::ToLittleEndian(SafeLoadAs<uint32_t>(data));

Review Comment:
   (In fact, in little endian machine, they are just nop, and in big endian machine, they just do byteswap, but the syntax here is ambigious)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1465767244

   ```
   C:\projects\arrow\cpp\build\src\parquet\CMakeFiles\parquet_shared.dir\Unity\unity_1_cxx.cxx
   C:/projects/arrow/cpp/src/parquet/encoding.cc(2901): error C2220: the following warning is treated as an error
   C:/projects/arrow/cpp/src/parquet/encoding.cc(2901): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data
   C:/projects/arrow/cpp/src/parquet/encoding.cc(2901): warning C4244: 'return': conversion from 'int64_t' to 'int', possible loss of data
   ```
   
   ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1474149842

   @pitrou mind take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139090111


##########
cpp/src/arrow/util/rle_encoding.h:
##########
@@ -209,8 +209,7 @@ class RleEncoder {
     // bit-packed repeated value
     int min_repeated_run_size =
         1 + static_cast<int>(::arrow::bit_util::BytesForBits(bit_width));
-    int repeated_max_size = static_cast<int>(::arrow::bit_util::CeilDiv(num_values, 8)) *
-                            min_repeated_run_size;

Review Comment:
   Ahh. This change preserves equivalence.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138834639


##########
cpp/src/arrow/util/rle_encoding.h:
##########
@@ -209,8 +209,7 @@ class RleEncoder {
     // bit-packed repeated value
     int min_repeated_run_size =
         1 + static_cast<int>(::arrow::bit_util::BytesForBits(bit_width));
-    int repeated_max_size = static_cast<int>(::arrow::bit_util::CeilDiv(num_values, 8)) *
-                            min_repeated_run_size;

Review Comment:
   Nope, in fact, `static_cast<int>(::arrow::bit_util::CeilDiv(num_values, 8))` has occured twice, so for simplify, I rename it



##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +465,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve
+  // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+  // but not reserving them would cause the encoder to fail.
+  return 1 + ::arrow::util::RleEncoder::MaxBufferSize(bit_width, num_values) +
+         ::arrow::util::RleEncoder::MinBufferSize(bit_width);

Review Comment:
   int64_t seems ok, but previously, the code using `RlePreserveBufferSize` never do a overflow checking. So I use int here.
   
   What do you mean by creating local variable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139092769


##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +465,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve
+  // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+  // but not reserving them would cause the encoder to fail.
+  return 1 + ::arrow::util::RleEncoder::MaxBufferSize(bit_width, num_values) +
+         ::arrow::util::RleEncoder::MinBufferSize(bit_width);

Review Comment:
   I supposes you extracted a new function here for clarity, but you call in multiple places now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1134841219


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2855,7 +2963,7 @@ class RleBooleanDecoder : public DecoderImpl, virtual public BooleanDecoder {
                              " (corrupt data page?)");
     }
     // Load the first 4 bytes in little-endian, which indicates the length
-    num_bytes = ::arrow::bit_util::ToLittleEndian(SafeLoadAs<uint32_t>(data));

Review Comment:
   I think previously here could be strange, it should be `FromLittleEndian`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1463442529

   * Closes: #15107


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1143456517


##########
cpp/src/parquet/encoding.cc:
##########
@@ -489,6 +489,10 @@ class DictEncoderImpl : public EncoderImpl, virtual public DictEncoder<DType> {
  public:
   typedef typename DType::c_type T;
 
+  /// In data page, the bit width used to encode the entry
+  /// ids stored as 1 byte (max bit width = 32).
+  constexpr static int32_t kDataPageBitWidthBytes = 1;

Review Comment:
   @wgtmac I extract that `1` to a `kDataPageBitWidthBytes` here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1142829225


##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +466,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve

Review Comment:
   The format of comment looks strange.



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2883,7 +2989,26 @@ class RleBooleanDecoder : public DecoderImpl, virtual public BooleanDecoder {
   int DecodeArrow(int num_values, int null_count, const uint8_t* valid_bits,
                   int64_t valid_bits_offset,
                   typename EncodingTraits<BooleanType>::Accumulator* out) override {
-    ParquetException::NYI("DecodeArrow for RleBooleanDecoder");
+    if (null_count != 0) {
+      ParquetException::NYI("RleBoolean DecodeArrow with null slots");

Review Comment:
   Is this a TODO or misuse for `null_count != 0`?



##########
cpp/src/parquet/encoding.cc:
##########
@@ -2855,7 +2963,7 @@ class RleBooleanDecoder : public DecoderImpl, virtual public BooleanDecoder {
                              " (corrupt data page?)");
     }
     // Load the first 4 bytes in little-endian, which indicates the length
-    num_bytes = ::arrow::bit_util::ToLittleEndian(SafeLoadAs<uint32_t>(data));

Review Comment:
   Good catch



##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +466,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve
+  // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+  // but not reserving them would cause the encoder to fail.
+  return 1 + ::arrow::util::RleEncoder::MaxBufferSize(bit_width, num_values) +

Review Comment:
   What does 1 mean here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1133417345


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }
+
+void RleBooleanEncoder::Put(const std::vector<bool>& src, int num_values) {
+  PutImpl(src, num_values);
+}
+
+template <typename SequenceType>
+void RleBooleanEncoder::PutImpl(const SequenceType& src, int num_values) {
+  for (int i = 0; i < num_values; ++i) {
+    // TODO(mwish): using iterator to batch insert?
+    buffered_append_values_.push_back(src[i]);
+  }
+}
+
+int64_t RleBooleanEncoder::EstimatedDataEncodedSize() {
+  return kRleLengthInBytes + MaxRleBufferSize();

Review Comment:
   Yes, the syntax for it it's really trickey. MaxSize even possible less than MinSize, and the estimation is really a mass.
   
   Though may have a risk of over estimation, DictEncoder estimate value in this way. And if we want just an estimation, we can just use `size-of-values / 8` as an estimation.
   
   Streaming Encoding needs to refactor `::arrow::util::RleEncoder`, I think it's a bit hard here. I can open an issue and do it later



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1133286277


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }
+
+void RleBooleanEncoder::Put(const std::vector<bool>& src, int num_values) {
+  PutImpl(src, num_values);
+}
+
+template <typename SequenceType>
+void RleBooleanEncoder::PutImpl(const SequenceType& src, int num_values) {
+  for (int i = 0; i < num_values; ++i) {
+    // TODO(mwish): using iterator to batch insert?
+    buffered_append_values_.push_back(src[i]);
+  }
+}
+
+int64_t RleBooleanEncoder::EstimatedDataEncodedSize() {
+  return kRleLengthInBytes + MaxRleBufferSize();

Review Comment:
   Why, since it's just rle estimations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138787135


##########
cpp/src/parquet/encoding.cc:
##########
@@ -466,6 +465,15 @@ struct DictEncoderTraits<FLBAType> {
 // Initially 1024 elements
 static constexpr int32_t kInitialHashTableSize = 1 << 10;
 
+int RlePreserveBufferSize(int num_values, int bit_width) {
+  // Note: because of the way RleEncoder::CheckBufferFull() is called, we have to
+  // reserve
+  // an extra "RleEncoder::MinBufferSize" bytes. These extra bytes won't be used
+  // but not reserving them would cause the encoder to fail.
+  return 1 + ::arrow::util::RleEncoder::MaxBufferSize(bit_width, num_values) +
+         ::arrow::util::RleEncoder::MinBufferSize(bit_width);

Review Comment:
   You could avoid this extraction by creating a local variable called `num_values`.
   
   Note that the caller returns `int64_t`, so you can also return `int64_t` and avoid a potential `int` overflow when summing `a + int + int`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1138944982


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,109 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override {
+    return kRleLengthInBytes + MaxRleBufferSize();
+  }
+
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder expects BooleanArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    if (values.null_count() == 0) {
+      for (int i = 0; i < boolean_array.length(); ++i) {
+        // null_count == 0, so just call Value directly is ok.
+        buffered_append_values_.push_back(boolean_array.Value(i));
+      }
+    } else {
+      PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+          *boolean_array.data(),
+          [&](bool value) {
+            buffered_append_values_.push_back(value);
+            return Status::OK();
+          },
+          []() { return Status::OK(); }));
+    }
+  }
+
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    return RlePreserveBufferSize(static_cast<int>(buffered_append_values_.size()),
+                                 kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;

Review Comment:
   Nice catch, I think it won't be too large, because it's just a bitmap. But `ArrowPoolVector<bool>` seems much better?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34526:
URL: https://github.com/apache/arrow/pull/34526#discussion_r1139157279


##########
cpp/src/parquet/encoding.cc:
##########
@@ -2838,6 +2839,120 @@ class DeltaLengthByteArrayDecoder : public DecoderImpl,
   std::shared_ptr<ResizableBuffer> buffered_data_;
 };
 
+// ----------------------------------------------------------------------
+// RLE_BOOLEAN_ENCODER
+
+class RleBooleanEncoder final : public EncoderImpl, virtual public BooleanEncoder {
+ public:
+  explicit RleBooleanEncoder(const ColumnDescriptor* descr, ::arrow::MemoryPool* pool)
+      : EncoderImpl(descr, Encoding::RLE, pool) {}
+
+  int64_t EstimatedDataEncodedSize() override;
+  std::shared_ptr<Buffer> FlushValues() override;
+
+  void Put(const T* buffer, int num_values) override;
+  void Put(const ::arrow::Array& values) override {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException("RleBooleanEncoder Expected BoolTArray, got ",
+                             values.type()->ToString());
+    }
+    const auto& boolean_array = checked_cast<const ::arrow::BooleanArray&>(values);
+    PARQUET_THROW_NOT_OK(::arrow::VisitArraySpanInline<::arrow::BooleanType>(
+        *boolean_array.data(),
+        [&](bool value) {
+          buffered_append_values_.push_back(value);
+          return Status::OK();
+        },
+        []() { return Status::OK(); }));
+  }
+  void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
+                 int64_t valid_bits_offset) override {
+    if (valid_bits != NULLPTR) {
+      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
+                                                                   this->memory_pool()));
+      T* data = reinterpret_cast<T*>(buffer->mutable_data());
+      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
+          src, num_values, valid_bits, valid_bits_offset, data);
+      Put(data, num_valid_values);
+    } else {
+      Put(src, num_values);
+    }
+  }
+
+  void Put(const std::vector<bool>& src, int num_values) override;
+
+ protected:
+  template <typename ArrowType>
+  void PutImpl(const ::arrow::Array& values) {
+    if (values.type_id() != ::arrow::Type::BOOL) {
+      throw ParquetException(std::string() + "direct put to " + ArrowType::type_name() +
+                             " from " + values.type()->ToString() + " not supported");
+    }
+    const auto& data = *values.data();
+    PutSpaced(data.GetValues<typename ArrowType::c_type>(1),
+              static_cast<int>(data.length), data.GetValues<uint8_t>(0, 0), data.offset);
+  }
+
+  template <typename SequenceType>
+  void PutImpl(const SequenceType& src, int num_values);
+
+  int MaxRleBufferSize() const noexcept {
+    // TODO(mwish): Encapsulate these rules.
+    return 1 +
+           ::arrow::util::RleEncoder::MaxBufferSize(
+               kBitWidth,
+               static_cast<int>(static_cast<int>(buffered_append_values_.size()))) +
+           ::arrow::util::RleEncoder::MinBufferSize(kBitWidth);
+  }
+
+  constexpr static int32_t kBitWidth = 1;
+  /// 4 bytes in little-endian, which indicates the length.
+  constexpr static int32_t kRleLengthInBytes = 4;
+
+  // std::vector<bool> in C++ is tricky, because it's a bitmap.
+  // Here RleBooleanEncoder will only append values into it, and
+  // dump values into Buffer, so using it here is ok.
+  std::vector<bool> buffered_append_values_;
+};
+
+void RleBooleanEncoder::Put(const bool* src, int num_values) { PutImpl(src, num_values); }

Review Comment:
   Currently I thinks keep it is ok. I can open a issue, and support write `PlainBooleanEncoder` and `RleBooleanEncoder` by u8 and bitmap later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34526: GH-15107: [C++][Parquet] Parquet Encoder: Support RLE for Boolean

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34526:
URL: https://github.com/apache/arrow/pull/34526#issuecomment-1476826695

   @felipecrv Does this look good to you?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org