You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/05/19 17:16:58 UTC

[GitHub] [arrow] mapleFU opened a new pull request, #35691: GH-34785: [C++][Parquet] Initialize framework for Parquet bloom filter writer

mapleFU opened a new pull request, #35691:
URL: https://github.com/apache/arrow/pull/35691

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Currently we allow reading bloom filter for specific column and rowgroup, now this patch allow it writing BF.
   
   This patch is just a skeleton. If reviewer thinks interface would be OK, I'll go on and add testing.
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Allow writing bf:
   
   * [x] Add WriterProperties config for writing bloom filter, including bf and (per-rowgroup) ndv estimation.
   * [x] Add BloomFilterWriter for parquet
   * [x] From FileSerializer to ColumnWriter, adding bloomfilter
   * [x] Ensure Bloom Filter info is written to the file
   * [ ] Testing logic above
   
   ### Are these changes tested?
   
   Currently not, after reviewers think it's ok, I'll continue testing it.
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   ### Are there any user-facing changes?
   
   User can create Bloom Filter in parquet with C++ api
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208107739


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {

Review Comment:
   I think there is already a structure we are using for indexing and in other places of offset and length, lets please use that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1306448597


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.
+  void AddBloomFilter(int32_t column_id, int64_t offset, int32_t length);
+
+  /// Get bloom filter offsets of a specific row group.
+  bool GetBloomFilterOffsets(size_t row_group_ordinal,
+                             const std::map<int32_t, Reference>** out) const;
+
+  bool empty() const { return references_.empty(); }
+
+ private:
+  std::vector<std::map<int32_t, Reference>> references_;
+};
+
+class InternalFileEncryptor;
+
+namespace schema {
+class ColumnPath;
+}
+
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  explicit BloomFilterWriter(const WriterProperties& properties)
+      : properties_(properties) {}
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Return a BloomFilter defined by `col_path`.
+  ///
+  /// * If the col_path has a bloom filter, create a BloomFilter in
+  ///  `row_group_bloom_filters_` and return.
+  /// * Otherwise, return nullptr.
+  BloomFilter* GetOrCreateBloomFilter(
+      const std::shared_ptr<schema::ColumnPath>& col_path,
+      const std::function<std::unique_ptr<BloomFilter>()>& bf_creator);
+
+  /// Serialize all bloom filters with header and bitset in the order of row group and
+  /// column id. Column encryption is not implemented yet. The side effect is that it
+  /// deletes all bloom filters after they have been flushed.
+  void WriteTo(::arrow::io::OutputStream* sink, const SchemaDescriptor* schema,

Review Comment:
   > const SchemaDescriptr& unless schema is optional (int which case behavior should be documentated).
   
   Yes I think it's a better C++ style, however, arrow parquet already uses `const SchemaDescriptor* schema` everywhere. I don't want to break that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208108804


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.
+  void AddBloomFilter(int32_t column_id, int64_t offset, int32_t length);
+
+  /// Get bloom filter offsets of a specific row group.
+  bool GetBloomFilterOffsets(size_t row_group_ordinal,
+                             const std::map<int32_t, Reference>** out) const;
+
+  bool empty() const { return references_.empty(); }
+
+ private:
+  std::vector<std::map<int32_t, Reference>> references_;
+};
+
+class InternalFileEncryptor;
+
+namespace schema {
+class ColumnPath;
+}
+
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  explicit BloomFilterWriter(const WriterProperties& properties)
+      : properties_(properties) {}
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Return a BloomFilter defined by `col_path`.
+  ///
+  /// * If the col_path has a bloom filter, create a BloomFilter in
+  ///  `row_group_bloom_filters_` and return.
+  /// * Otherwise, return nullptr.
+  BloomFilter* GetOrCreateBloomFilter(

Review Comment:
   should return be const?  please document the ownership contract (does caller take ownership or is retained by this class)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208107777


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.

Review Comment:
   its not clear what offset and length mean here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1202629579


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2299,12 +2319,112 @@ Status TypedColumnWriterImpl<FLBAType>::WriteArrowDense(
   return Status::OK();
 }
 
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i));
+    }
+  }
+}
+
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilterSpaced(const T* values,
+                                                           int64_t num_values,
+                                                           const uint8_t* valid_bits,
+                                                           int64_t valid_bits_offset) {
+  if (bloom_filter_) {
+    ::arrow::internal::VisitSetBitRunsVoid(
+        valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+          for (int64_t i = 0; i < length; i++) {
+            bloom_filter_->InsertHash(bloom_filter_->Hash(values + i + position));
+          }
+        });
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i, descr_->type_length()));
+    }
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<FLBAType>::UpdateBloomFilterSpaced(const FLBA* values,
+                                                              int64_t num_values,
+                                                              const uint8_t* valid_bits,
+                                                              int64_t valid_bits_offset) {
+  if (bloom_filter_) {
+    ::arrow::internal::VisitSetBitRunsVoid(
+        valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+          for (int64_t i = 0; i < length; i++) {
+            bloom_filter_->InsertHash(
+                bloom_filter_->Hash(values + i + position, descr_->type_length()));
+          }
+        });
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  DCHECK(bloom_filter_ == nullptr);
+}
+
+template <>
+void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {
+  DCHECK(bloom_filter_ == nullptr);
+}
+
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const ::arrow::Array& values) {
+  ParquetException::NYI("UpdateBloomFilter via arrow::Array only supports BYTE_ARRAY");

Review Comment:
   Why? It should be easy to implement for simple primitive types such as ints and floats...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1306443309


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {

Review Comment:
   Sure, we can reuse same structure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1202627334


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2299,12 +2319,112 @@ Status TypedColumnWriterImpl<FLBAType>::WriteArrowDense(
   return Status::OK();
 }
 
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i));

Review Comment:
   This will be very slow in any case, because there is a double virtual indirection (first `bloom_filter_->Hash`, then `hasher_->Hash`).
   
   I think you want to add batch methods, such as `BloomFilter::InsertHashes(const double* values, int64_t num_values)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208109291


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.
+  void AddBloomFilter(int32_t column_id, int64_t offset, int32_t length);
+
+  /// Get bloom filter offsets of a specific row group.
+  bool GetBloomFilterOffsets(size_t row_group_ordinal,
+                             const std::map<int32_t, Reference>** out) const;
+
+  bool empty() const { return references_.empty(); }
+
+ private:
+  std::vector<std::map<int32_t, Reference>> references_;
+};
+
+class InternalFileEncryptor;
+
+namespace schema {
+class ColumnPath;
+}
+
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  explicit BloomFilterWriter(const WriterProperties& properties)
+      : properties_(properties) {}
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Return a BloomFilter defined by `col_path`.
+  ///
+  /// * If the col_path has a bloom filter, create a BloomFilter in
+  ///  `row_group_bloom_filters_` and return.
+  /// * Otherwise, return nullptr.
+  BloomFilter* GetOrCreateBloomFilter(
+      const std::shared_ptr<schema::ColumnPath>& col_path,
+      const std::function<std::unique_ptr<BloomFilter>()>& bf_creator);
+
+  /// Serialize all bloom filters with header and bitset in the order of row group and
+  /// column id. Column encryption is not implemented yet. The side effect is that it
+  /// deletes all bloom filters after they have been flushed.
+  void WriteTo(::arrow::io::OutputStream* sink, const SchemaDescriptor* schema,

Review Comment:
   const SchemaDescriptr& unless schema is optional (int which case behavior should be documentated).
   
   Typically output parameters should be the last arguments (e.g. sink).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35691: GH-34785: [C++][Parquet] Initialize framework for Parquet bloom filter writer

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35691:
URL: https://github.com/apache/arrow/pull/35691#issuecomment-1555928777

   Thanks for working on this! I will take a look next week after returning from vacation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1203295958


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2299,12 +2319,112 @@ Status TypedColumnWriterImpl<FLBAType>::WriteArrowDense(
   return Status::OK();
 }
 
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i));
+    }
+  }
+}
+
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilterSpaced(const T* values,
+                                                           int64_t num_values,
+                                                           const uint8_t* valid_bits,
+                                                           int64_t valid_bits_offset) {
+  if (bloom_filter_) {
+    ::arrow::internal::VisitSetBitRunsVoid(
+        valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+          for (int64_t i = 0; i < length; i++) {
+            bloom_filter_->InsertHash(bloom_filter_->Hash(values + i + position));
+          }
+        });
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i, descr_->type_length()));
+    }
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<FLBAType>::UpdateBloomFilterSpaced(const FLBA* values,
+                                                              int64_t num_values,
+                                                              const uint8_t* valid_bits,
+                                                              int64_t valid_bits_offset) {
+  if (bloom_filter_) {
+    ::arrow::internal::VisitSetBitRunsVoid(
+        valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+          for (int64_t i = 0; i < length; i++) {
+            bloom_filter_->InsertHash(
+                bloom_filter_->Hash(values + i + position, descr_->type_length()));
+          }
+        });
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  DCHECK(bloom_filter_ == nullptr);
+}
+
+template <>
+void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {
+  DCHECK(bloom_filter_ == nullptr);
+}
+
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const ::arrow::Array& values) {
+  ParquetException::NYI("UpdateBloomFilter via arrow::Array only supports BYTE_ARRAY");

Review Comment:
   Yeah I'll finish it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35691:
URL: https://github.com/apache/arrow/pull/35691#issuecomment-1559803786

   @mapleFU I don't have time for a comprehensive review right now, I just took a very quick look. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35691:
URL: https://github.com/apache/arrow/pull/35691#issuecomment-1560667535

   @pitrou I move batch exec in a separated patch: https://github.com/apache/arrow/pull/35731


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35691: GH-34785: [C++][Parquet] Initialize framework for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35691:
URL: https://github.com/apache/arrow/pull/35691#issuecomment-1554996999

   Some considerations:
   
   1. Bool will not written bloom filter. Standard and Java code allow float/doubel writing bloom filter, so I allow them, but I think they're bad idea.
   2. Parquet-mr support not writting bloom filter if Dictionary is set. I think when dictionary and bloom filter are different things, so, I allow writting bloom filter when dictionary is enabled.
   3. Written values in bloom filter need to write non-null values, so I put them into `ColumnWriterImpl`, hope it would not be ugly.
   
   @pitrou @wjones127 @wgtmac @emkornfield Mind take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1306377289


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -335,6 +354,7 @@ class FileSerializer : public ParquetFileWriter::Contents {
       row_group_writer_.reset();
 
       WritePageIndex();
+      WriteBloomFilter();

Review Comment:
   May I ask that why bloom filter data should before PageIndex?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1203293016


##########
cpp/src/parquet/bloom_filter_writer.cc:
##########
@@ -0,0 +1,89 @@
+#include "bloom_filter_writer.h"

Review Comment:
   ```suggestion
   #include "parquet/bloom_filter_writer.h"
   ```



##########
cpp/src/parquet/file_writer.cc:
##########
@@ -290,14 +294,29 @@ class RowGroupSerializer : public RowGroupWriter::Contents {
       auto oi_builder = page_index_builder_ && properties_->page_index_enabled(path)
                             ? page_index_builder_->GetOffsetIndexBuilder(column_ordinal)
                             : nullptr;
+      auto bloom_filter =
+          bloom_filter_writer_ && properties_->bloom_filter_enabled(path)
+              ? bloom_filter_writer_->GetOrCreateBloomFilter(
+                    path,
+                    [&]() {
+                      int64_t ndv = properties_->bloom_filter_ndv(path);
+
+                      auto bloom_filter = std::make_unique<BlockSplitBloomFilter>();
+                      // TODO(mwish): add configurable ndv
+                      bloom_filter->Init(BlockSplitBloomFilter::OptimalNumOfBytes(
+                          static_cast<uint32_t>(ndv),
+                          /*fpp=*/0.05));

Review Comment:
   `fpp` worth to be a config parameter.



##########
cpp/src/parquet/file_writer.cc:
##########
@@ -335,6 +354,7 @@ class FileSerializer : public ParquetFileWriter::Contents {
       row_group_writer_.reset();
 
       WritePageIndex();
+      WriteBloomFilter();

Review Comment:
   According to the [specs](https://github.com/apache/parquet-format/blob/master/BloomFilter.md),
   - The Bloom filter data can be stored before the page indexes after all row groups.
   - Or it can be stored between row groups
   
   So `WriteBloomFilter()` need to be placed above `WritePageIndex()` 



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2299,12 +2319,112 @@ Status TypedColumnWriterImpl<FLBAType>::WriteArrowDense(
   return Status::OK();
 }
 
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i));
+    }
+  }
+}
+
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilterSpaced(const T* values,
+                                                           int64_t num_values,
+                                                           const uint8_t* valid_bits,
+                                                           int64_t valid_bits_offset) {
+  if (bloom_filter_) {
+    ::arrow::internal::VisitSetBitRunsVoid(
+        valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+          for (int64_t i = 0; i < length; i++) {
+            bloom_filter_->InsertHash(bloom_filter_->Hash(values + i + position));
+          }
+        });
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<FLBAType>::UpdateBloomFilter(const FLBA* values,
+                                                        int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i, descr_->type_length()));
+    }
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<FLBAType>::UpdateBloomFilterSpaced(const FLBA* values,
+                                                              int64_t num_values,
+                                                              const uint8_t* valid_bits,
+                                                              int64_t valid_bits_offset) {
+  if (bloom_filter_) {
+    ::arrow::internal::VisitSetBitRunsVoid(
+        valid_bits, valid_bits_offset, num_values, [&](int64_t position, int64_t length) {
+          for (int64_t i = 0; i < length; i++) {
+            bloom_filter_->InsertHash(
+                bloom_filter_->Hash(values + i + position, descr_->type_length()));
+          }
+        });
+  }
+}
+
+template <>
+void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilter(const bool*, int64_t) {
+  DCHECK(bloom_filter_ == nullptr);
+}
+
+template <>
+void TypedColumnWriterImpl<BooleanType>::UpdateBloomFilterSpaced(const bool*, int64_t,
+                                                                 const uint8_t*,
+                                                                 int64_t) {
+  DCHECK(bloom_filter_ == nullptr);
+}
+
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const ::arrow::Array& values) {
+  ParquetException::NYI("UpdateBloomFilter via arrow::Array only supports BYTE_ARRAY");

Review Comment:
   I guess only BYTE_ARRAY type will issue this call.



##########
cpp/src/parquet/file_writer.cc:
##########
@@ -290,14 +294,29 @@ class RowGroupSerializer : public RowGroupWriter::Contents {
       auto oi_builder = page_index_builder_ && properties_->page_index_enabled(path)
                             ? page_index_builder_->GetOffsetIndexBuilder(column_ordinal)
                             : nullptr;
+      auto bloom_filter =
+          bloom_filter_writer_ && properties_->bloom_filter_enabled(path)
+              ? bloom_filter_writer_->GetOrCreateBloomFilter(
+                    path,
+                    [&]() {
+                      int64_t ndv = properties_->bloom_filter_ndv(path);
+
+                      auto bloom_filter = std::make_unique<BlockSplitBloomFilter>();
+                      // TODO(mwish): add configurable ndv
+                      bloom_filter->Init(BlockSplitBloomFilter::OptimalNumOfBytes(
+                          static_cast<uint32_t>(ndv),

Review Comment:
   So here we imply the ndv cannot exceed UINT32_MAX? Then we probably need to change its type in the writer properties.



##########
cpp/src/parquet/bloom_filter_writer.cc:
##########
@@ -0,0 +1,89 @@
+#include "bloom_filter_writer.h"
+
+#include "arrow/io/interfaces.h"
+
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/internal_file_encryptor.h"
+#include "parquet/exception.h"
+#include "parquet/properties.h"
+
+namespace parquet {
+
+void RowGroupBloomFilterReference::AppendRowGroup() { references_.emplace_back(); }
+
+void RowGroupBloomFilterReference::AddBloomFilter(int32_t column_id, int64_t offset,
+                                                  int32_t length) {
+  DCHECK(!references_.empty());
+  references_.back().emplace(column_id, Reference{offset, length});
+}
+
+bool RowGroupBloomFilterReference::GetBloomFilterOffsets(
+    size_t row_group_ordinal, const std::map<int32_t, Reference>** out) const {
+  if (row_group_ordinal < references_.size()) {
+    *out = &references_.at(row_group_ordinal);
+    return true;
+  }
+  *out = nullptr;
+  return false;
+}
+
+void BloomFilterWriter::AppendRowGroup() { row_group_bloom_filters_.emplace_back(); }
+
+void BloomFilterWriter::WriteTo(::arrow::io::OutputStream* sink,
+                                const SchemaDescriptor* schema,
+                                InternalFileEncryptor* encryptor) {
+  if (row_group_bloom_filters_.empty()) {
+    // Return quickly if there is no bloom filter
+    return;
+  }
+
+  for (const auto& row_group_bloom_filters : row_group_bloom_filters_) {
+    reference_.AppendRowGroup();
+
+    // the whole row group has no bloom filter
+    if (row_group_bloom_filters.empty()) {
+      continue;
+    }
+
+    // serialize bloom filter by ascending order of column id
+    for (int32_t column_id = 0; column_id < schema->num_columns(); ++column_id) {
+      const auto column_path = schema->Column(column_id)->path()->ToDotString();
+      auto meta_encryptor =
+          encryptor ? encryptor->GetColumnMetaEncryptor(column_path) : nullptr;
+      auto data_encryptor =
+          encryptor ? encryptor->GetColumnDataEncryptor(column_path) : nullptr;
+      if (meta_encryptor || data_encryptor) {
+        ParquetException::NYI("Bloom filter encryption is not implemented");
+      }
+
+      auto iter = row_group_bloom_filters.find(column_path);
+      if (iter != row_group_bloom_filters.cend()) {
+        PARQUET_ASSIGN_OR_THROW(int64_t offset, sink->Tell());
+        iter->second->WriteTo(sink);
+        PARQUET_ASSIGN_OR_THROW(int64_t pos, sink->Tell());
+        reference_.AddBloomFilter(column_id, offset, static_cast<int32_t>(pos - offset));
+      }
+    }
+  }
+
+  // release memory actively
+  row_group_bloom_filters_.clear();
+}
+
+const RowGroupBloomFilterReference& BloomFilterWriter::reference() const {
+  return reference_;
+}
+
+BloomFilter* BloomFilterWriter::GetOrCreateBloomFilter(

Review Comment:
   It seems that this function always serves `Create` but no `Get`.



##########
cpp/src/parquet/file_writer.cc:
##########
@@ -335,6 +354,7 @@ class FileSerializer : public ParquetFileWriter::Contents {
       row_group_writer_.reset();
 
       WritePageIndex();
+      WriteBloomFilter();

Review Comment:
   BTW, if they are stored between row groups, it would help relax memory pressure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1203295425


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2299,12 +2319,112 @@ Status TypedColumnWriterImpl<FLBAType>::WriteArrowDense(
   return Status::OK();
 }
 
+template <typename DType>
+void TypedColumnWriterImpl<DType>::UpdateBloomFilter(const T* values,
+                                                     int64_t num_values) {
+  if (bloom_filter_) {
+    for (int64_t i = 0; i < num_values; ++i) {
+      bloom_filter_->InsertHash(bloom_filter_->Hash(values + i));

Review Comment:
   Emmmm, I'll try to seperate a patch for batch methods



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208109405


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.
+  void AddBloomFilter(int32_t column_id, int64_t offset, int32_t length);
+
+  /// Get bloom filter offsets of a specific row group.
+  bool GetBloomFilterOffsets(size_t row_group_ordinal,
+                             const std::map<int32_t, Reference>** out) const;
+
+  bool empty() const { return references_.empty(); }
+
+ private:
+  std::vector<std::map<int32_t, Reference>> references_;
+};
+
+class InternalFileEncryptor;
+
+namespace schema {
+class ColumnPath;
+}
+
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  explicit BloomFilterWriter(const WriterProperties& properties)
+      : properties_(properties) {}
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Return a BloomFilter defined by `col_path`.
+  ///
+  /// * If the col_path has a bloom filter, create a BloomFilter in
+  ///  `row_group_bloom_filters_` and return.
+  /// * Otherwise, return nullptr.
+  BloomFilter* GetOrCreateBloomFilter(
+      const std::shared_ptr<schema::ColumnPath>& col_path,
+      const std::function<std::unique_ptr<BloomFilter>()>& bf_creator);
+
+  /// Serialize all bloom filters with header and bitset in the order of row group and
+  /// column id. Column encryption is not implemented yet. The side effect is that it
+  /// deletes all bloom filters after they have been flushed.
+  void WriteTo(::arrow::io::OutputStream* sink, const SchemaDescriptor* schema,
+               InternalFileEncryptor* encryptor = nullptr);
+
+  /// Return reference to serialized bloom filters. It is undefined behavior until
+  /// WriteTo() has been called.
+  const RowGroupBloomFilterReference& reference() const;
+
+ private:
+  const WriterProperties& properties_;
+
+  std::vector<std::map<std::string, std::unique_ptr<BloomFilter>>>

Review Comment:
   are BloomFilter value types (i.e. should this be a unique ptr  or plain old object.
   
   please document what std::string means in the map.  and what each element in the vector represents.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208108924


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.
+  void AddBloomFilter(int32_t column_id, int64_t offset, int32_t length);
+
+  /// Get bloom filter offsets of a specific row group.
+  bool GetBloomFilterOffsets(size_t row_group_ordinal,
+                             const std::map<int32_t, Reference>** out) const;
+
+  bool empty() const { return references_.empty(); }
+
+ private:
+  std::vector<std::map<int32_t, Reference>> references_;
+};
+
+class InternalFileEncryptor;
+
+namespace schema {
+class ColumnPath;
+}
+
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  explicit BloomFilterWriter(const WriterProperties& properties)
+      : properties_(properties) {}
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Return a BloomFilter defined by `col_path`.
+  ///
+  /// * If the col_path has a bloom filter, create a BloomFilter in
+  ///  `row_group_bloom_filters_` and return.
+  /// * Otherwise, return nullptr.
+  BloomFilter* GetOrCreateBloomFilter(
+      const std::shared_ptr<schema::ColumnPath>& col_path,

Review Comment:
   prefer `const schema::ColumnPath&` unless this is optional.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208107579


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -101,6 +101,12 @@ class PARQUET_EXPORT BloomFilter {
   /// @return hash result.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  // Variant of const pointer argument to facilitate template

Review Comment:
   instead of doing these, could we make const-ref parameters for the pointer types?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1306446774


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.
+  void AddBloomFilter(int32_t column_id, int64_t offset, int32_t length);
+
+  /// Get bloom filter offsets of a specific row group.
+  bool GetBloomFilterOffsets(size_t row_group_ordinal,
+                             const std::map<int32_t, Reference>** out) const;
+
+  bool empty() const { return references_.empty(); }
+
+ private:
+  std::vector<std::map<int32_t, Reference>> references_;
+};
+
+class InternalFileEncryptor;
+
+namespace schema {
+class ColumnPath;
+}
+
+class PARQUET_EXPORT BloomFilterWriter {
+ public:
+  explicit BloomFilterWriter(const WriterProperties& properties)
+      : properties_(properties) {}
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Return a BloomFilter defined by `col_path`.
+  ///
+  /// * If the col_path has a bloom filter, create a BloomFilter in
+  ///  `row_group_bloom_filters_` and return.
+  /// * Otherwise, return nullptr.
+  BloomFilter* GetOrCreateBloomFilter(
+      const std::shared_ptr<schema::ColumnPath>& col_path,

Review Comment:
   I've changed to `column_ordinal` for leaf columns



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35691:
URL: https://github.com/apache/arrow/pull/35691#issuecomment-1559639921

   @pitrou Would you mind take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35691: GH-34785: [C++][Parquet] Initialize framework for Parquet bloom filter writer

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35691:
URL: https://github.com/apache/arrow/pull/35691#issuecomment-1554994800

   * Closes: #34785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU closed pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU closed pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer
URL: https://github.com/apache/arrow/pull/35691


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1306443284


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -101,6 +101,12 @@ class PARQUET_EXPORT BloomFilter {
   /// @return hash result.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  // Variant of const pointer argument to facilitate template

Review Comment:
   It means that we wrap it in a template or "PhysicalType"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] emkornfield commented on a diff in pull request #35691: GH-34785: [C++][Parquet] Skeleton for Parquet bloom filter writer

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #35691:
URL: https://github.com/apache/arrow/pull/35691#discussion_r1208107846


##########
cpp/src/parquet/bloom_filter_writer.h:
##########
@@ -0,0 +1,74 @@
+#pragma once
+
+#include "bloom_filter.h"
+
+#include <map>
+#include <vector>
+
+namespace parquet {
+
+class PARQUET_EXPORT RowGroupBloomFilterReference {
+ public:
+  struct Reference {
+    int64_t offset;
+    int32_t length;
+  };
+
+  /// Append a new row group to host all incoming bloom filters.
+  void AppendRowGroup();
+
+  /// Add reference to the serialized bloom filter.
+  void AddBloomFilter(int32_t column_id, int64_t offset, int32_t length);
+
+  /// Get bloom filter offsets of a specific row group.
+  bool GetBloomFilterOffsets(size_t row_group_ordinal,
+                             const std::map<int32_t, Reference>** out) const;

Review Comment:
   why the double pointer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org