You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/05/24 03:28:38 UTC

[GitHub] [arrow] mapleFU opened a new pull request, #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

mapleFU opened a new pull request, #35731:
URL: https://github.com/apache/arrow/pull/35731

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   For optimizing https://github.com/apache/arrow/pull/35691 . We need batch execution for BloomFilter
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Add `InsertHashes` and `Hashes` for BloomFilter
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   * [ ] Will do
   
   ### Are there any user-facing changes?
   
   User can use batch interface here.
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561438468

   > Seems "C++ / AMD64 Conda C++ (pull_request) Failing after 34m" is unrelated?
   
   It is, you can freely ignore it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1206777585


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t exists = values[0];
+  for (auto _ : state) {
+    bool found = filter->FindHash(exists);
+    ::benchmark::DoNotOptimize(found);
+  }
+  state.SetItemsProcessed(state.iterations());
+}
+
+static void BM_FindNotExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t unexist = values.back();

Review Comment:
   I use different seed now, though it may generate same value



##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t exists = values[0];

Review Comment:
   Changed 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561001878

   @pitrou I've add the tests, I extract `GenerateData` from `encoding_test.cc` and use it to generate all kinds of data. Mind take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204487058


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);

Review Comment:
   Yes, you could write a simple generator here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203908023


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < 32; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(32);
+  hasher_seed_0->Hashes(byte_array_vector.data(),
+                        static_cast<int>(byte_array_vector.size()), hashes.data());
+  for (int i = 0; i < 32; i++) {
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hashes[i])
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+template <typename DType>
+class TestBatchBloomFilter : public testing::Test {
+ public:
+  using T = typename DType::c_type;
+
+  void SetUp() {}
+
+  void TearDown() {}

Review Comment:
   Sure, I use them to try to fixing the compile failed :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205551484


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;

Review Comment:
   Emmm should I use interleave values like `kGenerateBenchmarkDataStringLength - 1, kGenerateBenchmarkDataStringLength + 1` ... Or using other ways? I guess making FLBA and ByteArray have different sum bytes size is unfair



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205583125


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }

Review Comment:
   Yes, you could indeed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203844074


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < 32; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(32);

Review Comment:
   ```suggestion
     hashes.resize(byte_array_vector.size());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203848528


##########
cpp/src/parquet/xxhasher.cc:
##########
@@ -27,6 +27,14 @@ template <typename T>
 uint64_t XxHashHelper(T value, uint32_t seed) {
   return XXH64(reinterpret_cast<const void*>(&value), sizeof(T), seed);
 }
+
+template <typename T>
+void XxHashesHelper(const T* value, uint32_t seed, int num_values, uint64_t* results) {
+  for (int i = 0; i < num_values; ++i) {
+    results[i] = XxHashHelper(value[i], seed);
+  }

Review Comment:
   Nit
   ```suggestion
   void XxHashesHelper(const T* values, uint32_t seed, int num_values, uint64_t* results) {
     for (int i = 0; i < num_values; ++i) {
       results[i] = XxHashHelper(values[i], seed);
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204439359


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);

Review Comment:
   Please avoid magic numbers and use constants instead.
   ```suggestion
     constexpr int kNumValues = 1024;
     std::vector<T> values(1024);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203501239


##########
cpp/src/parquet/hasher.h:
##########
@@ -66,6 +66,58 @@ class Hasher {
   /// @param len the value length.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  /// Batch compute hashes for 32 bits values by using its plain encoding result.
+  ///
+  /// @param values the value to hash.
+  /// @param num_values the number of values to hash.
+  /// @param hashes the output hash value, it length should be equal to num_values.
+  virtual void Hashes(const int32_t* values, int num_values, uint64_t* hashes) const = 0;

Review Comment:
   I don't know ... They're all ok for me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1564361501

   Using 1mib as a insert batch is really too large:
   
   ```
   BM_BatchComputeHash<ByteArrayType>   85150536 ns     85024375 ns            8 items_per_second=12.3327M/s
   BM_BatchComputeHash<FLBAType>        78013056 ns     77946556 ns            9 items_per_second=13.4525M/s
   BM_BatchComputeHash<Int96Type>       95809131 ns     95748143 ns            7 items_per_second=10.9514M/s
   BM_InsertHash                       910213833 ns    904038000 ns            1 items_per_second=1.15988M/s
   BM_BatchInsertHash                  870639958 ns    866905000 ns            1 items_per_second=1.20956M/s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205554310


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;

Review Comment:
   Good point. We may keep things like that as it's probably good enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1563106441

   @mapleFU No problem, and thanks for doing this :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561004727

   On my MacOS with Release(O2):
   
   ```
   ----------------------------------------------------------------------------------------
   Benchmark                              Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------------
   BM_CountHash<Int32Type>             7318 ns         7309 ns        88867 items_per_second=140.096M/s
   BM_CountHash<Int64Type>             6120 ns         6099 ns       112450 items_per_second=167.886M/s
   BM_CountHash<FloatType>             6574 ns         6507 ns       104999 items_per_second=157.375M/s
   BM_CountHash<DoubleType>            6732 ns         6726 ns       103284 items_per_second=152.254M/s
   BM_BatchCountHash<Int32Type>        2854 ns         2850 ns       245992 items_per_second=359.34M/s
   BM_BatchCountHash<Int64Type>        2640 ns         2639 ns       265086 items_per_second=388.007M/s
   BM_BatchCountHash<FloatType>        3151 ns         3149 ns       226064 items_per_second=325.222M/s
   BM_BatchCountHash<DoubleType>       2993 ns         2987 ns       235524 items_per_second=342.78M/s
   BM_InsertHash                      22694 ns        22676 ns        31097 items_per_second=45.1588M/s
   BM_InsertHash                      23941 ns        23438 ns        31044 items_per_second=43.6899M/s
   BM_BatchInsertHash                 23246 ns        23079 ns        30079 items_per_second=44.3696M/s
   BM_BatchInsertHash                 23125 ns        23057 ns        30729 items_per_second=44.4115M/s
   BM_FindHash                        17945 ns        17734 ns        39837 items_per_second=57.7415M/s
   BM_FindHash                        17298 ns        17284 ns        39557 items_per_second=59.2463M/s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204466493


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), test::kGenerateDataFLBALength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->FindHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+BENCHMARK_TEMPLATE(BM_CountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_CountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_CountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_CountHash, DoubleType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, DoubleType);

Review Comment:
   Sure, will add it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1569778401

   Benchmark runs are scheduled for baseline = 9dc63c247efc587277bccaa01a44e815996a2a8e and contender = 26952930e0a1525814715190a5ef399ec0c9e383. 26952930e0a1525814715190a5ef399ec0c9e383 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f3680a5babca412b81700d7aaaada08d...b26b742163d04594add15dd80d8711ba/)
   [Failed :arrow_down:1.39% :arrow_up:0.08%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/41a4e41b34924debbd08ef1a5a2041e9...a8c0ce5564ed4ac3a176f27e1d9cf81a/)
   [Finished :arrow_down:1.62% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/74834775ba714d7c92c88ce31ee25aed...ce052d62674e4d198262f2fd9490e5b9/)
   [Finished :arrow_down:1.26% :arrow_up:0.54%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1570eb7fdaa74825bc26cb07cfb83736...99975a630ce14aa79f9470fd63b9fc1d/)
   Buildkite builds:
   [Finished] [`26952930` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2951)
   [Failed] [`26952930` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2987)
   [Finished] [`26952930` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2952)
   [Finished] [`26952930` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2977)
   [Finished] [`9dc63c24` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2950)
   [Failed] [`9dc63c24` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2986)
   [Finished] [`9dc63c24` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2951)
   [Finished] [`9dc63c24` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2976)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1570186034

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/74834775ba714d7c92c88ce31ee25aed...ce052d62674e4d198262f2fd9490e5b9/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205544962


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;

Review Comment:
   It's ok, but I guess keep size same is no problem



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560515261

   @wgtmac test added and some comment fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203894750


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   Ok, I will submit a better solution here. But why calling `InsertHash` many times is not ok? I change this filter to "final", so, compiler will try it best to optimizing `InsertHash` without calling virtual function



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561309541

   ```
   104/105 Test #168: gandiva-projector-test-static .............   Passed   12.29 sec
   Errors while running CTest
   105/105 Test #124: arrow-flight-transport-ucx-test ...........***Timeout 300.09 sec
   Running arrow-flight-transport-ucx-test, redirecting output into /build/cpp/build/test-logs/arrow-flight-transport-ucx-test.txt (attempt 1/1)
   ```
   
   Seems "C++ / AMD64 Conda C++ (pull_request) Failing after 34m" is unrelated?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204448120


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), test::kGenerateDataFLBALength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->FindHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+BENCHMARK_TEMPLATE(BM_CountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_CountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_CountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_CountHash, DoubleType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, DoubleType);
+
+BENCHMARK(BM_InsertHash);
+BENCHMARK(BM_InsertHash);
+BENCHMARK(BM_BatchInsertHash);
+BENCHMARK(BM_BatchInsertHash);
+BENCHMARK(BM_FindHash);
+BENCHMARK(BM_FindHash);

Review Comment:
   Why are all these benchmarks declared twice?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561616037

   ```
   BM_ComputeHash<Int32Type>                8222 ns         8222 ns        72962 items_per_second=124.547M/s
   BM_ComputeHash<Int64Type>                7993 ns         7991 ns        87555 items_per_second=128.15M/s
   BM_ComputeHash<FloatType>                8476 ns         8474 ns        82384 items_per_second=120.84M/s
   BM_ComputeHash<DoubleType>               8122 ns         8116 ns        86244 items_per_second=126.172M/s
   BM_ComputeHash<ByteArrayType>            8884 ns         8865 ns        78521 items_per_second=115.508M/s
   BM_ComputeHash<FLBAType>                 8504 ns         8501 ns        82276 items_per_second=120.46M/s
   BM_ComputeHash<Int96Type>                7174 ns         7172 ns        97079 items_per_second=142.772M/s
   BM_BatchComputeHash<Int32Type>           3321 ns         3321 ns       210445 items_per_second=308.358M/s
   BM_BatchComputeHash<Int64Type>           3127 ns         3126 ns       223877 items_per_second=327.561M/s
   BM_BatchComputeHash<FloatType>           3691 ns         3690 ns       189074 items_per_second=277.471M/s
   BM_BatchComputeHash<DoubleType>          3553 ns         3507 ns       200817 items_per_second=291.967M/s
   BM_BatchComputeHash<ByteArrayType>       6974 ns         6973 ns       100914 items_per_second=146.845M/s
   BM_BatchComputeHash<FLBAType>            6634 ns         6629 ns       104468 items_per_second=154.461M/s
   BM_BatchComputeHash<Int96Type>           4626 ns         4615 ns       151891 items_per_second=221.896M/s
   BM_InsertHash                           28646 ns        27329 ns        25853 items_per_second=37.4692M/s
   BM_BatchInsertHash                      26240 ns        26228 ns        26426 items_per_second=39.0426M/s
   BM_FindExistsHash                        20.1 ns         20.1 ns     34813920 items_per_second=49.8169M/s
   BM_FindNotExistsHash                     21.2 ns         21.2 ns     33347307 items_per_second=47.2693M/s
   ```
   
   I've implement the random generator and fix the comment, here is the new benchmark result.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203918838


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   Good point. By writing benchmarks you'll be able to evaluate if the two possible implementations show different performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203963500


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < 32; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(32);
+  hasher_seed_0->Hashes(byte_array_vector.data(),
+                        static_cast<int>(byte_array_vector.size()), hashes.data());
+  for (int i = 0; i < 32; i++) {
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hashes[i])
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+template <typename DType>
+class TestBatchBloomFilter : public testing::Test {
+ public:
+  using T = typename DType::c_type;

Review Comment:
   Thanks! This works!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205528270


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;

Review Comment:
   Can we vary the string length somehow?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205546509


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;

Review Comment:
   Probably not, though it may makes things easier for the CPU.



##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;

Review Comment:
   Probably not, though it may make things easier for the CPU.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1568743429

   Nice catch!
   
   Personally I guess this bloom filter would not be the bottleneck, I guess read whole bloom filter would be much more expensive, but improve the performance would be great!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35731:
URL: https://github.com/apache/arrow/pull/35731


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204465863


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;

Review Comment:
   Hopefully :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561007543

   You should benchmark `BloomFilter` as that's the public interface that will be called.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204442137


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;

Review Comment:
   The compiler will probably know that `filter` is really a `BlockSplitBloomFilter`? I'm not sure how to avoid this issue, though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204440079


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {

Review Comment:
   Not sure why it's called "count hash"? You could call it "BM_ComputeHash", for example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205532275


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {

Review Comment:
   Nit: style
   ```suggestion
   std::unique_ptr<BloomFilter> CreateBloomFilter(uint32_t element_size) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203849082


##########
cpp/src/parquet/hasher.h:
##########
@@ -66,6 +66,58 @@ class Hasher {
   /// @param len the value length.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  /// Batch compute hashes for 32 bits values by using its plain encoding result.
+  ///
+  /// @param values the value to hash.

Review Comment:
   ```suggestion
     /// @param values a pointer to the values to hash.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204495299


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), test::kGenerateDataFLBALength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->FindHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+BENCHMARK_TEMPLATE(BM_CountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_CountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_CountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_CountHash, DoubleType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, DoubleType);
+
+BENCHMARK(BM_InsertHash);
+BENCHMARK(BM_InsertHash);
+BENCHMARK(BM_BatchInsertHash);
+BENCHMARK(BM_BatchInsertHash);
+BENCHMARK(BM_FindHash);
+BENCHMARK(BM_FindHash);

Review Comment:
   Sorry, previously I mark them as typed test, but I found that they don't need type



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560769484

   I've no idea why windows CI failed on `TYPED_TEST`...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203475907


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   Emmm I guess I should mark `BlockSplitBloomFilter` as final to avoid virtual call



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560972875

   @pitrou I got a wired problem, should I use `BloomFilter` or `BlockSplitBloomFilter` in benchmark?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204448560


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), test::kGenerateDataFLBALength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->FindHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+BENCHMARK_TEMPLATE(BM_CountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_CountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_CountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_CountHash, DoubleType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int32Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, Int64Type);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, FloatType);
+BENCHMARK_TEMPLATE(BM_BatchCountHash, DoubleType);

Review Comment:
   Can we also declare benchmarks with ByteArray and FixedLenByteArray types?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205540517


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }

Review Comment:
   Several things:
   * I think we should create a new bloom filter for each benchmark iteration
   * `ClobberMemory` should be moved after the inner loop, as it acts as a memory barrier and could pessimize performance here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205548499


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;

Review Comment:
   Actually, `DEFAULT_MAX_ROW_GROUP_LENGTH` is `1024 * 1024` currently.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205566281


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;

Review Comment:
   Emmm should I use 1mib rows here? Seems it would be extremly huge



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205531978


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;

Review Comment:
   I'm curious, is this a typical bloom filter size? Given that bloom filters are grouped by row group, I would perhaps expect a larger value? (row groups are typically quite large)
   



##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {

Review Comment:
   Nit: style
   ```suggestion
   std::unique_ptr<BloomFilter> CreateBloomFilter(uint32_t elementSize) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203907116


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   Yes, but now `InsertHashes` is a member of `BlockSplitBloomFilter`, not implemented in `BloomFilter`(base class)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204494590


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);

Review Comment:
   Is there any tools or reference I can use? Maybe I can avoid writing some duplicate generate data code here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560408891

   :warning: GitHub issue #35729 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203844892


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};

Review Comment:
   Please use some kind of const instead of magic numbers.
   ```suggestion
     constexpr int kNumValues = 32;
     uint8_t bytes[kNumValues] = {};
   ```



##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};

Review Comment:
   Please use some kind of constant instead of magic numbers.
   ```suggestion
     constexpr int kNumValues = 32;
     uint8_t bytes[kNumValues] = {};
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203864108


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < 32; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(32);
+  hasher_seed_0->Hashes(byte_array_vector.data(),
+                        static_cast<int>(byte_array_vector.size()), hashes.data());
+  for (int i = 0; i < 32; i++) {
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hashes[i])
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+template <typename DType>
+class TestBatchBloomFilter : public testing::Test {
+ public:
+  using T = typename DType::c_type;

Review Comment:
   Perhaps you need to name this something else than `T` if the MSVC compile error persists?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204443467


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));

Review Comment:
   Can you factor out bloom filter creation instead of copy-and-paste?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204438685


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);

Review Comment:
   I don't think the benchmarks should use `test::GenerateData`, because we may decide to change the generation algorithm when enhancing the test suite.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205544218


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t exists = values[0];

Review Comment:
   Finding always the same value is a bit optimistic IMHO, as it may hit a best case in terms of branch prediction and CPU cache locality. You could insert iterate over `hashes`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1562982716

   I'm a bit tired tonight and I'll finish fixing the comments tomorrow :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204442683


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {

Review Comment:
   Why not use a foreach loop?
   ```suggestion
       for (const auto& value : values) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560885457

   Sure, I'll add benchmark for them!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204453631


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -323,17 +324,113 @@ const int64_t HASHES_OF_LOOPING_BYTES_WITH_SEED_0[32] = {
  * }
  */
 TEST(XxHashTest, TestBloomFilter) {
-  uint8_t bytes[32] = {};
+  constexpr int kNumValues = 32;
+  uint8_t bytes[kNumValues] = {};
 
-  for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+  for (int i = 0; i < kNumValues; i++) {
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  constexpr int kNumValues = 32;
+  uint8_t bytes[kNumValues] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < kNumValues; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(kNumValues);
+  hasher_seed_0->Hashes(byte_array_vector.data(),
+                        static_cast<int>(byte_array_vector.size()), hashes.data());
+  for (int i = 0; i < kNumValues; i++) {
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hashes[i])
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+template <typename DType>
+class TestBatchBloomFilter : public testing::Test {
+ public:
+  constexpr static int kTestDataSize = 64;
+
+  // GenerateTestData with size 64.
+  std::vector<typename DType::c_type> GenerateTestData();
+
+  // The Lifetime owner for Test data
+  std::vector<uint8_t> members;
+};
+
+template <typename DType>
+std::vector<typename DType::c_type> TestBatchBloomFilter<DType>::GenerateTestData() {
+  std::vector<typename DType::c_type> values(kTestDataSize);
+  GenerateData(kTestDataSize, values.data(), &members);
+  return values;
+}
+
+// Note: BloomFilter doesn't support BooleanType.
+using BloomFilterTestTypes = ::testing::Types<Int32Type, Int64Type, FloatType, DoubleType,
+                                              Int96Type, FLBAType, ByteArrayType>;
+
+TYPED_TEST_SUITE(TestBatchBloomFilter, BloomFilterTestTypes);
+
+TYPED_TEST(TestBatchBloomFilter, Basic) {
+  using Type = typename TypeParam::c_type;
+  std::vector<Type> test_data = TestFixture::GenerateTestData();
+  BlockSplitBloomFilter batch_insert_filter;
+  BlockSplitBloomFilter filter;
+
+  // Bloom filter fpp parameter
+  const double fpp = 0.05;
+  filter.Init(BlockSplitBloomFilter::OptimalNumOfBytes(TestFixture::kTestDataSize, fpp));
+  batch_insert_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(TestFixture::kTestDataSize, fpp));
+
+  std::vector<uint64_t> hashes;
+  for (int i = 0; i < static_cast<int>(test_data.size()); ++i) {

Review Comment:
   Why not a foreach loop?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203478771


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,11 +326,31 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+// Same as TestBloomFilter but using Batch interface

Review Comment:
   I'm adding it, wait a minute



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203469606


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -49,6 +49,11 @@ class PARQUET_EXPORT BloomFilter {
   /// @param hash the hash of value to insert into Bloom filter.
   virtual void InsertHash(uint64_t hash) = 0;
 
+  /// Insert element to set represented by Bloom filter bitset.

Review Comment:
   ```suggestion
     /// Insert elements to set represented by Bloom filter bitset.
   ```



##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,11 +326,31 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+// Same as TestBloomFilter but using Batch interface

Review Comment:
   Would be good to cover different types, though it is missing in the original test...



##########
cpp/src/parquet/hasher.h:
##########
@@ -66,6 +66,58 @@ class Hasher {
   /// @param len the value length.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  /// Batch compute hashes for 32 bits values by using its plain encoding result.
+  ///
+  /// @param values the value to hash.
+  /// @param num_values the number of values to hash.
+  /// @param hashes the output hash value, it length should be equal to num_values.
+  virtual void Hashes(const int32_t* values, int num_values, uint64_t* hashes) const = 0;

Review Comment:
   Is `BatchHash` a better name?



##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -49,6 +49,11 @@ class PARQUET_EXPORT BloomFilter {
   /// @param hash the hash of value to insert into Bloom filter.
   virtual void InsertHash(uint64_t hash) = 0;
 
+  /// Insert element to set represented by Bloom filter bitset.
+  /// @param hashes the hash of value to insert into Bloom filter.

Review Comment:
   ```suggestion
     /// @param hashes the hash values to insert into Bloom filter.
   ```



##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   Implement a `InsertHashInternal` to avoid multiple virtual calls here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203845995


##########
cpp/src/parquet/hasher.h:
##########
@@ -66,6 +66,58 @@ class Hasher {
   /// @param len the value length.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  /// Batch compute hashes for 32 bits values by using its plain encoding result.
+  ///
+  /// @param values the value to hash.
+  /// @param num_values the number of values to hash.
+  /// @param hashes the output hash value, it length should be equal to num_values.
+  virtual void Hashes(const int32_t* values, int num_values, uint64_t* hashes) const = 0;

Review Comment:
   `Hashes` is ok to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204108319


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   I change it to calling a non-virtual function, but performance is same



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204487770


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);

Review Comment:
   This is desired here because bloom filter performance can depend on the input (at least the `FindHash` function).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205572331


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }

Review Comment:
   Okay, Seems that I should create BF and then start the clock like:
   
   ```c++
     for (auto _ : state) {
       state.PauseTiming();
       data = CreateBloomFilterAndInsertData();
       state.ResumeTiming();
       for (int j = 0; j < state.range(1); ++j)
         ...
     }
   ```
   
   ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1568731905

   Thanks @mapleFU . I've pushed some changes to improve some benchmarks and also to improve `BloomFilter::Find` performance (around 2x faster).
   
   Current benchmarks here:
   ````
   ---------------------------------------------------------------------------------------------
   Benchmark                                   Time             CPU   Iterations UserCounters...
   ---------------------------------------------------------------------------------------------
   BM_ComputeHash<Int32Type>              715521 ns       715427 ns          951 items_per_second=22.901M/s
   BM_ComputeHash<Int64Type>              740955 ns       740872 ns          878 items_per_second=22.1145M/s
   BM_ComputeHash<FloatType>              720903 ns       720789 ns          946 items_per_second=22.7306M/s
   BM_ComputeHash<DoubleType>             691569 ns       691479 ns         1005 items_per_second=23.6941M/s
   BM_ComputeHash<ByteArrayType>          706878 ns       706770 ns          986 items_per_second=23.1815M/s
   BM_ComputeHash<FLBAType>               762304 ns       762178 ns          941 items_per_second=21.4963M/s
   BM_ComputeHash<Int96Type>              892253 ns       892111 ns          809 items_per_second=18.3654M/s
   BM_BatchComputeHash<Int32Type>         250193 ns       250154 ns         2858 items_per_second=65.4957M/s
   BM_BatchComputeHash<Int64Type>         277338 ns       277289 ns         2518 items_per_second=59.0863M/s
   BM_BatchComputeHash<FloatType>         247188 ns       247143 ns         2832 items_per_second=66.2936M/s
   BM_BatchComputeHash<DoubleType>        278888 ns       278850 ns         2517 items_per_second=58.7557M/s
   BM_BatchComputeHash<ByteArrayType>     256035 ns       255996 ns         2740 items_per_second=64.0009M/s
   BM_BatchComputeHash<FLBAType>          252759 ns       252714 ns         2783 items_per_second=64.8323M/s
   BM_BatchComputeHash<Int96Type>         328592 ns       328540 ns         2119 items_per_second=49.8692M/s
   BM_InsertHash                          689503 ns       689394 ns         1010 items_per_second=23.7658M/s
   BM_BatchInsertHash                     496698 ns       496568 ns         1482 items_per_second=32.9945M/s
   BM_FindExistingHash                    622037 ns       621724 ns         1126 items_per_second=26.3525M/s
   BM_FindNonExistingHash                 658163 ns       657842 ns         1068 items_per_second=24.9057M/s
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560426048

   cc @pitrou @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560861947

   At some point it might also be nice to add some basic bloom filter benchmarks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205545197


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());

Review Comment:
   Again, why don't you factor this out?



##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t exists = values[0];
+  for (auto _ : state) {
+    bool found = filter->FindHash(exists);
+    ::benchmark::DoNotOptimize(found);
+  }
+  state.SetItemsProcessed(state.iterations());
+}
+
+static void BM_FindNotExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t unexist = values.back();

Review Comment:
   Same here. Instead, you can generate a different set of hash values from different data.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205569255


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;

Review Comment:
   I don't know, is it not a typical row group size? Let's see if it makes the benchmarks too slow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561746965

   I don't know why python CI failed https://github.com/apache/arrow/actions/runs/5071701605/jobs/9108424931?pr=35731 =_=
   
   I've fix the comments and give the benchmark result @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205527025


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);

Review Comment:
   Is this going to generate N times the same exact string?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205544371


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;

Review Comment:
   > I'm curious, is this a typical bloom filter size? Given that bloom filters are grouped by row group, I would perhaps expect a larger value? (row groups are typically quite large)
   
   It's an element size, maybe I can make it 10000, because our system default using 10000 as row-group stride?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205581463


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;

Review Comment:
   Okay



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1564390354

   I use 16K as batch size now, here are benchmark under MacOS, Release(O2):
   
   ```
   ---------------------------------------------------------------------------------------------
   Benchmark                                   Time             CPU   Iterations UserCounters...
   ---------------------------------------------------------------------------------------------
   BM_ComputeHash<Int32Type>              131529 ns       131272 ns         5351 items_per_second=124.809M/s
   BM_ComputeHash<Int64Type>              127332 ns       127222 ns         5498 items_per_second=128.783M/s
   BM_ComputeHash<FloatType>              136299 ns       136063 ns         5176 items_per_second=120.414M/s
   BM_ComputeHash<DoubleType>             130264 ns       129839 ns         5419 items_per_second=126.187M/s
   BM_ComputeHash<ByteArrayType>          141495 ns       141402 ns         4932 items_per_second=115.868M/s
   BM_ComputeHash<FLBAType>               135159 ns       135144 ns         5181 items_per_second=121.233M/s
   BM_ComputeHash<Int96Type>              114941 ns       114874 ns         6118 items_per_second=142.626M/s
   BM_BatchComputeHash<Int32Type>          52456 ns        52429 ns        13406 items_per_second=312.497M/s
   BM_BatchComputeHash<Int64Type>          49498 ns        49484 ns        14178 items_per_second=331.094M/s
   BM_BatchComputeHash<FloatType>          58427 ns        58413 ns        11917 items_per_second=280.486M/s
   BM_BatchComputeHash<DoubleType>         55505 ns        55502 ns        12662 items_per_second=295.195M/s
   BM_BatchComputeHash<ByteArrayType>     111223 ns       111172 ns         6284 items_per_second=147.375M/s
   BM_BatchComputeHash<FLBAType>          106343 ns       106319 ns         6621 items_per_second=154.103M/s
   BM_BatchComputeHash<Int96Type>          73442 ns        73427 ns         9546 items_per_second=223.134M/s
   BM_InsertHash                        22066188 ns     22038160 ns          100 items_per_second=743.438k/s
   BM_BatchInsertHash                   21561561 ns     21540280 ns          100 items_per_second=760.621k/s
   BM_FindExistsHash                      436261 ns       436161 ns         1594 items_per_second=2.29273k/s
   BM_FindNotExistsHash                   497617 ns       497551 ns         1421 items_per_second=2.00984k/s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204456036


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -323,17 +324,113 @@ const int64_t HASHES_OF_LOOPING_BYTES_WITH_SEED_0[32] = {
  * }
  */
 TEST(XxHashTest, TestBloomFilter) {
-  uint8_t bytes[32] = {};
+  constexpr int kNumValues = 32;
+  uint8_t bytes[kNumValues] = {};
 
-  for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+  for (int i = 0; i < kNumValues; i++) {
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  constexpr int kNumValues = 32;
+  uint8_t bytes[kNumValues] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < kNumValues; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(kNumValues);
+  hasher_seed_0->Hashes(byte_array_vector.data(),
+                        static_cast<int>(byte_array_vector.size()), hashes.data());
+  for (int i = 0; i < kNumValues; i++) {
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hashes[i])
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+template <typename DType>
+class TestBatchBloomFilter : public testing::Test {
+ public:
+  constexpr static int kTestDataSize = 64;
+
+  // GenerateTestData with size 64.
+  std::vector<typename DType::c_type> GenerateTestData();
+
+  // The Lifetime owner for Test data
+  std::vector<uint8_t> members;
+};
+
+template <typename DType>
+std::vector<typename DType::c_type> TestBatchBloomFilter<DType>::GenerateTestData() {
+  std::vector<typename DType::c_type> values(kTestDataSize);
+  GenerateData(kTestDataSize, values.data(), &members);
+  return values;
+}
+
+// Note: BloomFilter doesn't support BooleanType.
+using BloomFilterTestTypes = ::testing::Types<Int32Type, Int64Type, FloatType, DoubleType,
+                                              Int96Type, FLBAType, ByteArrayType>;
+
+TYPED_TEST_SUITE(TestBatchBloomFilter, BloomFilterTestTypes);
+
+TYPED_TEST(TestBatchBloomFilter, Basic) {
+  using Type = typename TypeParam::c_type;
+  std::vector<Type> test_data = TestFixture::GenerateTestData();
+  BlockSplitBloomFilter batch_insert_filter;
+  BlockSplitBloomFilter filter;
+
+  // Bloom filter fpp parameter
+  const double fpp = 0.05;
+  filter.Init(BlockSplitBloomFilter::OptimalNumOfBytes(TestFixture::kTestDataSize, fpp));
+  batch_insert_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(TestFixture::kTestDataSize, fpp));
+
+  std::vector<uint64_t> hashes;
+  for (int i = 0; i < static_cast<int>(test_data.size()); ++i) {
+    uint64_t hash = 0;
+    if constexpr (std::is_same_v<Type, FLBA>) {
+      hash = filter.Hash(&test_data[i], kGenerateDataFLBALength);
+    } else {
+      hash = filter.Hash(&test_data[i]);
+    }
+    hashes.push_back(hash);
+  }
+
+  std::vector<uint64_t> batch_hashes(test_data.size());
+  if constexpr (std::is_same_v<Type, FLBA>) {
+    batch_insert_filter.Hashes(test_data.data(), kGenerateDataFLBALength,
+                               static_cast<int>(test_data.size()), batch_hashes.data());
+  } else {
+    batch_insert_filter.Hashes(test_data.data(), static_cast<int>(test_data.size()),
+                               batch_hashes.data());
+  }
+
+  EXPECT_EQ(hashes, batch_hashes);
+
+  std::shared_ptr<Buffer> buffer;
+  std::shared_ptr<Buffer> batch_insert_buffer;
+  {
+    auto sink = CreateOutputStream();
+    filter.WriteTo(sink.get());
+    ASSERT_OK_AND_ASSIGN(buffer, sink->Finish());
+  }
+  {
+    auto sink = CreateOutputStream();
+    batch_insert_filter.WriteTo(sink.get());
+    ASSERT_OK_AND_ASSIGN(batch_insert_buffer, sink->Finish());
+  }
+  EXPECT_TRUE(buffer->Equals(*batch_insert_buffer));

Review Comment:
   Can use `AssertBufferEqual` for better error messages.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204445959


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), test::kGenerateDataFLBALength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }

Review Comment:
   Should make sure that the compiler doesn't optimize things away.
   ```suggestion
       }
       benchmark::ClobberMemory();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204461303


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;

Review Comment:
   Emm I apply a `DoNotOptimize` on filter, I guess it might works?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204477114


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);

Review Comment:
   So you means that benchmark should have stable output? Should I write a generator here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203842955


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,19 +257,52 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);
+    }
+  }
   void WriteTo(ArrowOutputStream* sink) const override;
   uint32_t GetBitsetSize() const override { return num_bytes_; }
 
+  uint64_t Hash(int32_t value) const override { return hasher_->Hash(value); }
   uint64_t Hash(int64_t value) const override { return hasher_->Hash(value); }
   uint64_t Hash(float value) const override { return hasher_->Hash(value); }
   uint64_t Hash(double value) const override { return hasher_->Hash(value); }
   uint64_t Hash(const Int96* value) const override { return hasher_->Hash(value); }
   uint64_t Hash(const ByteArray* value) const override { return hasher_->Hash(value); }
-  uint64_t Hash(int32_t value) const override { return hasher_->Hash(value); }
   uint64_t Hash(const FLBA* value, uint32_t len) const override {
     return hasher_->Hash(value, len);
   }
 
+  void Hashes(const int32_t* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const int64_t* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const float* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const double* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const Int96* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const ByteArray* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const FLBA* values, uint32_t type_len, int num_values,
+              uint64_t* hashes) const override {
+    hasher_->Hashes(values, type_len, num_values, hashes);
+  }
+
+  uint64_t Hash(int32_t* value) const { return hasher_->Hash(*value); }

Review Comment:
   Please make the pointers `const` here.



##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,19 +257,52 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);
+    }
+  }
   void WriteTo(ArrowOutputStream* sink) const override;
   uint32_t GetBitsetSize() const override { return num_bytes_; }
 
+  uint64_t Hash(int32_t value) const override { return hasher_->Hash(value); }
   uint64_t Hash(int64_t value) const override { return hasher_->Hash(value); }
   uint64_t Hash(float value) const override { return hasher_->Hash(value); }
   uint64_t Hash(double value) const override { return hasher_->Hash(value); }
   uint64_t Hash(const Int96* value) const override { return hasher_->Hash(value); }
   uint64_t Hash(const ByteArray* value) const override { return hasher_->Hash(value); }
-  uint64_t Hash(int32_t value) const override { return hasher_->Hash(value); }
   uint64_t Hash(const FLBA* value, uint32_t len) const override {
     return hasher_->Hash(value, len);
   }
 
+  void Hashes(const int32_t* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const int64_t* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const float* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const double* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const Int96* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const ByteArray* values, int num_values, uint64_t* hashes) const override {
+    hasher_->Hashes(values, num_values, hashes);
+  }
+  void Hashes(const FLBA* values, uint32_t type_len, int num_values,
+              uint64_t* hashes) const override {
+    hasher_->Hashes(values, type_len, num_values, hashes);
+  }
+
+  uint64_t Hash(int32_t* value) const { return hasher_->Hash(*value); }

Review Comment:
   ```suggestion
     uint64_t Hash(const int32_t* value) const { return hasher_->Hash(*value); }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203842488


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   The point here is to avoid calling `InsertHash` many times, so this trivial implementation doesn't seem useful.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203858161


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < 32; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(32);
+  hasher_seed_0->Hashes(byte_array_vector.data(),
+                        static_cast<int>(byte_array_vector.size()), hashes.data());
+  for (int i = 0; i < 32; i++) {
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hashes[i])
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+template <typename DType>
+class TestBatchBloomFilter : public testing::Test {
+ public:
+  using T = typename DType::c_type;
+
+  void SetUp() {}
+
+  void TearDown() {}

Review Comment:
   These are unnecessary, remove them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203849613


##########
cpp/src/parquet/hasher.h:
##########
@@ -66,6 +66,58 @@ class Hasher {
   /// @param len the value length.
   virtual uint64_t Hash(const FLBA* value, uint32_t len) const = 0;
 
+  /// Batch compute hashes for 32 bits values by using its plain encoding result.
+  ///
+  /// @param values the value to hash.
+  /// @param num_values the number of values to hash.
+  /// @param hashes the output hash value, it length should be equal to num_values.

Review Comment:
   ```suggestion
     /// @param hashes a pointer to the output hash values, its length should be equal to num_values.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1561074737

   I've change the implemention to benchmark `BloomFilter*` now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205533882


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();

Review Comment:
   This line isn't useful as memory is not modified in this BM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205536630


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;

Review Comment:
   Also, let's call this `kNumBloomFilterInserts` or something?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205549838


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t exists = values[0];
+  for (auto _ : state) {
+    bool found = filter->FindHash(exists);
+    ::benchmark::DoNotOptimize(found);
+  }
+  state.SetItemsProcessed(state.iterations());
+}
+
+static void BM_FindNotExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t unexist = values.back();

Review Comment:
   Yes, but statistically there should be few collisions. Since a bloom filter is a statistical structure, there may be false positives anyway (AFAIU).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205541875


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());

Review Comment:
   Is it the exact same code as in `BM_InsertHash`? Can you factor it out?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203876902


##########
cpp/src/parquet/bloom_filter_test.cc:
##########
@@ -326,14 +327,137 @@ TEST(XxHashTest, TestBloomFilter) {
   uint8_t bytes[32] = {};
 
   for (int i = 0; i < 32; i++) {
-    ByteArray byteArray(i, bytes);
+    ByteArray byte_array(i, bytes);
     bytes[i] = i;
 
     auto hasher_seed_0 = std::make_unique<XxHasher>();
-    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byteArray))
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hasher_seed_0->Hash(&byte_array))
         << "Hash with seed 0 Error: " << i;
   }
 }
 
+// Same as TestBloomFilter but using Batch interface
+TEST(XxHashTest, TestBloomFilterHashes) {
+  uint8_t bytes[32] = {};
+
+  std::vector<ByteArray> byte_array_vector;
+  for (int i = 0; i < 32; i++) {
+    bytes[i] = i;
+    byte_array_vector.emplace_back(i, bytes);
+  }
+  auto hasher_seed_0 = std::make_unique<XxHasher>();
+  std::vector<uint64_t> hashes;
+  hashes.resize(32);
+  hasher_seed_0->Hashes(byte_array_vector.data(),
+                        static_cast<int>(byte_array_vector.size()), hashes.data());
+  for (int i = 0; i < 32; i++) {
+    EXPECT_EQ(HASHES_OF_LOOPING_BYTES_WITH_SEED_0[i], hashes[i])
+        << "Hash with seed 0 Error: " << i;
+  }
+}
+
+template <typename DType>
+class TestBatchBloomFilter : public testing::Test {
+ public:
+  using T = typename DType::c_type;

Review Comment:
   Nice catch! I'll do it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1203901048


##########
cpp/src/parquet/bloom_filter.h:
##########
@@ -200,6 +257,11 @@ class PARQUET_EXPORT BlockSplitBloomFilter : public BloomFilter {
 
   bool FindHash(uint64_t hash) const override;
   void InsertHash(uint64_t hash) override;
+  void InsertHashes(const uint64_t* hashes, int num_values) override {
+    for (int i = 0; i < num_values; ++i) {
+      InsertHash(hashes[i]);

Review Comment:
   "final" is not magic and will probably only work if you call `InsertHash` on the derived class, not on a base class pointer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204450465


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), test::kGenerateDataFLBALength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->FindHash(hash);
+    }

Review Comment:
   This measures performance for existing hash values, but can you also add a benchmark for missing hash values? The performance should be different.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204444719


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }

Review Comment:
   You should make sure the compile doesn't optimize the code away.
   ```suggestion
         if constexpr (std::is_same_v<DType, FLBAType>) {
           benchmark::DoNotOptimize(filter->Hash(&values[i], test::kGenerateDataFLBALength));
         } else {
           benchmark::DoNotOptimize(filter->Hash(values[i]));
         }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1205548294


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,230 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+
+#include <random>
+
+namespace parquet {
+namespace benchmark {
+
+constexpr static uint32_t kBloomFilterElementSize = 1024;
+
+std::unique_ptr<BloomFilter> createBloomFilter(uint32_t elementSize) {
+  std::unique_ptr<BlockSplitBloomFilter> block_split_bloom_filter =
+      std::make_unique<BlockSplitBloomFilter>();
+  block_split_bloom_filter->Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(elementSize, /*fpp=*/0.05));
+  std::unique_ptr<BloomFilter> bloom_filter = std::move(block_split_bloom_filter);
+  ::benchmark::DoNotOptimize(bloom_filter);
+  return bloom_filter;
+}
+
+constexpr static uint32_t kGenerateBenchmarkDataStringLength = 8;
+
+void GenerateRandomString(uint32_t length, std::vector<uint8_t>* heap) {
+  // Character set used to generate random string
+  const std::string charset =
+      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+
+  std::default_random_engine gen(0);
+  std::uniform_int_distribution<uint32_t> dist(0, static_cast<int>(charset.size() - 1));
+
+  for (uint32_t i = 0; i < length; i++) {
+    heap->push_back(charset[dist(gen)]);
+  }
+}
+
+template <typename T>
+void GenerateBenchmarkData(uint32_t size, T* data,
+                           [[maybe_unused]] std::vector<uint8_t>* heap) {
+  if constexpr (std::is_integral_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<T> d(std::numeric_limits<T>::min(),
+                                       std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_floating_point_v<T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_real_distribution<T> d(std::numeric_limits<T>::lowest(),
+                                        std::numeric_limits<T>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i] = d(gen);
+    }
+  } else if constexpr (std::is_same_v<FLBA, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<ByteArray, T>) {
+    GenerateRandomString(kGenerateBenchmarkDataStringLength * size, heap);
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].ptr = heap->data() + i * kGenerateBenchmarkDataStringLength;
+      data[i].len = kGenerateBenchmarkDataStringLength;
+    }
+  } else if constexpr (std::is_same_v<Int96, T>) {
+    std::default_random_engine gen(/*seed*/ 0);
+    std::uniform_int_distribution<int> d(std::numeric_limits<int>::min(),
+                                         std::numeric_limits<int>::max());
+    for (uint32_t i = 0; i < size; ++i) {
+      data[i].value[0] = d(gen);
+      data[i].value[1] = d(gen);
+      data[i].value[2] = d(gen);
+    }
+  }
+}
+
+template <typename DType>
+static void BM_ComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  for (auto _ : state) {
+    for (const auto& value : values) {
+      uint64_t hash = 0;
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        hash = filter->Hash(&value, kGenerateBenchmarkDataStringLength);
+      } else if constexpr (std::is_same_v<DType, Int96Type>) {
+        hash = filter->Hash(&value);
+      } else if constexpr (std::is_same_v<DType, ByteArrayType>) {
+        hash = filter->Hash(&value);
+      } else {
+        hash = filter->Hash(value);
+      }
+      ::benchmark::DoNotOptimize(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchComputeHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), kGenerateBenchmarkDataStringLength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+    ::benchmark::DoNotOptimize(hashes);
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+      ::benchmark::ClobberMemory();
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+    ::benchmark::ClobberMemory();
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t exists = values[0];
+  for (auto _ : state) {
+    bool found = filter->FindHash(exists);
+    ::benchmark::DoNotOptimize(found);
+  }
+  state.SetItemsProcessed(state.iterations());
+}
+
+static void BM_FindNotExistsHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(kBloomFilterElementSize);
+  std::vector<uint8_t> heap;
+  GenerateBenchmarkData(kBloomFilterElementSize, values.data(), &heap);
+  auto filter = createBloomFilter(kBloomFilterElementSize);
+  std::vector<uint64_t> hashes(kBloomFilterElementSize);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  // Insert all but the last hash.
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()) - 1);
+  uint32_t unexist = values.back();

Review Comment:
   > Same here. Instead, you can generate a different set of hash values from different data.
   
   Generate "different" hash may means keep a set of hash values if hash collision happens...I'll have a try



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1560408871

   * Closes: #35729


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35731:
URL: https://github.com/apache/arrow/pull/35731#discussion_r1204466779


##########
cpp/src/parquet/bloom_filter_benchmark.cc:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "parquet/bloom_filter.h"
+#include "parquet/test_util.h"
+
+namespace parquet {
+namespace benchmark {
+
+template <typename DType>
+static void BM_CountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  for (auto _ : state) {
+    for (int i = 0; i < 1024; ++i) {
+      if constexpr (std::is_same_v<DType, FLBAType>) {
+        filter->Hash(&values[i], test::kGenerateDataFLBALength);
+      } else {
+        filter->Hash(values[i]);
+      }
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+template <typename DType>
+static void BM_BatchCountHash(::benchmark::State& state) {
+  using T = typename DType::c_type;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  for (auto _ : state) {
+    if constexpr (std::is_same_v<DType, FLBAType>) {
+      filter->Hashes(values.data(), test::kGenerateDataFLBALength,
+                     static_cast<int>(values.size()), hashes.data());
+    } else {
+      filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_InsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->InsertHash(hash);
+    }
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_BatchInsertHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  for (auto _ : state) {
+    filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  }
+  state.SetItemsProcessed(state.iterations() * values.size());
+}
+
+static void BM_FindHash(::benchmark::State& state) {
+  using T = int32_t;
+  std::vector<T> values(1024);
+  std::vector<uint8_t> heap;
+  test::GenerateData(1024, values.data(), &heap);
+  BlockSplitBloomFilter block_split_bloom_filter;
+  block_split_bloom_filter.Init(
+      BlockSplitBloomFilter::OptimalNumOfBytes(1024, /*fpp=*/0.05));
+  BloomFilter* filter = &block_split_bloom_filter;
+  ::benchmark::DoNotOptimize(filter);
+  std::vector<uint64_t> hashes(1024);
+  filter->Hashes(values.data(), static_cast<int>(values.size()), hashes.data());
+  filter->InsertHashes(hashes.data(), static_cast<int>(hashes.size()));
+  for (auto _ : state) {
+    for (auto hash : hashes) {
+      filter->FindHash(hash);
+    }

Review Comment:
   Okay, will do it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35731: GH-35729: [C++][Parquet] Implement batch interface for BloomFilter in Parquet

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35731:
URL: https://github.com/apache/arrow/pull/35731#issuecomment-1568790603

   CI failures are unrelated, I'm merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org