You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/18 04:15:27 UTC

[GitHub] [arrow] jianxind opened a new pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

jianxind opened a new pull request #7213:
URL: https://github.com/apache/arrow/pull/7213


   Benchmark(1%, 10%, 50% null probabilities) and unittest for future SIMD optimizations.
   
   Signed-off-by: Frank Du <fr...@intel.com>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#issuecomment-632819189


   I just stumbled on failures on MSVC in these benchmarks (see https://issues.apache.org/jira/browse/ARROW-8892). I'm fixing the immediate issues that I found but the problem of checking that benchmarks compile in CI will have to be done separately
   
   In general, I think we should be less liberal about using `auto` to help avoid issues like this. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7213:
URL: https://github.com/apache/arrow/pull/7213


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r426976016



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,153 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  const auto values = new bool[num_values];

Review comment:
       Unlike the Put which support both T* and vector inputs, current PutSpaced only has interface to T*. I has to use bool[] here, it still need convert to bool* if using std::vector<bool>. 
     virtual void Put(const T* src, int num_values) = 0;
     virtual void Put(const std::vector<T>& src, int num_values = -1);
     virtual void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
                            int64_t valid_bits_offset) = 0;
   
   Another alternate way is std::vector<uint8_t> instead, and reinterpret_cast the data of std::vector<uint8_t> to bool?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427133854



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,130 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+template <typename ParquetType>
+struct BM_SpacedEncodingTraits;
+
+template <>
+struct BM_SpacedEncodingTraits<BooleanType> {
+  // Leverage UInt8 vector array data for Boolean, the input src of PutSpaced is bool*
+  using ArrowType = ::arrow::UInt8Type;
+  using ArrayType = ::arrow::UInt8Array;
+  using CType = bool;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<FloatType> {
+  using ArrowType = typename EncodingTraits<FloatType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename FloatType::c_type;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<DoubleType> {
+  using ArrowType = typename EncodingTraits<DoubleType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename DoubleType::c_type;
+};
+
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/90});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/99});
+}
+
+template <typename ParquetType>
+static void BM_PlainEncodingSpaced(benchmark::State& state) {
+  using ArrowType = typename BM_SpacedEncodingTraits<ParquetType>::ArrowType;
+  using ArrayType = typename BM_SpacedEncodingTraits<ParquetType>::ArrayType;
+  using CType = typename BM_SpacedEncodingTraits<ParquetType>::CType;
+
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  auto rand = ::arrow::random::RandomArrayGenerator(1923);
+  const auto array = rand.Numeric<ArrowType>(num_values, -100, 100, null_percent);
+  const auto valid_bits = array->null_bitmap_data();
+  const auto array_actual = arrow::internal::checked_pointer_cast<ArrayType>(array);
+
+  auto encoder = MakeTypedEncoder<ParquetType>(Encoding::PLAIN);
+  for (auto _ : state) {
+    // Cast only happens for special BooleanType as it use UInt8 for the array data to
+    // match a bool* input to PutSpaced.
+    encoder->PutSpaced(reinterpret_cast<const CType*>(array_actual->raw_values()),
+                       num_values, valid_bits, 0);
+    encoder->FlushValues();
+  }
+  state.SetBytesProcessed(state.iterations() * num_values * sizeof(CType));
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  BM_PlainEncodingSpaced<BooleanType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedBoolean)->Apply(BM_PlainSpacedArgs);
+
+static void BM_PlainEncodingSpacedFloat(benchmark::State& state) {
+  BM_PlainEncodingSpaced<FloatType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedFloat)->Apply(BM_PlainSpacedArgs);
+
+static void BM_PlainEncodingSpacedDouble(benchmark::State& state) {
+  BM_PlainEncodingSpaced<DoubleType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedDouble)->Apply(BM_PlainSpacedArgs);
+
+template <typename ParquetType>
+static void BM_PlainDecodingSpaced(benchmark::State& state) {
+  using ArrowType = typename BM_SpacedEncodingTraits<ParquetType>::ArrowType;
+  using ArrayType = typename BM_SpacedEncodingTraits<ParquetType>::ArrayType;
+  using CType = typename BM_SpacedEncodingTraits<ParquetType>::CType;
+
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  auto rand = ::arrow::random::RandomArrayGenerator(1923);
+  const auto array = rand.Numeric<ArrowType>(num_values, -100, 100, null_percent);
+  const auto valid_bits = array->null_bitmap_data();
+  const auto null_count = array->null_count();
+  const auto array_actual = arrow::internal::checked_pointer_cast<ArrayType>(array);
+
+  auto encoder = MakeTypedEncoder<ParquetType>(Encoding::PLAIN);
+  // Cast only happens for special BooleanType as it use UInt8 for the array data to match
+  // a bool* input to PutSpaced.

Review comment:
       I just added a static_assert to guarantee the size match between the cast




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r426979057



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,153 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  const auto values = new bool[num_values];

Review comment:
       I forgot about the bit-packing of std::vector<bool>. But yes, a `std::vector<char>` with `static_cast<bool*>(values.data())` should work. Or just use Int8Array like the other benchmark. It might be worth adding a quick comment about this discrepancy.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427138521



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -25,6 +25,7 @@
 #include "arrow/testing/util.h"
 #include "arrow/type.h"
 #include "arrow/util/byte_stream_split.h"
+#include "arrow/util/cpu_info.h"

Review comment:
       Why did you need this by the way?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#issuecomment-633461302


   Sorry that using too many autos here, I will notice it for further PRs. Thanks. Adding more case in CI is the fix otherwise I need to setup a MSVC context locally.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r426979057



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,153 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  const auto values = new bool[num_values];

Review comment:
       I forgot about the bit-packing of std::vector<bool>. But yes, a std::vector<char> with `static_cast<bool*>(values.data()) should work. Or just use Int8Array like the other benchmark. It might be worth adding a quick comment about this discrepancy.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r426827408



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,153 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  const auto values = new bool[num_values];

Review comment:
       Use std::vector.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427214775



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -25,6 +25,7 @@
 #include "arrow/testing/util.h"
 #include "arrow/type.h"
 #include "arrow/util/byte_stream_split.h"
+#include "arrow/util/cpu_info.h"

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#issuecomment-630520835


   > Suggestions:
   > 
   > * Either `Int32` and `Int64`, or `Float` and `Double`, are sufficient; I don't think both at the same time are useful (they should use the same respective code paths)
   > * It may be nice to add tests with null probability 90% and 99%
   
   Thanks. Yes the int32/int64 are the same path to float/double, even for SIMD, I removed it. 90% and 99% added also.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427200565



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -25,6 +25,7 @@
 #include "arrow/testing/util.h"
 #include "arrow/type.h"
 #include "arrow/util/byte_stream_split.h"
+#include "arrow/util/cpu_info.h"

Review comment:
       Using a hardcoded size would be just as good (or even better, to compare between machines).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427133854



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,130 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+template <typename ParquetType>
+struct BM_SpacedEncodingTraits;
+
+template <>
+struct BM_SpacedEncodingTraits<BooleanType> {
+  // Leverage UInt8 vector array data for Boolean, the input src of PutSpaced is bool*
+  using ArrowType = ::arrow::UInt8Type;
+  using ArrayType = ::arrow::UInt8Array;
+  using CType = bool;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<FloatType> {
+  using ArrowType = typename EncodingTraits<FloatType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename FloatType::c_type;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<DoubleType> {
+  using ArrowType = typename EncodingTraits<DoubleType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename DoubleType::c_type;
+};
+
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/90});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/99});
+}
+
+template <typename ParquetType>
+static void BM_PlainEncodingSpaced(benchmark::State& state) {
+  using ArrowType = typename BM_SpacedEncodingTraits<ParquetType>::ArrowType;
+  using ArrayType = typename BM_SpacedEncodingTraits<ParquetType>::ArrayType;
+  using CType = typename BM_SpacedEncodingTraits<ParquetType>::CType;
+
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  auto rand = ::arrow::random::RandomArrayGenerator(1923);
+  const auto array = rand.Numeric<ArrowType>(num_values, -100, 100, null_percent);
+  const auto valid_bits = array->null_bitmap_data();
+  const auto array_actual = arrow::internal::checked_pointer_cast<ArrayType>(array);
+
+  auto encoder = MakeTypedEncoder<ParquetType>(Encoding::PLAIN);
+  for (auto _ : state) {
+    // Cast only happens for special BooleanType as it use UInt8 for the array data to
+    // match a bool* input to PutSpaced.
+    encoder->PutSpaced(reinterpret_cast<const CType*>(array_actual->raw_values()),
+                       num_values, valid_bits, 0);
+    encoder->FlushValues();
+  }
+  state.SetBytesProcessed(state.iterations() * num_values * sizeof(CType));
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  BM_PlainEncodingSpaced<BooleanType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedBoolean)->Apply(BM_PlainSpacedArgs);
+
+static void BM_PlainEncodingSpacedFloat(benchmark::State& state) {
+  BM_PlainEncodingSpaced<FloatType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedFloat)->Apply(BM_PlainSpacedArgs);
+
+static void BM_PlainEncodingSpacedDouble(benchmark::State& state) {
+  BM_PlainEncodingSpaced<DoubleType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedDouble)->Apply(BM_PlainSpacedArgs);
+
+template <typename ParquetType>
+static void BM_PlainDecodingSpaced(benchmark::State& state) {
+  using ArrowType = typename BM_SpacedEncodingTraits<ParquetType>::ArrowType;
+  using ArrayType = typename BM_SpacedEncodingTraits<ParquetType>::ArrayType;
+  using CType = typename BM_SpacedEncodingTraits<ParquetType>::CType;
+
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  auto rand = ::arrow::random::RandomArrayGenerator(1923);
+  const auto array = rand.Numeric<ArrowType>(num_values, -100, 100, null_percent);
+  const auto valid_bits = array->null_bitmap_data();
+  const auto null_count = array->null_count();
+  const auto array_actual = arrow::internal::checked_pointer_cast<ArrayType>(array);
+
+  auto encoder = MakeTypedEncoder<ParquetType>(Encoding::PLAIN);
+  // Cast only happens for special BooleanType as it use UInt8 for the array data to match
+  // a bool* input to PutSpaced.

Review comment:
       I just added a static_assert to guarantee the size math between the cast




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#issuecomment-629936118


   https://issues.apache.org/jira/browse/ARROW-8841


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] fsaintjacques commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r426979057



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,153 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  const auto values = new bool[num_values];

Review comment:
       I forgot about the bit-packing of std::vector<bool>. But yes, a `std::vector<char>` with `static_cast<bool*>(values.data()) should work. Or just use Int8Array like the other benchmark. It might be worth adding a quick comment about this discrepancy.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427099420



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,130 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+template <typename ParquetType>
+struct BM_SpacedEncodingTraits;
+
+template <>
+struct BM_SpacedEncodingTraits<BooleanType> {
+  // Leverage UInt8 vector array data for Boolean, the input src of PutSpaced is bool*
+  using ArrowType = ::arrow::UInt8Type;
+  using ArrayType = ::arrow::UInt8Array;
+  using CType = bool;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<FloatType> {
+  using ArrowType = typename EncodingTraits<FloatType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename FloatType::c_type;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<DoubleType> {
+  using ArrowType = typename EncodingTraits<DoubleType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename DoubleType::c_type;
+};
+
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/90});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/99});
+}
+
+template <typename ParquetType>
+static void BM_PlainEncodingSpaced(benchmark::State& state) {
+  using ArrowType = typename BM_SpacedEncodingTraits<ParquetType>::ArrowType;
+  using ArrayType = typename BM_SpacedEncodingTraits<ParquetType>::ArrayType;
+  using CType = typename BM_SpacedEncodingTraits<ParquetType>::CType;
+
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  auto rand = ::arrow::random::RandomArrayGenerator(1923);
+  const auto array = rand.Numeric<ArrowType>(num_values, -100, 100, null_percent);
+  const auto valid_bits = array->null_bitmap_data();
+  const auto array_actual = arrow::internal::checked_pointer_cast<ArrayType>(array);
+
+  auto encoder = MakeTypedEncoder<ParquetType>(Encoding::PLAIN);
+  for (auto _ : state) {
+    // Cast only happens for special BooleanType as it use UInt8 for the array data to
+    // match a bool* input to PutSpaced.
+    encoder->PutSpaced(reinterpret_cast<const CType*>(array_actual->raw_values()),
+                       num_values, valid_bits, 0);
+    encoder->FlushValues();
+  }
+  state.SetBytesProcessed(state.iterations() * num_values * sizeof(CType));
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  BM_PlainEncodingSpaced<BooleanType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedBoolean)->Apply(BM_PlainSpacedArgs);
+
+static void BM_PlainEncodingSpacedFloat(benchmark::State& state) {
+  BM_PlainEncodingSpaced<FloatType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedFloat)->Apply(BM_PlainSpacedArgs);
+
+static void BM_PlainEncodingSpacedDouble(benchmark::State& state) {
+  BM_PlainEncodingSpaced<DoubleType>(state);
+}
+BENCHMARK(BM_PlainEncodingSpacedDouble)->Apply(BM_PlainSpacedArgs);
+
+template <typename ParquetType>
+static void BM_PlainDecodingSpaced(benchmark::State& state) {
+  using ArrowType = typename BM_SpacedEncodingTraits<ParquetType>::ArrowType;
+  using ArrayType = typename BM_SpacedEncodingTraits<ParquetType>::ArrayType;
+  using CType = typename BM_SpacedEncodingTraits<ParquetType>::CType;
+
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  auto rand = ::arrow::random::RandomArrayGenerator(1923);
+  const auto array = rand.Numeric<ArrowType>(num_values, -100, 100, null_percent);
+  const auto valid_bits = array->null_bitmap_data();
+  const auto null_count = array->null_count();
+  const auto array_actual = arrow::internal::checked_pointer_cast<ArrayType>(array);
+
+  auto encoder = MakeTypedEncoder<ParquetType>(Encoding::PLAIN);
+  // Cast only happens for special BooleanType as it use UInt8 for the array data to match
+  // a bool* input to PutSpaced.

Review comment:
       Is `bool` guaranteed to be the same as `uint8_t`???




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#issuecomment-630226330


   Suggestions:
   * Either `Int32` and `Int64`, or `Float` and `Double`, are sufficient; I don't think both at the same time are useful (they should use the same respective code paths)
   * It may be nice to add tests with null probability 90% and 99%
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427133113



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,130 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+template <typename ParquetType>
+struct BM_SpacedEncodingTraits;
+
+template <>
+struct BM_SpacedEncodingTraits<BooleanType> {
+  // Leverage UInt8 vector array data for Boolean, the input src of PutSpaced is bool*
+  using ArrowType = ::arrow::UInt8Type;
+  using ArrayType = ::arrow::UInt8Array;
+  using CType = bool;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<FloatType> {
+  using ArrowType = typename EncodingTraits<FloatType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename FloatType::c_type;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<DoubleType> {

Review comment:
       Thanks, done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427099731



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,130 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+template <typename ParquetType>
+struct BM_SpacedEncodingTraits;
+
+template <>
+struct BM_SpacedEncodingTraits<BooleanType> {
+  // Leverage UInt8 vector array data for Boolean, the input src of PutSpaced is bool*
+  using ArrowType = ::arrow::UInt8Type;
+  using ArrayType = ::arrow::UInt8Array;
+  using CType = bool;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<FloatType> {
+  using ArrowType = typename EncodingTraits<FloatType>::ArrowType;
+  using ArrayType = typename arrow::TypeTraits<ArrowType>::ArrayType;
+  using CType = typename FloatType::c_type;
+};
+
+template <>
+struct BM_SpacedEncodingTraits<DoubleType> {

Review comment:
       Instead of repeating yourself here, you could make this the default `BM_SpacedEncodingTraits`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427009705



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -199,6 +200,153 @@ static void BM_PlainDecodingFloat(benchmark::State& state) {
 
 BENCHMARK(BM_PlainDecodingFloat)->Range(MIN_RANGE, MAX_RANGE);
 
+static void BM_PlainSpacedArgs(benchmark::internal::Benchmark* bench) {
+  static const auto BM_kPlainSpacedSize =
+      arrow::internal::CpuInfo::GetInstance()->CacheSize(
+          arrow::internal::CpuInfo::L1_CACHE);
+
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/1});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/10});
+  bench->Args({/*size*/ BM_kPlainSpacedSize, /*null_percentage=*/50});
+}
+
+static void BM_PlainEncodingSpacedBoolean(benchmark::State& state) {
+  const auto num_values = state.range(0);
+  const double null_percent = static_cast<double>(state.range(1)) / 100.0;
+
+  const auto values = new bool[num_values];

Review comment:
       Good catch to use Int8Array thus we can reuse the template for bool also. Thanks. Uploaded the change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7213: ARROW-8841: [C++] Add benchmark and unittest for encoding::PLAIN spaced

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7213:
URL: https://github.com/apache/arrow/pull/7213#discussion_r427192686



##########
File path: cpp/src/parquet/encoding_benchmark.cc
##########
@@ -25,6 +25,7 @@
 #include "arrow/testing/util.h"
 #include "arrow/type.h"
 #include "arrow/util/byte_stream_split.h"
+#include "arrow/util/cpu_info.h"

Review comment:
       The size of test spaced block was set to L1 size. 
   arrow::internal::CpuInfo::GetInstance()->CacheSize(arrow::internal::CpuInfo::L1_CACHE);




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org