Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/09 16:14:59 UTC

[GitHub] [arrow] wesm opened a new pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size

wesm opened a new pull request #7382:
URL: https://github.com/apache/arrow/pull/7382


   I was going to work on Filter in this PR, but it got to be too much to review, so I'll tackle that in a separate PR.
   
   Summary of what's in this PR:
   
   * Uses BitBlockCounter to speed up processing of mostly-not-null indices vectors
   * Performance of primitive takes is improved ~2-3x across the board. Builder classes are no longer used for primitive types.
   * Size of vector_take.cc.o is reduced from 5.9 MB to 468 KB in an -O3 build with clang-8, and compilation time is correspondingly reduced. (I think the refactor for the new kernels framework had grown the code size in this module beyond what it was in 0.17.x.)
   * Kernels for primitive types of the same size are reused
   * Signed/unsigned indices are processed using unsigned integer code paths after they've been boundschecked (so we know that the signed ints have no negative values); see the sketch after this list
   * Adds new kernel input type checking rules so that the number of registered take kernels has gone from over 300 to only 11. This means faster dispatching, too. 
   * Does vectorized boundschecking
   * Take no longer uses kernels/vector_selection_internal.h, but it's still used by Filter. I plan to delete it after completing the Filter implementation
   * Fixes to the prior benchmarking PR
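
   As a concrete illustration of the signed-to-unsigned point above, here is a minimal standalone sketch (not the PR's actual code): once boundschecking has verified that no index is negative or out of range, a signed index buffer can be reinterpreted as the unsigned type of the same width, so only one gather loop per index byte width needs to be compiled.

   ```cpp
   #include <cstdint>

   // Core gather loop, instantiated only for unsigned index types.
   template <typename UnsignedIndex, typename Value>
   void TakeUnsigned(const Value* values, const UnsignedIndex* indices, int64_t n,
                     Value* out) {
     for (int64_t i = 0; i < n; ++i) {
       out[i] = values[indices[i]];
     }
   }

   // Entry point for int32 indices. The reinterpret_cast is safe only because
   // boundschecking already guaranteed that no index is negative.
   template <typename Value>
   void TakeInt32Indices(const Value* values, const int32_t* indices, int64_t n,
                         Value* out) {
     TakeUnsigned(values, reinterpret_cast<const uint32_t*>(indices), n, out);
   }
   ```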
   
   Note: support for doing Take with unions has been temporarily disabled. Since there is so little code in the codebase that deals with unions at the moment, I felt it would be better to address this in follow-up work. Here are the currently available kernels:
   
   ```
   [VectorKernel<(array[primitive], array[integer]) -> computed>,
    VectorKernel<(array[binary-like], array[integer]) -> computed>,
    VectorKernel<(array[large-binary-like], array[integer]) -> computed>,
    VectorKernel<(array[null], array[integer]) -> computed>,
    VectorKernel<(array[Type::DICTIONARY], array[integer]) -> computed>,
    VectorKernel<(array[Type::EXTENSION], array[integer]) -> computed>,
    VectorKernel<(array[Type::LIST], array[integer]) -> computed>,
    VectorKernel<(array[Type::LARGE_LIST], array[integer]) -> computed>,
    VectorKernel<(array[Type::FIXED_SIZE_LIST], array[integer]) -> computed>,
    VectorKernel<(array[Type::STRUCT], array[integer]) -> computed>,
    VectorKernel<(array[Type::MAP], array[integer]) -> computed>]
   ```
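
   For reference, here is a minimal call-site sketch (not part of this PR's diff; it assumes the Datum-based `compute::Take` overload with default options) showing how a caller reaches the dispatch across the kernels listed above:

   ```cpp
   #include <memory>

   #include "arrow/array.h"
   #include "arrow/compute/api_vector.h"
   #include "arrow/result.h"

   arrow::Result<std::shared_ptr<arrow::Array>> TakeExample(
       const std::shared_ptr<arrow::Array>& values,
       const std::shared_ptr<arrow::Array>& indices) {
     // Dispatch selects a kernel based on the (values, indices) input types.
     ARROW_ASSIGN_OR_RAISE(arrow::Datum out, arrow::compute::Take(values, indices));
     return out.make_array();
   }
   ```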
   
   I'd like to write some more unit tests to probe scenarios that aren't visited by the current tests, but I wanted to get this up for review. There's some code duplication left that I think can be cleaned up in follow-up PRs, using the benchmarks and code-size metrics as a strict guide for making changes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642057943


   Done. Please review


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r438759834



##########
File path: cpp/src/arrow/util/bit_block_counter.h
##########
@@ -17,14 +17,21 @@
 
 #pragma once
 
+#include <algorithm>
 #include <cstdint>
+#include <limits>
+#include <memory>
 
+#include "arrow/array/array_base.h"

Review comment:
       Is this useful?

##########
File path: cpp/src/arrow/compute/kernels/vector_take_test.cc
##########
@@ -211,13 +306,63 @@ TYPED_TEST(TestTakeKernelWithString, TakeString) {
                                      "[2, 5]", &arr));
 }
 
+TEST(TestTakeKernelString, Random) {
+  DoRandomTakeTests<StringType>(
+      [](int64_t length, double null_probability, random::RandomArrayGenerator* rng) {
+        return rng->String(length, 0, 32, null_probability);
+      });
+  DoRandomTakeTests<LargeStringType>(
+      [](int64_t length, double null_probability, random::RandomArrayGenerator* rng) {
+        return rng->LargeString(length, 0, 32, null_probability);
+      });

Review comment:
       Do you think you can add tests for sliced arrays with a non-0 offset? Both for primitive types and string-like types.
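
       For illustration, a rough sketch of what such a test could look like (hypothetical test name and parameters; it assumes the two-argument `Take` overload plus the `RandomArrayGenerator` and `ValidateTakeImpl` helpers used elsewhere in this file), slicing the values so their offset is non-zero:

       ```cpp
       TEST(TestTakeKernelString, TakeWithSlicedValues) {
         random::RandomArrayGenerator rng(/*seed=*/42);
         // Slice off the first 100 elements so the values array has offset 100.
         auto values = rng.String(/*size=*/200, /*min_length=*/0, /*max_length=*/32,
                                  /*null_probability=*/0.1)
                           ->Slice(100);
         // Indices stay within [0, 99], the length of the sliced values.
         auto indices = rng.Int32(/*size=*/100, /*min=*/0, /*max=*/99,
                                  /*null_probability=*/0.1);
         ASSERT_OK_AND_ASSIGN(Datum result, Take(values, indices));
         ValidateTakeImpl<StringType, Int32Type>(values, indices, result.make_array());
       }
       ```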

##########
File path: cpp/src/arrow/compute/kernels/vector_take_test.cc
##########
@@ -72,6 +73,120 @@ void AssertTakeBoolean(const std::string& values, const std::string& indices,
   CheckTake(boolean(), values, indices, expected);
 }
 
+template <typename ValuesType, typename IndexType>
+void ValidateTakeImpl(const std::shared_ptr<Array>& values,
+                      const std::shared_ptr<Array>& indices,
+                      const std::shared_ptr<Array>& result) {
+  using ValuesArrayType = typename TypeTraits<ValuesType>::ArrayType;
+  using IndexArrayType = typename TypeTraits<IndexType>::ArrayType;
+  auto typed_values = checked_pointer_cast<ValuesArrayType>(values);
+  auto typed_result = checked_pointer_cast<ValuesArrayType>(result);
+  auto typed_indices = checked_pointer_cast<IndexArrayType>(indices);
+  for (int64_t i = 0; i < indices->length(); ++i) {
+    if (typed_indices->IsNull(i) || typed_values->IsNull(typed_indices->Value(i))) {
+      ASSERT_TRUE(result->IsNull(i));
+      continue;
+    }
+    ASSERT_EQ(typed_result->GetView(i), typed_values->GetView(typed_indices->Value(i)))

Review comment:
       Also add `ASSERT_FALSE(result->IsNull(i))`?
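
       For reference, the loop quoted above with the suggested assertion added would read roughly:

       ```cpp
       for (int64_t i = 0; i < indices->length(); ++i) {
         if (typed_indices->IsNull(i) || typed_values->IsNull(typed_indices->Value(i))) {
           ASSERT_TRUE(result->IsNull(i));
           continue;
         }
         // Suggested extra check: a non-null index pointing at a non-null value
         // must yield a non-null output slot.
         ASSERT_FALSE(result->IsNull(i));
         ASSERT_EQ(typed_result->GetView(i),
                   typed_values->GetView(typed_indices->Value(i)));
       }
       ```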




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642561522


   I can confirm a performance increase here (2x to 5x faster) as well as a decrease in code size (`libarrow.so` has 15% less code!).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r439116396



##########
File path: cpp/src/arrow/compute/kernels/vector_take.cc
##########
@@ -15,67 +15,768 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#include <algorithm>
+#include <limits>
+#include <type_traits>
+
 #include "arrow/array/array_base.h"
+#include "arrow/array/builder_primitive.h"
 #include "arrow/array/concatenate.h"
+#include "arrow/array/data.h"
+#include "arrow/buffer_builder.h"
 #include "arrow/compute/api_vector.h"
 #include "arrow/compute/kernels/common.h"
-#include "arrow/compute/kernels/vector_selection_internal.h"
 #include "arrow/record_batch.h"
 #include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
 
 namespace arrow {
+
+using internal::BitBlockCount;
+using internal::BitmapReader;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
 namespace compute {
 namespace internal {
 
-struct TakeState : public KernelState {
-  explicit TakeState(const TakeOptions& options) : options(options) {}
-  TakeOptions options;
+using TakeState = OptionsWrapper<TakeOptions>;
+
+// ----------------------------------------------------------------------
+// Implement optimized take for primitive types from boolean to 1/2/4/8-byte
+// C-type based types. Use common implementation for every byte width and only
+// generate code for unsigned integer indices, since after boundschecking to
+// check for negative numbers in the indices we can safely reinterpret_cast
+// signed integers as unsigned.
+
+struct PrimitiveTakeArgs {
+  const uint8_t* values;
+  const uint8_t* values_bitmap = nullptr;
+  int values_bit_width;
+  int64_t values_length;
+  int64_t values_offset;
+  int64_t values_null_count;
+  const uint8_t* indices;
+  const uint8_t* indices_bitmap = nullptr;
+  int indices_bit_width;
+  int64_t indices_length;
+  int64_t indices_offset;
+  int64_t indices_null_count;
 };
 
-std::unique_ptr<KernelState> InitTake(KernelContext*, const KernelInitArgs& args) {
-  // NOTE: TakeOptions are currently unused, but we pass it through anyway
-  auto take_options = static_cast<const TakeOptions*>(args.options);
-  return std::unique_ptr<KernelState>(new TakeState{*take_options});
+// Reduce code size by dealing with the unboxing of the kernel inputs once
+// rather than duplicating compiled code to do all these in each kernel.
+PrimitiveTakeArgs GetPrimitiveTakeArgs(const ExecBatch& batch) {
+  PrimitiveTakeArgs args;
+
+  const ArrayData& arg0 = *batch[0].array();
+  const ArrayData& arg1 = *batch[1].array();
+
+  // Values
+  args.values_bit_width = checked_cast<const FixedWidthType&>(*arg0.type).bit_width();
+  args.values = arg0.buffers[1]->data();
+  if (args.values_bit_width > 1) {
+    args.values += arg0.offset * args.values_bit_width / 8;
+  }
+  args.values_length = arg0.length;
+  args.values_offset = arg0.offset;
+  args.values_null_count = arg0.GetNullCount();
+  if (arg0.buffers[0]) {
+    args.values_bitmap = arg0.buffers[0]->data();
+  }
+
+  // Indices
+  args.indices_bit_width = checked_cast<const FixedWidthType&>(*arg1.type).bit_width();
+  args.indices = arg1.buffers[1]->data() + arg1.offset * args.indices_bit_width / 8;
+  args.indices_length = arg1.length;
+  args.indices_offset = arg1.offset;
+  args.indices_null_count = arg1.GetNullCount();
+  if (arg1.buffers[0]) {
+    args.indices_bitmap = arg1.buffers[0]->data();
+  }
+
+  return args;
 }
 
-template <typename ValueType, typename IndexType>
-struct TakeFunctor {
-  using ValueArrayType = typename TypeTraits<ValueType>::ArrayType;
-  using IndexArrayType = typename TypeTraits<IndexType>::ArrayType;
-  using IS = ArrayIndexSequence<IndexType>;
+/// \brief The Take implementation for primitive (fixed-width) types does not
+/// use the logical Arrow type but rather the physical C type. This way we
+/// only generate one take function for each byte width.
+///
+/// This function assumes that the indices have been boundschecked.
+template <typename IndexCType, typename ValueCType>
+struct PrimitiveTakeImpl {
+  static void Exec(const PrimitiveTakeArgs& args, Datum* out_datum) {
+    auto values = reinterpret_cast<const ValueCType*>(args.values);
+    auto values_bitmap = args.values_bitmap;
+    auto values_offset = args.values_offset;
+
+    auto indices = reinterpret_cast<const IndexCType*>(args.indices);
+    auto indices_bitmap = args.indices_bitmap;
+    auto indices_offset = args.indices_offset;
+
+    ArrayData* out_arr = out_datum->mutable_array();
+    auto out = out_arr->GetMutableValues<ValueCType>(1);
+    auto out_bitmap = out_arr->buffers[0]->mutable_data();
+    auto out_offset = out_arr->offset;
+
+    // If either the values or indices have nulls, we preemptively zero out the
+    // out validity bitmap so that we don't have to use ClearBit in each
+    // iteration for nulls.
+    if (args.values_null_count > 0 || args.indices_null_count > 0) {
+      BitUtil::SetBitsTo(out_bitmap, out_offset, args.indices_length, false);
+    }
+
+    OptionalBitBlockCounter indices_bit_counter(indices_bitmap, indices_offset,
+                                                args.indices_length);
+    int64_t position = 0;
+    int64_t valid_count = 0;
+    while (position < args.indices_length) {
+      BitBlockCount block = indices_bit_counter.NextBlock();
+      if (args.values_null_count == 0) {
+        // Values are never null, so things are easier
+        valid_count += block.popcount;
+        if (block.popcount == block.length) {
+          // Fastest path: neither values nor index nulls
+          BitUtil::SetBitsTo(out_bitmap, out_offset + position, block.length, true);
+          for (int64_t i = 0; i < block.length; ++i) {
+            out[position] = values[indices[position]];
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some indices but not all are null
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              out[position] = values[indices[position]];
+            } else {
+              out[position] = ValueCType{};
+            }
+            ++position;
+          }
+        } else {

Review comment:
       I'll make this change in my Filter patch, thanks for catching this




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm edited a comment on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642856297


   Same issue as jemalloc, I think. I really hope that the static lib issue will be addressed for 1.0. I can try to do it if no one else can work on it, but I see that as a last resort since there are other high-value things I'm trying to get done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642742813


   Here's an AppVeyor build: https://ci.appveyor.com/project/wesm/arrow/builds/33463261. Will merge this shortly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r437133332



##########
File path: cpp/src/arrow/compute/kernels/vector_take.cc
##########
@@ -38,46 +52,758 @@ std::unique_ptr<KernelState> InitTake(KernelContext*, const KernelInitArgs& args
   return std::unique_ptr<KernelState>(new TakeState{*take_options});
 }
 
-template <typename ValueType, typename IndexType>
-struct TakeFunctor {
-  using ValueArrayType = typename TypeTraits<ValueType>::ArrayType;
-  using IndexArrayType = typename TypeTraits<IndexType>::ArrayType;
-  using IS = ArrayIndexSequence<IndexType>;
+namespace {
+
+template <typename IndexCType, bool IsSigned = std::is_signed<IndexCType>::value>
+Status BoundscheckImpl(const ArrayData& indices, IndexCType upper_limit) {
+  // For unsigned integers, if the values array is larger than the maximum
+  // index value (e.g. especially for UINT8 / UINT16), then there is no need to
+  // boundscheck.
+  if (!IsSigned && upper_limit >= std::numeric_limits<IndexCType>::max()) {
+    return Status::OK();
+  }
 
-  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
-    ValueArrayType values(batch[0].array());
-    IndexArrayType indices(batch[1].array());
-    std::shared_ptr<Array> result;
-    KERNEL_RETURN_IF_ERROR(ctx, Select(ctx, values, IS(indices), &result));
-    out->value = result->data();
+  const IndexCType* indices_data = indices.GetValues<IndexCType>(1);
+  const uint8_t* bitmap = nullptr;
+  if (indices.buffers[0]) {
+    bitmap = indices.buffers[0]->data();
   }
+  OptionalBitBlockCounter indices_bit_counter(bitmap, indices.offset, indices.length);
+  int64_t position = 0;
+  while (position < indices.length) {
+    BitBlockCount block = indices_bit_counter.NextBlock();
+    bool block_out_of_bounds = false;
+    if (block.popcount == block.length) {
+      // Fast path: branchless
+      for (int64_t i = 0; i < block.length; ++i) {
+        block_out_of_bounds |=
+            ((IsSigned && indices_data[i] < 0) || indices_data[i] >= upper_limit);
+      }
+    } else if (block.popcount > 0) {
+      // Indices have nulls, must only boundscheck non-null values
+      for (int64_t i = 0; i < block.length; ++i) {
+        if (BitUtil::GetBit(bitmap, indices.offset + position + i)) {
+          block_out_of_bounds |=
+              ((IsSigned && indices_data[i] < 0) || indices_data[i] >= upper_limit);
+        }
+      }
+    }
+    if (block_out_of_bounds) {
+      // TODO: Find the out of bounds index in the block
+      return Status::IndexError("Take indices out of bounds");
+    }
+    indices_data += block.length;
+    position += block.length;
+  }
+  return Status::OK();
+}
+
+/// \brief Branchless boundschecking of the indices. Processes batches of
+/// indices at a time and shortcircuits when encountering an out-of-bounds
+/// index in a batch
+Status Boundscheck(const ArrayData& indices, int64_t upper_limit) {
+  switch (indices.type->id()) {
+    case Type::INT8:
+      return BoundscheckImpl<int8_t>(indices, upper_limit);
+    case Type::INT16:
+      return BoundscheckImpl<int16_t>(indices, upper_limit);
+    case Type::INT32:
+      return BoundscheckImpl<int32_t>(indices, upper_limit);
+    case Type::INT64:
+      return BoundscheckImpl<int64_t>(indices, upper_limit);
+    case Type::UINT8:
+      return BoundscheckImpl<uint8_t>(indices, upper_limit);
+    case Type::UINT16:
+      return BoundscheckImpl<uint16_t>(indices, upper_limit);
+    case Type::UINT32:
+      return BoundscheckImpl<uint32_t>(indices, upper_limit);
+    case Type::UINT64:
+      return BoundscheckImpl<uint64_t>(indices, upper_limit);
+    default:
+      return Status::Invalid("Invalid index type for boundschecking");
+  }
+}
+
+}  // namespace
+
+// ----------------------------------------------------------------------
+// Implement optimized take for primitive types from boolean to 1/2/4/8-byte
+// C-type based types. Use common implementation for every byte width and only
+// generate code for unsigned integer indices, since after boundschecking to
+// check for negative numbers the indices we can safely reinterpret_cast signed
+// integers as unsigned.
+
+struct PrimitiveTakeArgs {
+  const uint8_t* values;
+  const uint8_t* values_bitmap = nullptr;
+  int values_bit_width;
+  int64_t values_length;
+  int64_t values_offset;
+  int64_t values_null_count;
+  const uint8_t* indices;
+  const uint8_t* indices_bitmap = nullptr;
+  int indices_bit_width;
+  int64_t indices_length;
+  int64_t indices_offset;
+  int64_t indices_null_count;
 };
 
-struct TakeKernelVisitor {
-  TakeKernelVisitor(const DataType& value_type, const DataType& index_type)
-      : value_type(value_type), index_type(index_type) {}
+// Reduce code size by dealing with the unboxing of the kernel inputs once
+// rather than duplicating compiled code to do all these in each kernel.
+PrimitiveTakeArgs GetPrimitiveTakeArgs(const ExecBatch& batch) {
+  PrimitiveTakeArgs args;
 
-  template <typename Type>
-  Status Visit(const Type&) {
-    this->result = codegen::Integer<TakeFunctor, Type>(index_type.id());
-    return Status::OK();
+  const ArrayData& arg0 = *batch[0].array();
+  const ArrayData& arg1 = *batch[1].array();
+
+  // Values
+  args.values_bit_width = static_cast<const FixedWidthType&>(*arg0.type).bit_width();
+  args.values = arg0.buffers[1]->data();
+  if (args.values_bit_width > 1) {
+    args.values += arg0.offset * args.values_bit_width / 8;
+  }
+  args.values_length = arg0.length;
+  args.values_offset = arg0.offset;
+  args.values_null_count = arg0.GetNullCount();
+  if (arg0.buffers[0]) {
+    args.values_bitmap = arg0.buffers[0]->data();
   }
 
-  Status Create() { return VisitTypeInline(value_type, this); }
+  // Indices
+  args.indices_bit_width = static_cast<const FixedWidthType&>(*arg1.type).bit_width();
+  args.indices = arg1.buffers[1]->data() + arg1.offset * args.indices_bit_width / 8;
+  args.indices_length = arg1.length;
+  args.indices_offset = arg1.offset;
+  args.indices_null_count = arg1.GetNullCount();
+  if (arg1.buffers[0]) {
+    args.indices_bitmap = arg1.buffers[0]->data();
+  }
+
+  return args;
+}
+
+/// \brief The Take implementation for primitive (fixed-width) types does not
+/// use the logical Arrow type but rather then physical C type. This way we
+/// only generate one take function for each byte width.
+///
+/// This function assumes that the indices have been boundschecked.
+template <typename IndexCType, typename ValueCType>
+struct PrimitiveTakeImpl {
+  static void Exec(const PrimitiveTakeArgs& args, Datum* out_datum) {
+    auto values = reinterpret_cast<const ValueCType*>(args.values);
+    auto values_bitmap = args.values_bitmap;
+    auto values_offset = args.values_offset;
+
+    auto indices = reinterpret_cast<const IndexCType*>(args.indices);
+    auto indices_bitmap = args.indices_bitmap;
+    auto indices_offset = args.indices_offset;
 
-  const DataType& value_type;
-  const DataType& index_type;
-  ArrayKernelExec result;
+    ArrayData* out_arr = out_datum->mutable_array();
+    auto out = out_arr->GetMutableValues<ValueCType>(1);
+    auto out_bitmap = out_arr->buffers[0]->mutable_data();
+    auto out_offset = out_arr->offset;
+
+    // If either the values or indices have nulls, we preemptively zero out the
+    // out validity bitmap so that we don't have to use ClearBit in each
+    // iteration for nulls.
+    if (args.values_null_count > 0 || args.indices_null_count > 0) {
+      BitUtil::SetBitsTo(out_bitmap, out_offset, args.indices_length, false);
+    }
+
+    OptionalBitBlockCounter indices_bit_counter(indices_bitmap, indices_offset,
+                                                args.indices_length);
+    int64_t position = 0;
+    int64_t valid_count = 0;
+    while (true) {
+      BitBlockCount block = indices_bit_counter.NextBlock();
+      if (block.length == 0) {
+        // All indices processed.
+        break;
+      }
+      if (args.values_null_count == 0) {
+        // Values are never null, so things are easier
+        valid_count += block.popcount;
+        if (block.popcount == block.length) {
+          // Fastest path: neither values nor index nulls
+          BitUtil::SetBitsTo(out_bitmap, out_offset + position, block.length, true);
+          for (int64_t i = 0; i < block.length; ++i) {
+            out[position] = values[indices[position]];
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some indices but not all are null
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              out[position] = values[indices[position]];
+            }
+            ++position;
+          }
+        }
+      } else {
+        // Values have nulls, so we must do random access into the values bitmap
+        if (block.popcount == block.length) {
+          // Faster path: indices are not null but values may be
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(values_bitmap, values_offset + indices[position])) {
+              // value is not null
+              out[position] = values[indices[position]];
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              ++valid_count;
+            }
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some but not all indices are null. Since we are doing
+          // random access in general we have to check the value nullness one by
+          // one.
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              if (BitUtil::GetBit(values_bitmap, values_offset + indices[position])) {
+                // value is not null
+                out[position] = values[indices[position]];
+                BitUtil::SetBit(out_bitmap, out_offset + position);
+                ++valid_count;
+              }
+            }
+            ++position;
+          }
+        }
+      }
+    }
+    out_arr->null_count = out_arr->length - valid_count;
+  }
 };
 
-Status GetTakeKernel(const DataType& value_type, const DataType& index_type,
-                     ArrayKernelExec* exec) {
-  TakeKernelVisitor visitor(value_type, index_type);
-  RETURN_NOT_OK(visitor.Create());
-  *exec = visitor.result;
+template <typename IndexCType>
+struct BooleanTakeImpl {
+  static void Exec(const PrimitiveTakeArgs& args, Datum* out_datum) {
+    auto values = args.values;
+    auto values_bitmap = args.values_bitmap;
+    auto values_offset = args.values_offset;
+
+    auto indices = reinterpret_cast<const IndexCType*>(args.indices);
+    auto indices_bitmap = args.indices_bitmap;
+    auto indices_offset = args.indices_offset;
+
+    ArrayData* out_arr = out_datum->mutable_array();
+    auto out = out_arr->buffers[1]->mutable_data();
+    auto out_bitmap = out_arr->buffers[0]->mutable_data();
+    auto out_offset = out_arr->offset;
+
+    // If either the values or indices have nulls, we preemptively zero out the
+    // out validity bitmap so that we don't have to use ClearBit in each
+    // iteration for nulls.
+    if (args.values_null_count > 0 || args.indices_null_count > 0) {
+      BitUtil::SetBitsTo(out_bitmap, out_offset, args.indices_length, false);
+    }
+
+    auto PlaceDataBit = [&](int64_t loc, IndexCType index) {
+      BitUtil::SetBitTo(out, out_offset + loc,
+                        BitUtil::GetBit(values, values_offset + index));
+    };
+
+    OptionalBitBlockCounter indices_bit_counter(indices_bitmap, indices_offset,
+                                                args.indices_length);
+    int64_t position = 0;
+    int64_t valid_count = 0;
+    while (true) {
+      BitBlockCount block = indices_bit_counter.NextBlock();
+      if (block.length == 0) {
+        // All indices processed.
+        break;
+      }
+      if (args.values_null_count == 0) {
+        // Values are never null, so things are easier
+        valid_count += block.popcount;
+        if (block.popcount == block.length) {
+          // Fastest path: neither values nor index nulls
+          BitUtil::SetBitsTo(out_bitmap, out_offset + position, block.length, true);
+          for (int64_t i = 0; i < block.length; ++i) {
+            PlaceDataBit(position, indices[position]);
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some but not all indices are null
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              PlaceDataBit(position, indices[position]);
+            }
+            ++position;
+          }
+        }
+      } else {
+        // Values have nulls, so we must do random access into the values bitmap
+        if (block.popcount == block.length) {
+          // Faster path: indices are not null but values may be
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(values_bitmap, values_offset + indices[position])) {
+              // value is not null
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              PlaceDataBit(position, indices[position]);
+              ++valid_count;
+            }
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some but not all indices are null. Since we are doing
+          // random access in general we have to check the value nullness one by
+          // one.
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              if (BitUtil::GetBit(values_bitmap, values_offset + indices[position])) {
+                // value is not null
+                PlaceDataBit(position, indices[position]);
+                BitUtil::SetBit(out_bitmap, out_offset + position);
+                ++valid_count;
+              }
+            }
+            ++position;
+          }
+        }
+      }
+    }
+    out_arr->null_count = out_arr->length - valid_count;
+  }
+};
+
+template <template <typename...> class TakeImpl, typename... Args>
+void TakeIndexDispatch(const PrimitiveTakeArgs& args, Datum* out) {
+  // With the simplifying assumption that boundschecking has taken place
+  // already at a higher level, we can now assume that the index values are all
+  // non-negative. Thus, we can interpret signed integers as unsigned and avoid
+  // having to generate double the amount of binary code to handle each integer
+  // with.

Review comment:
       typo: width

##########
File path: cpp/src/arrow/util/bit_block_counter.h
##########
@@ -68,6 +75,39 @@ class ARROW_EXPORT BitBlockCounter {
   int64_t offset_;
 };
 
+/// \brief A tool to iterate through a possibly non-existent validity bitmap,
+/// to allow us to write one code path for both the with-nulls and no-nulls
+/// cases without giving up a lot of performance
+class OptionalBitBlockCounter {
+ public:
+  OptionalBitBlockCounter(const uint8_t* validity_bitmap, int64_t offset, int64_t length);
+
+  OptionalBitBlockCounter(const std::shared_ptr<Buffer>& validity_bitmap, int64_t offset,
+                          int64_t length);
+
+  /// Return block count for next word when the bitmap is

Review comment:
       Complete

##########
File path: cpp/src/arrow/compute/benchmark_util.h
##########
@@ -77,12 +77,18 @@ struct RegressionArgs {
   const int64_t size;
 
   // proportion of nulls in generated arrays
-  const double null_proportion;
+  double null_proportion;
 
   explicit RegressionArgs(benchmark::State& state)
       : size(state.range(0)),
-        null_proportion(std::min(1., 1. / static_cast<double>(state.range(1)))),
-        state_(state) {}
+        null_proportion(),
+        state_(state) {
+    if (state.range(1) == 0) {
+      this->null_proportion = 0.0;
+    } else {
+      this->null_proportion = std::min(1., 1. / static_cast<double>(state.range(1)));
+    }
+  }
 

Review comment:
       Sure, will do




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7382:
URL: https://github.com/apache/arrow/pull/7382


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm edited a comment on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642285703


   I am preparing another large patch that uses this branch as a base, so this patch will need to be reviewed and merged before the next one (providing a streamlined Filter implementation) can be reviewed. CI is passing -- the only failure is an out-of-space error on the s390x box.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-641025248


   https://issues.apache.org/jira/browse/ARROW-5760


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642690573


   Please go ahead 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642848987


   BTW, I ran benchmarks for this with MSVC 2017 with and without mimalloc, and mimalloc has a pretty big impact. We should definitely endeavor to ship mimalloc in all of our Windows binaries in the next release.
   
   https://gist.github.com/wesm/5b0841e583a27a8dc71f951b30475015/revisions?diff=split


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-641630347


   Appveyor is failing because of https://github.com/apache/arrow/commit/b058cf0d1c26ad7984c104bb84322cc7dcc66f00
   
   I opened https://issues.apache.org/jira/browse/ARROW-9085
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-641720988


   Noticed a few small nits. Somebody more familiar with Take should review this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642852731


   Does mimalloc have the same special build constraints as jemalloc, such that we have to build it bundled? (Also, are we going to get that jemalloc static-lib symbol gluing done for 1.0?)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r438921113



##########
File path: cpp/src/arrow/compute/kernels/vector_take.cc
##########
@@ -15,67 +15,768 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#include <algorithm>
+#include <limits>
+#include <type_traits>
+
 #include "arrow/array/array_base.h"
+#include "arrow/array/builder_primitive.h"
 #include "arrow/array/concatenate.h"
+#include "arrow/array/data.h"
+#include "arrow/buffer_builder.h"
 #include "arrow/compute/api_vector.h"
 #include "arrow/compute/kernels/common.h"
-#include "arrow/compute/kernels/vector_selection_internal.h"
 #include "arrow/record_batch.h"
 #include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
 
 namespace arrow {
+
+using internal::BitBlockCount;
+using internal::BitmapReader;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
 namespace compute {
 namespace internal {
 
-struct TakeState : public KernelState {
-  explicit TakeState(const TakeOptions& options) : options(options) {}
-  TakeOptions options;
+using TakeState = OptionsWrapper<TakeOptions>;
+
+// ----------------------------------------------------------------------
+// Implement optimized take for primitive types from boolean to 1/2/4/8-byte
+// C-type based types. Use common implementation for every byte width and only
+// generate code for unsigned integer indices, since after boundschecking to
+// check for negative numbers in the indices we can safely reinterpret_cast
+// signed integers as unsigned.
+
+struct PrimitiveTakeArgs {
+  const uint8_t* values;
+  const uint8_t* values_bitmap = nullptr;
+  int values_bit_width;
+  int64_t values_length;
+  int64_t values_offset;
+  int64_t values_null_count;
+  const uint8_t* indices;
+  const uint8_t* indices_bitmap = nullptr;
+  int indices_bit_width;
+  int64_t indices_length;
+  int64_t indices_offset;
+  int64_t indices_null_count;
 };
 
-std::unique_ptr<KernelState> InitTake(KernelContext*, const KernelInitArgs& args) {
-  // NOTE: TakeOptions are currently unused, but we pass it through anyway
-  auto take_options = static_cast<const TakeOptions*>(args.options);
-  return std::unique_ptr<KernelState>(new TakeState{*take_options});
+// Reduce code size by dealing with the unboxing of the kernel inputs once
+// rather than duplicating compiled code to do all these in each kernel.
+PrimitiveTakeArgs GetPrimitiveTakeArgs(const ExecBatch& batch) {
+  PrimitiveTakeArgs args;
+
+  const ArrayData& arg0 = *batch[0].array();
+  const ArrayData& arg1 = *batch[1].array();
+
+  // Values
+  args.values_bit_width = checked_cast<const FixedWidthType&>(*arg0.type).bit_width();
+  args.values = arg0.buffers[1]->data();
+  if (args.values_bit_width > 1) {
+    args.values += arg0.offset * args.values_bit_width / 8;
+  }
+  args.values_length = arg0.length;
+  args.values_offset = arg0.offset;
+  args.values_null_count = arg0.GetNullCount();
+  if (arg0.buffers[0]) {
+    args.values_bitmap = arg0.buffers[0]->data();
+  }
+
+  // Indices
+  args.indices_bit_width = checked_cast<const FixedWidthType&>(*arg1.type).bit_width();
+  args.indices = arg1.buffers[1]->data() + arg1.offset * args.indices_bit_width / 8;
+  args.indices_length = arg1.length;
+  args.indices_offset = arg1.offset;
+  args.indices_null_count = arg1.GetNullCount();
+  if (arg1.buffers[0]) {
+    args.indices_bitmap = arg1.buffers[0]->data();
+  }
+
+  return args;
 }
 
-template <typename ValueType, typename IndexType>
-struct TakeFunctor {
-  using ValueArrayType = typename TypeTraits<ValueType>::ArrayType;
-  using IndexArrayType = typename TypeTraits<IndexType>::ArrayType;
-  using IS = ArrayIndexSequence<IndexType>;
+/// \brief The Take implementation for primitive (fixed-width) types does not
+/// use the logical Arrow type but rather the physical C type. This way we
+/// only generate one take function for each byte width.
+///
+/// This function assumes that the indices have been boundschecked.
+template <typename IndexCType, typename ValueCType>
+struct PrimitiveTakeImpl {
+  static void Exec(const PrimitiveTakeArgs& args, Datum* out_datum) {
+    auto values = reinterpret_cast<const ValueCType*>(args.values);
+    auto values_bitmap = args.values_bitmap;
+    auto values_offset = args.values_offset;
+
+    auto indices = reinterpret_cast<const IndexCType*>(args.indices);
+    auto indices_bitmap = args.indices_bitmap;
+    auto indices_offset = args.indices_offset;
+
+    ArrayData* out_arr = out_datum->mutable_array();
+    auto out = out_arr->GetMutableValues<ValueCType>(1);
+    auto out_bitmap = out_arr->buffers[0]->mutable_data();
+    auto out_offset = out_arr->offset;
+
+    // If either the values or indices have nulls, we preemptively zero out the
+    // out validity bitmap so that we don't have to use ClearBit in each
+    // iteration for nulls.
+    if (args.values_null_count > 0 || args.indices_null_count > 0) {
+      BitUtil::SetBitsTo(out_bitmap, out_offset, args.indices_length, false);
+    }
+
+    OptionalBitBlockCounter indices_bit_counter(indices_bitmap, indices_offset,
+                                                args.indices_length);
+    int64_t position = 0;
+    int64_t valid_count = 0;
+    while (position < args.indices_length) {
+      BitBlockCount block = indices_bit_counter.NextBlock();
+      if (args.values_null_count == 0) {
+        // Values are never null, so things are easier
+        valid_count += block.popcount;
+        if (block.popcount == block.length) {
+          // Fastest path: neither values nor index nulls
+          BitUtil::SetBitsTo(out_bitmap, out_offset + position, block.length, true);
+          for (int64_t i = 0; i < block.length; ++i) {
+            out[position] = values[indices[position]];
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some indices but not all are null
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              out[position] = values[indices[position]];
+            } else {
+              out[position] = ValueCType{};
+            }
+            ++position;
+          }
+        } else {

Review comment:
       ```suggestion
           } else {
             memset(out + position, 0, sizeof(ValueCType) * block.length);
   ```
   All other paths set the value under a null to 0




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-641027598






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642285703


   I am preparing another large patch that uses this branch as a base, so this patch will need to be reviewed and merged before the next one (providing a streamlined Filter implementation) can be reviewed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642742047


   @pitrou thanks for the fixes + improvements!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r437859624



##########
File path: cpp/src/arrow/util/int_util_test.cc
##########
@@ -382,5 +386,102 @@ TEST(TransposeInts, Int8ToInt64) {
   ASSERT_EQ(dest, std::vector<int64_t>({2222, 4444, 6666, 1111, 4444, 3333}));
 }
 
+void BoundscheckPasses(const std::shared_ptr<DataType>& type,

Review comment:
       nit: BoundsCheck




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642856297


   Same issue as jemalloc, I think. I really hope that the static lib issue will be addressed for 1.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-641612120


   I added a bunch more unit tests. I think I'm done with this aside from CI fixes and will move on to working on Filter.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r437859793



##########
File path: cpp/src/arrow/util/bit_block_counter_test.cc
##########
@@ -246,5 +246,28 @@ TEST(TestBinaryBitBlockCounter, NextAndWord) {
   }
 }
 
+TEST(TestOptionalBitBlockCounter, Basics) {
+  const int64_t nbytes = 1024;
+  auto bitmap = *AllocateBitmap(nbytes * 8);
+  random_bytes(nbytes, 0, bitmap->mutable_data());
+
+  OptionalBitBlockCounter optional_counter(bitmap, 0, nbytes * 8);
+  BitBlockCounter bit_counter(bitmap->data(), 0, nbytes * 8);
+
+  while (true) {

Review comment:
       nit: do/while?
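
       For example, a do/while phrasing could look like this (a sketch only; it assumes the truncated loop body above compares the two counters block-by-block and stops on an empty block):

       ```cpp
       BitBlockCount block;
       do {
         block = optional_counter.NextBlock();
         BitBlockCount expected = bit_counter.NextWord();
         ASSERT_EQ(block.length, expected.length);
         ASSERT_EQ(block.popcount, expected.popcount);
       } while (block.length > 0);
       ```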




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r437195110



##########
File path: cpp/src/arrow/compute/benchmark_util.h
##########
@@ -77,12 +77,18 @@ struct RegressionArgs {
   const int64_t size;
 
   // proportion of nulls in generated arrays
-  const double null_proportion;
+  double null_proportion;
 
   explicit RegressionArgs(benchmark::State& state)
       : size(state.range(0)),
-        null_proportion(std::min(1., 1. / static_cast<double>(state.range(1)))),
-        state_(state) {}
+        null_proportion(),
+        state_(state) {
+    if (state.range(1) == 0) {
+      this->null_proportion = 0.0;
+    } else {
+      this->null_proportion = std::min(1., 1. / static_cast<double>(state.range(1)));
+    }
+  }
 

Review comment:
       Can the new zero-null-proportion case also be added to the pre-built argument list in BenchmarkSetArgsWithSizes? That is, in

       void BenchmarkSetArgsWithSizes(benchmark::internal::Benchmark* bench,
          ...
          for (auto nulls : std::vector<ArgsType>({10000, 1000, 100, 50, 10, 1}))

       could a 0 be added?
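
   For illustration, a hedged sketch of what the suggested change could look like; the actual body and signature of BenchmarkSetArgsWithSizes in the PR may differ, and ArgsType is assumed here to be an integer type compatible with benchmark::internal::Benchmark::Args:

   ```cpp
   #include <cstdint>
   #include <vector>

   #include "benchmark/benchmark.h"

   using ArgsType = int64_t;  // assumption; matches the element type used above

   // Hypothetical variant with a trailing 0 so the zero-null-proportion case
   // (handled by the RegressionArgs change in this hunk) is also benchmarked.
   void BenchmarkSetArgsWithSizes(benchmark::internal::Benchmark* bench,
                                  const std::vector<ArgsType>& sizes) {
     for (const auto size : sizes) {
       for (const auto nulls : std::vector<ArgsType>({10000, 1000, 100, 50, 10, 1, 0})) {
         bench->Args({size, nulls});
       }
     }
   }
   ```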




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r438824681



##########
File path: cpp/src/arrow/util/bit_block_counter.h
##########
@@ -17,14 +17,21 @@
 
 #pragma once
 
+#include <algorithm>
 #include <cstdint>
+#include <limits>
+#include <memory>
 
+#include "arrow/array/array_base.h"

Review comment:
       Don't think so




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-641433457


   OK, I have added more unit tests for the boundschecking code and more rigorous unit testing for string/fixed-size-binary types in vector_take_test.cc. I'm going to get the test suite passing again, but this can be reviewed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r438823668



##########
File path: cpp/src/arrow/compute/kernels/vector_take_test.cc
##########
@@ -211,13 +306,63 @@ TYPED_TEST(TestTakeKernelWithString, TakeString) {
                                      "[2, 5]", &arr));
 }
 
+TEST(TestTakeKernelString, Random) {
+  DoRandomTakeTests<StringType>(
+      [](int64_t length, double null_probability, random::RandomArrayGenerator* rng) {
+        return rng->String(length, 0, 32, null_probability);
+      });
+  DoRandomTakeTests<LargeStringType>(
+      [](int64_t length, double null_probability, random::RandomArrayGenerator* rng) {
+        return rng->LargeString(length, 0, 32, null_probability);
+      });

Review comment:
       Good idea 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642848987


   BTW, I ran benchmarks for this with MSVC 2017 with and without mimalloc, and mimalloc has a pretty big impact. We should definitely endeavor to ship mimalloc in all of our Windows binaries in the next release.
   
   https://gist.github.com/wesm/5b0841e583a27a8dc71f951b30475015/revisions?diff=split


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642039038


   I noticed some bugs while reviewing; I'll push some changes here in a few minutes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm edited a comment on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-641433457


   OK, I have added more unit tests for the boundschecking code and more rigorous unit testing for string/fixed-size-binary types in vector_take_test.cc. I'm going to get the test suite passing again, but this can be reviewed.
   
   Latest benchmark run:
   
   * Old code: https://gist.github.com/wesm/85ed9b59a24f2f63c9eb3b2bb12bf364
   * New code: https://gist.github.com/wesm/1a83bc6bb3cb8c1b87943cd49a0ac3ff


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642736534


   Travis-CI: https://travis-ci.org/github/wesm/arrow/builds/697263135


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#issuecomment-642689244


   I'm going to push changes on this PR, I think. Please hold on :-).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7382: ARROW-5760: [C++] New compute::Take implementation for better performance, faster dispatch, smaller code size / faster compilation

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7382:
URL: https://github.com/apache/arrow/pull/7382#discussion_r437580946



##########
File path: cpp/src/arrow/compute/kernels/vector_take.cc
##########
@@ -38,44 +55,715 @@ std::unique_ptr<KernelState> InitTake(KernelContext*, const KernelInitArgs& args
   return std::unique_ptr<KernelState>(new TakeState{*take_options});
 }
 
-template <typename ValueType, typename IndexType>
-struct TakeFunctor {
-  using ValueArrayType = typename TypeTraits<ValueType>::ArrayType;
-  using IndexArrayType = typename TypeTraits<IndexType>::ArrayType;
-  using IS = ArrayIndexSequence<IndexType>;
+namespace {}  // namespace
+
+// ----------------------------------------------------------------------
+// Implement optimized take for primitive types from boolean to 1/2/4/8-byte
+// C-type based types. Use a common implementation for every byte width and
+// only generate code for unsigned integer indices, since after boundschecking
+// the indices for negative values we can safely reinterpret_cast signed
+// integers as unsigned.
+
+struct PrimitiveTakeArgs {
+  const uint8_t* values;
+  const uint8_t* values_bitmap = nullptr;
+  int values_bit_width;
+  int64_t values_length;
+  int64_t values_offset;
+  int64_t values_null_count;
+  const uint8_t* indices;
+  const uint8_t* indices_bitmap = nullptr;
+  int indices_bit_width;
+  int64_t indices_length;
+  int64_t indices_offset;
+  int64_t indices_null_count;
+};
+
+// Reduce code size by dealing with the unboxing of the kernel inputs once
+// rather than duplicating compiled code to do all these in each kernel.
+PrimitiveTakeArgs GetPrimitiveTakeArgs(const ExecBatch& batch) {
+  PrimitiveTakeArgs args;
+
+  const ArrayData& arg0 = *batch[0].array();
+  const ArrayData& arg1 = *batch[1].array();
+
+  // Values
+  args.values_bit_width = static_cast<const FixedWidthType&>(*arg0.type).bit_width();
+  args.values = arg0.buffers[1]->data();
+  if (args.values_bit_width > 1) {
+    args.values += arg0.offset * args.values_bit_width / 8;
+  }
+  args.values_length = arg0.length;
+  args.values_offset = arg0.offset;
+  args.values_null_count = arg0.GetNullCount();
+  if (arg0.buffers[0]) {
+    args.values_bitmap = arg0.buffers[0]->data();
+  }
+
+  // Indices
+  args.indices_bit_width = static_cast<const FixedWidthType&>(*arg1.type).bit_width();
+  args.indices = arg1.buffers[1]->data() + arg1.offset * args.indices_bit_width / 8;
+  args.indices_length = arg1.length;
+  args.indices_offset = arg1.offset;
+  args.indices_null_count = arg1.GetNullCount();
+  if (arg1.buffers[0]) {
+    args.indices_bitmap = arg1.buffers[0]->data();
+  }
+
+  return args;
+}
+
+/// \brief The Take implementation for primitive (fixed-width) types does not
+/// use the logical Arrow type but rather the physical C type. This way we
+/// only generate one take function for each byte width.
+///
+/// This function assumes that the indices have been boundschecked.
+template <typename IndexCType, typename ValueCType>
+struct PrimitiveTakeImpl {
+  static void Exec(const PrimitiveTakeArgs& args, Datum* out_datum) {
+    auto values = reinterpret_cast<const ValueCType*>(args.values);
+    auto values_bitmap = args.values_bitmap;
+    auto values_offset = args.values_offset;
+
+    auto indices = reinterpret_cast<const IndexCType*>(args.indices);
+    auto indices_bitmap = args.indices_bitmap;
+    auto indices_offset = args.indices_offset;
+
+    ArrayData* out_arr = out_datum->mutable_array();
+    auto out = out_arr->GetMutableValues<ValueCType>(1);
+    auto out_bitmap = out_arr->buffers[0]->mutable_data();
+    auto out_offset = out_arr->offset;
+
+    // If either the values or indices have nulls, we preemptively zero out the
+    // out validity bitmap so that we don't have to use ClearBit in each
+    // iteration for nulls.
+    if (args.values_null_count > 0 || args.indices_null_count > 0) {
+      BitUtil::SetBitsTo(out_bitmap, out_offset, args.indices_length, false);
+    }
+
+    OptionalBitBlockCounter indices_bit_counter(indices_bitmap, indices_offset,
+                                                args.indices_length);
+    int64_t position = 0;
+    int64_t valid_count = 0;
+    while (true) {
+      BitBlockCount block = indices_bit_counter.NextBlock();
+      if (block.length == 0) {
+        // All indices processed.
+        break;
+      }
+      if (args.values_null_count == 0) {
+        // Values are never null, so things are easier
+        valid_count += block.popcount;
+        if (block.popcount == block.length) {
+          // Fastest path: neither values nor index nulls
+          BitUtil::SetBitsTo(out_bitmap, out_offset + position, block.length, true);
+          for (int64_t i = 0; i < block.length; ++i) {
+            out[position] = values[indices[position]];
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some indices but not all are null
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              out[position] = values[indices[position]];
+            }
+            ++position;
+          }
+        }
+      } else {
+        // Values have nulls, so we must do random access into the values bitmap
+        if (block.popcount == block.length) {
+          // Faster path: indices are not null but values may be
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(values_bitmap, values_offset + indices[position])) {
+              // value is not null
+              out[position] = values[indices[position]];
+              BitUtil::SetBit(out_bitmap, out_offset + position);
+              ++valid_count;
+            }
+            ++position;
+          }
+        } else if (block.popcount > 0) {
+          // Slow path: some but not all indices are null. Since we are doing
+          // random access in general we have to check the value nullness one by
+          // one.
+          for (int64_t i = 0; i < block.length; ++i) {
+            if (BitUtil::GetBit(indices_bitmap, indices_offset + position)) {
+              // index is not null
+              if (BitUtil::GetBit(values_bitmap, values_offset + indices[position])) {
+                // value is not null
+                out[position] = values[indices[position]];
+                BitUtil::SetBit(out_bitmap, out_offset + position);
+                ++valid_count;
+              }
+            }
+            ++position;
+          }
+        }
+      }
+    }
+    out_arr->null_count = out_arr->length - valid_count;
+  }
+};
+
+template <typename IndexCType>
+struct BooleanTakeImpl {

Review comment:
       TODO: I will add some random data unit tests for boolean type, which are only sparsely tested in the test suite
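
   For reference, a hypothetical sketch of such a test, modeled on the DoRandomTakeTests string variants quoted earlier from vector_take_test.cc; the Boolean generator parameters on random::RandomArrayGenerator are assumed here, not taken from the PR:

   ```cpp
   // Hypothetical random-data Take test for boolean values (sketch only).
   // Assumes rng->Boolean(length, true_probability, null_probability) exists.
   TEST(TestTakeKernelBoolean, Random) {
     DoRandomTakeTests<BooleanType>(
         [](int64_t length, double null_probability, random::RandomArrayGenerator* rng) {
           return rng->Boolean(length, /*true_probability=*/0.5, null_probability);
         });
   }
   ```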




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org