Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/27 22:38:35 UTC

[GitHub] [arrow] edponce opened a new pull request #11023: ARROW-12712: [C++] String repeat kernel

edponce opened a new pull request #11023:
URL: https://github.com/apache/arrow/pull/11023


   This PR adds the string repeat compute function, named "str_repeat". This function works on any type of string input (ASCII, UTF8).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739522336



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -402,16 +401,16 @@ struct StringTransformExecBase {
     if (!input.is_valid) {
       return Status::OK();
     }
-    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
-    result->is_valid = true;
     const int64_t data_nbytes = static_cast<int64_t>(input.value->size());
-
     const int64_t output_ncodeunits_max = transform->MaxCodeunits(1, data_nbytes);
     if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
       return Status::CapacityError(
           "Result might not fit in a 32bit utf8 array, convert to large_utf8");
     }
+
     ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());

Review comment:
       This is a very good observation. A simple search through the C++ codebase shows that both patterns are used. I agree with having `nullptr` checks after `checked_cast()`. I will ask in the Zulip dev channel to see whether this is a pattern we want to enforce. If so, we should create a JIRA issue.
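
For readers unfamiliar with the pattern being discussed, the following is a minimal, self-contained sketch of a `nullptr` check after a pointer cast. `checked_cast_sketch`, `MarkValid`, and the scalar structs are hypothetical placeholders, not Arrow's classes. The sketch always uses `dynamic_cast`; it assumes (without confirmation from this thread) that the real `checked_cast` may degrade to a `static_cast` in release builds, in which case such a check would only be effective in debug builds.

```cpp
#include <cassert>
#include <string>

// Simplified placeholder types (not Arrow's real classes).
struct Scalar {
  virtual ~Scalar() = default;
  bool is_valid = false;
};
struct BaseBinaryScalar : Scalar {
  std::string value;
};
struct Int64Scalar : Scalar {
  long long value = 0;
};

// Hypothetical stand-in for a checked cast: it always uses dynamic_cast, so a
// mismatched dynamic type yields nullptr instead of undefined behavior.
template <typename To, typename From>
To checked_cast_sketch(From* ptr) {
  return dynamic_cast<To>(ptr);
}

// Returns false instead of dereferencing a null pointer when the output
// scalar is not a binary scalar.
bool MarkValid(Scalar* out) {
  auto* result = checked_cast_sketch<BaseBinaryScalar*>(out);
  if (result == nullptr) {
    return false;  // wrong dynamic type: report failure to the caller
  }
  result->is_valid = true;
  return true;
}
```

With this shape, a caller can turn a failed cast into an error status rather than crashing on the first member access.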







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739529736



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }

Review comment:
       I followed the existing pattern. The caveat is that StringTransforms can override `InvalidStatus()` and provide a custom error/invalid message ([see `AsciiReverseTransform`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L689)).
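
As a rough illustration of the override mechanism mentioned above, here is a self-contained sketch. The `Status` struct is a minimal placeholder rather than Arrow's `Status`, and `AsciiOnlyTransform`, `ErrorMessageFor`, and the custom message text are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal placeholder for a status type (not Arrow's Status class).
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status Invalid(std::string msg) { return {false, std::move(msg)}; }
};

// Base transform with a default, UTF8-oriented error message.
struct TransformBase {
  virtual ~TransformBase() = default;
  virtual Status InvalidStatus() {
    return Status::Invalid("Invalid UTF8 sequence in input");
  }
};

// Hypothetical ASCII-only transform that replaces the default message.
struct AsciiOnlyTransform : TransformBase {
  Status InvalidStatus() override {
    return Status::Invalid("Non-ASCII sequence in input");
  }
};

// The calling code only sees the base interface; virtual dispatch selects
// the message of the concrete transform.
std::string ErrorMessageFor(TransformBase& transform) {
  return transform.InvalidStatus().message;
}
```

The design choice is that the exec code stays generic and each transform decides how to describe its own failure.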







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739607083



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);

Review comment:
       I extended the description to be more informative about the function parameters and return value.
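
To make the documented `Transform` signature concrete, here is a minimal sketch of a transform that could satisfy it, assuming `ViewType2` is `int64_t`. `RepeatTransformSketch` and the `Repeat` helper are hypothetical and are not the implementation in this PR. The return value is the number of bytes written to `output`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical transform matching the documented signature, with ViewType2
// assumed to be int64_t. It writes `input` repeated `num_repeats` times into
// `output` and returns the number of bytes written (0 for non-positive
// counts). The caller must have allocated at least
// input_string_ncodeunits * num_repeats bytes.
struct RepeatTransformSketch {
  int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
                    const int64_t num_repeats, uint8_t* output) {
    if (num_repeats <= 0 || input_string_ncodeunits <= 0) return 0;
    uint8_t* out = output;
    for (int64_t i = 0; i < num_repeats; ++i) {
      std::memcpy(out, input, static_cast<std::size_t>(input_string_ncodeunits));
      out += input_string_ncodeunits;
    }
    return out - output;
  }
};

// Convenience wrapper for trying the transform on std::string values.
std::string Repeat(const std::string& s, int64_t n) {
  const int64_t reps = n > 0 ? n : 0;
  std::string out(s.size() * static_cast<std::size_t>(reps), '\0');
  RepeatTransformSketch transform;
  const int64_t written = transform.Transform(
      reinterpret_cast<const uint8_t*>(s.data()), static_cast<int64_t>(s.size()), n,
      reinterpret_cast<uint8_t*>(&out[0]));
  out.resize(static_cast<std::size_t>(written));
  return out;
}
```

Treating negative counts like zero mirrors the expectations in the `StrRepeat` test quoted in this thread, where a repeat count of -1 yields empty strings.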







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739611163



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+    const auto& binary_scalar1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    const auto input_string = binary_scalar1.value->data();
+    const auto input_ncodeunits = binary_scalar1.value->size();
+    const auto value2 = UnboxScalar<Type2>::Unbox(*scalar2);
+
+    // Calculate max number of output codeunits
+    const auto max_output_ncodeunits = transform->MaxCodeunits(input_ncodeunits, value2);

Review comment:
       The output size depends on the transform and the input encoding (binary/ASCII/UTF8). Also, `MaxCodeunits()` does not need to calculate the exact output size, because [a resizing operation is performed at the end of the kernel exec](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L422); it only needs to guarantee that enough space is allocated, never less.
   
   Binary/ASCII transforms that do not change the size (uppercase, title, capitalize, etc.) [use the default `MaxCodeunits()`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L330). On the other hand, the [default `MaxCodeunits()` for UTF8 transforms](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L555) allocates more for the output.
   
   Some transforms will have different estimates for the output size (as is the case in this PR, so `MaxCodeunits()` is overridden). This is the first "binary string transform" implemented as such, so I decided to generalize the machinery in order to support other ones.
   
   Most importantly, note that [many string transforms implement their own `kernel exec`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1010) and do not use `MaxCodeunits()`. Hopefully, as the variety of patterns in string transforms stabilizes, we can use consistent `kernel execs` and perform similarly.
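
The allocate-for-the-maximum, then shrink approach described above can be sketched as follows. This is an illustration under simplifying assumptions: a `std::vector` stands in for the kernel's buffer, and `MaxCodeunitsSketch`, `RemoveSpaces`, and `Run` are hypothetical names, not functions from the Arrow codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical upper bound: dropping characters can never grow the input,
// so the input size is a safe (possibly loose) maximum.
int64_t MaxCodeunitsSketch(int64_t input_ncodeunits) { return input_ncodeunits; }

// Hypothetical transform: copies every byte except ASCII spaces and returns
// the number of bytes actually written.
int64_t RemoveSpaces(const uint8_t* input, int64_t input_ncodeunits,
                     uint8_t* output) {
  int64_t written = 0;
  for (int64_t i = 0; i < input_ncodeunits; ++i) {
    if (input[i] != ' ') output[written++] = input[i];
  }
  return written;
}

// Allocate for the worst case, transform, then shrink to the exact size.
std::vector<uint8_t> Run(const std::string& s) {
  const auto n = static_cast<int64_t>(s.size());
  std::vector<uint8_t> buffer(static_cast<std::size_t>(MaxCodeunitsSketch(n)));
  const int64_t written =
      RemoveSpaces(reinterpret_cast<const uint8_t*>(s.data()), n, buffer.data());
  buffer.resize(static_cast<std::size_t>(written));  // trim the over-allocation
  return buffer;
}
```

The bound only has to be safe, not tight: an estimate that is too small would let the transform write past the buffer, while one that is too large merely costs temporary memory until the final resize.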







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739603493



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }

Review comment:
       Maybe `InvalidSequence` is a better and more generic name.







[GitHub] [arrow] lidavidm commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741224558



##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       https://arrow.apache.org/docs/developers/cpp/development.html#cleaning-includes-with-include-what-you-use-iwyu







[GitHub] [arrow] ianmcook commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741575436



##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       `strrep()` is a function in base R. There is also a function [`str_dup()`](https://stringr.tidyverse.org/reference/str_dup.html) in the popular R package **stringr** that does exactly the same thing. In the R bindings we often like to add these **stringr** variants of the functions too:
   ```suggestion
     "strrep" = "binary_repeat",
     "str_dup" = "binary_repeat"
   ```

##########
File path: r/tests/testthat/test-dplyr-funcs-string.R
##########
@@ -467,6 +467,18 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("strrep", {
+  df <- tibble(x = c("foo1", " \tB a R\n", "!apACHe aRroW!"))
+  for (times in 0:8L) {
+    compare_dplyr_binding(
+      .input %>%
+        mutate(x = strrep(x, times)) %>%
+        collect(),
+      df
+    )
+  }
+})
+

Review comment:
       Adds a test for the `str_dup()` binding I suggested above. Also FYI you don't need the `L` after `8` because the `:` operator in R always creates integer vectors when its operands are whole numbers.
   ```suggestion
   test_that("strrep, str_dup", {
     df <- tibble(x = c("foo1", " \tB a R\n", "!apACHe aRroW!"))
     for (times in 0:8) {
       compare_dplyr_binding(
         .input %>%
           mutate(x = strrep(x, times)) %>%
           collect(),
         df
       )
       compare_dplyr_binding(
         .input %>%
           mutate(x = str_dup(x, times)) %>%
           collect(),
         df
       )
     }
   })
   ```








[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-960061949


   Renamed the [R internal function `str_dup`](https://github.com/apache/arrow/blob/master/r/R/type.R#L484) to `duplicate_string` because it was shadowing stringr's `str_dup` and the kernel binding for `binary_repeat`.





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731389134



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       It seems we cannot use the `VisitBitBlocks` utilities here. When processing an `Array`, the current `StringBinaryTransformExecBase` implementation must set the output string offsets (`output_string_offsets`) at both non-null and null positions, so both visitors need the `position` being visited. Currently, the null visitor does not receive the `position` value as an argument.
   ```c++
   offset_type output_ncodeunits = 0;
   for (i = 0...) {
     if (!input1.IsNull(i)) {
       ...
       offset_type encoded_bytes = Transform(...);
       ...
       output_ncodeunits += encoded_bytes;
     }
     // This needs to be updated for Null/NotNull visitors
     output_string_offsets[i + 1] = output_ncodeunits;
   }
   ```
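
A runnable version of the loop outlined above may help show why null slots still need an offset. This is a simplified sketch: a `std::vector<bool>` stands in for the validity bitmap, and `StringColumnSketch` and `DoubleEach` are hypothetical names. The "transform" simply writes each valid string twice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for a string array: offsets plus concatenated bytes.
struct StringColumnSketch {
  std::vector<int32_t> offsets;  // length = number of slots + 1
  std::string data;              // concatenated bytes of non-null values
};

// Writes each valid string twice; null slots contribute no bytes.
StringColumnSketch DoubleEach(const std::vector<std::string>& values,
                              const std::vector<bool>& is_valid) {
  StringColumnSketch out;
  out.offsets.reserve(values.size() + 1);
  out.offsets.push_back(0);
  int32_t output_ncodeunits = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (is_valid[i]) {
      out.data += values[i];
      out.data += values[i];
      output_ncodeunits += static_cast<int32_t>(2 * values[i].size());
    }
    // Written for every slot: a null slot repeats the previous offset,
    // which gives it a zero-length value.
    out.offsets.push_back(output_ncodeunits);
  }
  return out;
}
```

Because the offset at index `i + 1` is written on every iteration, a visitor that handles null slots would need to know which index it is visiting, which is the limitation described in the comment.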







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-919376030


   Based on the [semantics of the scalar binary kernels](https://arrow.apache.org/docs/cpp/compute.html#element-wise-scalar-functions), I am adding a kernel exec generator for binary string transforms. This includes an output adapter and array iterator.
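
As a simplified picture of the two main input shapes for a string repeat function, the sketch below contrasts broadcasting a scalar count with pairing two arrays element-wise. It uses `std::vector` in place of Arrow arrays, ignores nulls, and `RepeatOne`, `RepeatArrayScalar`, and `RepeatArrayArray` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Repeats one string n times; non-positive counts give an empty string.
std::string RepeatOne(const std::string& s, int64_t n) {
  std::string out;
  for (int64_t i = 0; i < n; ++i) out += s;
  return out;
}

// (Array, Scalar): the scalar count is broadcast to every element.
std::vector<std::string> RepeatArrayScalar(const std::vector<std::string>& strings,
                                           int64_t n) {
  std::vector<std::string> out;
  out.reserve(strings.size());
  for (const auto& s : strings) out.push_back(RepeatOne(s, n));
  return out;
}

// (Array, Array): inputs are paired element-wise and must have equal length.
std::vector<std::string> RepeatArrayArray(const std::vector<std::string>& strings,
                                          const std::vector<int64_t>& counts) {
  assert(strings.size() == counts.size());
  std::vector<std::string> out;
  out.reserve(strings.size());
  for (std::size_t i = 0; i < strings.size(); ++i) {
    out.push_back(RepeatOne(strings[i], counts[i]));
  }
  return out;
}
```

In both shapes the output has the same length as the array input, which is the property the exec generator relies on when it preallocates offsets.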
    





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706461838



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -557,6 +558,36 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StrRepeat) {
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> repeats_and_expected{{
+      {-1, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1,
+       R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])"},
+      {3,
+       R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbb", "ɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  }};
+
+  for (const auto& pair : repeats_and_expected) {
+    auto repeat = pair.first;
+    auto expected = pair.second;
+    this->CheckVarArgs("str_repeat", {values, Datum(repeat)}, this->type(), expected);

Review comment:
       I added tests for `repeat` input of different integer types.




[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706358970



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach

Review comment:
       OK, I will change the comment to `Repeated doubling of string`.
   
   In terms of performance, I ran isolated benchmarks comparing several *copy* approaches: the log2 approach was faster in all cases where `nrepeats >= 4`, and for `nrepeats < 4` it was not unreasonably slower than direct copies. [In my initial PR, I had an `if-else` to handle this](https://github.com/apache/arrow/pull/11023/commits/a0e327d2751137a8b7d47ad524c848eef65066ff#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2596-R2612), but I concluded that checking the condition for every value, in addition to maintaining two approaches, was not an improvement.
   
   This circles back to some of my previous comments/ideas: the Exec methods should provide a mechanism for selecting kernel `Transform/Call` variants based on these higher-level options. More on this very soon.
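
For readers unfamiliar with the two copy strategies being compared, the following is a rough standalone sketch, assuming `std::string` outputs rather than the raw `uint8_t*` buffers the kernel uses. `RepeatSimple` and `RepeatDoubling` are hypothetical names and this is not the PR's implementation; the sketch itself makes no performance claim.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Direct approach: one memcpy per repetition.
std::string RepeatSimple(const std::string& s, int64_t n) {
  assert(n >= 0);
  std::string out(s.size() * static_cast<std::size_t>(n), '\0');
  for (int64_t i = 0; i < n; ++i) {
    std::memcpy(&out[0] + static_cast<std::size_t>(i) * s.size(), s.data(), s.size());
  }
  return out;
}

// Repeated doubling: copy the already-written prefix onto the unwritten tail,
// so the number of memcpy calls grows roughly with log2(n) instead of n.
std::string RepeatDoubling(const std::string& s, int64_t n) {
  assert(n >= 0);
  const std::size_t total = s.size() * static_cast<std::size_t>(n);
  std::string out(total, '\0');
  if (total == 0) return out;
  std::memcpy(&out[0], s.data(), s.size());
  std::size_t filled = s.size();
  while (filled < total) {
    // The last step copies only what is still missing (the "remainder").
    const std::size_t chunk = std::min(filled, total - filled);
    std::memcpy(&out[0] + filled, out.data(), chunk);
    filled += chunk;
  }
  return out;
}
```

Source and destination ranges never overlap in `RepeatDoubling`, because each copy reads from the first `chunk <= filled` bytes and writes starting at `filled`.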




[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907569782


   Regarding a replicate operation: there was a previous discussion on Zulip about having a general replicate functionality, of which string repeat would be a particular case.
   
   Arrow already has [`MakeArrayFromScalar`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L742) and [`RepeatedArrayFactory`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L493), which use the [concatenate implementation](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/concatenate.cc) internally. Can these be used in this PR? They operate specifically on Array types, whereas the kernel transform method works with raw pointers.


[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739522336



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -402,16 +401,16 @@ struct StringTransformExecBase {
     if (!input.is_valid) {
       return Status::OK();
     }
-    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
-    result->is_valid = true;
     const int64_t data_nbytes = static_cast<int64_t>(input.value->size());
-
     const int64_t output_ncodeunits_max = transform->MaxCodeunits(1, data_nbytes);
     if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
       return Status::CapacityError(
           "Result might not fit in a 32bit utf8 array, convert to large_utf8");
     }
+
     ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());

Review comment:
       This is a very good observation. A quick search through the C++ codebase shows that both patterns are used. I agree with having `nullptr` checks after `checked_cast<...*>()`. I will ask in the Zulip dev chat whether this is a pattern we want to enforce. If so, we should create a JIRA issue.
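
As a rough illustration of the pattern under discussion (a null check after a downcast), here is a standalone sketch. It does not use Arrow: `ScalarLike`, `BinaryScalarLike`, `Int64ScalarLike`, `CastOrNull`, and `MarkValid` are hypothetical stand-ins, and `CastOrNull` is simply `dynamic_cast`, which yields `nullptr` on a type mismatch. This says nothing about what `checked_cast` itself does.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical scalar hierarchy, loosely shaped like the one in the diff.
struct ScalarLike {
  virtual ~ScalarLike() = default;
  bool is_valid = false;
};
struct BinaryScalarLike : ScalarLike {
  std::string value;
};
struct Int64ScalarLike : ScalarLike {
  int64_t value = 0;
};

// Hypothetical helper: returns nullptr when `base` is not actually a Derived.
template <typename Derived, typename Base>
Derived* CastOrNull(Base* base) {
  return dynamic_cast<Derived*>(base);
}

// Marks the output as valid only if it really is a binary scalar.
// Returns false instead of dereferencing a null pointer on a type mismatch.
bool MarkValid(ScalarLike* out) {
  assert(out != nullptr);
  auto* result = CastOrNull<BinaryScalarLike>(out);
  if (result == nullptr) {
    return false;
  }
  result->is_valid = true;
  return true;
}
```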




[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739603493



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }

Review comment:
       Maybe `InvalidSequence` or `InvalidInputSequence` are better and more generic names.




[GitHub] [arrow] lidavidm commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907624892


   You may instead be interested in two things I added recently: [ArrayBuilder::AppendScalar(const Scalar&, int64_t)](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_base.h#L123) and [ArrayBuilder::AppendArraySlice](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_base.h#L129). This would let you implement a generalized repeat without allocating and concatenating lots of intermediate arrays, and would let you preallocate the final array as well. 


[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741811378



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, BinaryRepeatWithScalarRepeat) {
+  auto values = ArrayFromJSON(this->type(),
+                              R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI",
+                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1, R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!",
+              "$. A3", "!ɑⱤⱤow"])"},
+      {4, R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb",
+              "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ",
+              "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!",
+              "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  }};
+
+  for (const auto& pair : nrepeats_and_expected) {
+    auto num_repeat = pair.first;
+    auto expected = pair.second;
+    for (const auto& ty : IntTypes()) {
+      this->CheckVarArgs("binary_repeat",
+                         {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                         this->type(), expected);
+    }
+  }
+
+  // Negative repeat count
+  for (auto num_repeat_ : {-1, -2, -5}) {
+    auto num_repeat = *arrow::MakeScalar(int64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        Invalid, ::testing::HasSubstr("Repeat count must be a non-negative integer"),
+        CallFunction("binary_repeat", {values, num_repeat}));
+  }
+
+  // Floating-point repeat count
+  for (auto num_repeat_ : {0.0, 1.2, -1.3}) {
+    auto num_repeat = *arrow::MakeScalar(float64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        NotImplemented, ::testing::HasSubstr("has no kernel matching input types"),
+        CallFunction("binary_repeat", {values, num_repeat}));
+  }
+}
+
+TYPED_TEST(TestStringKernels, BinaryRepeatWithArrayRepeat) {
+  auto values = ArrayFromJSON(this->type(),
+                              R"([null, "aAazZæÆ&", "", "b", "ɑɽⱤoW", "ıI",
+                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  for (const auto& ty : IntTypes()) {
+    auto num_repeats = ArrayFromJSON(ty, R"([100, 1, 2, 5, 2, 0, 1, 3, 2, 3])");

Review comment:
       Maybe also add a null in the num_repeats? (as that is allowed and will give a null in the result)
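
The null semantics mentioned here (a null on either side gives a null result) can be sketched in isolation as follows. This is a standalone illustration under stated assumptions, not the kernel or the test helper: `NullableString`, `NullableCount`, and `RepeatElementwise` are hypothetical names, and negative counts are simply asserted against rather than reported as an `Invalid` status.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical nullable values: a payload plus a validity flag.
struct NullableString {
  std::string value;
  bool is_valid;
};
struct NullableCount {
  int64_t value;
  bool is_valid;
};

// Element-wise repeat: the output slot is null if either input slot is null.
std::vector<NullableString> RepeatElementwise(const std::vector<NullableString>& strings,
                                              const std::vector<NullableCount>& counts) {
  assert(strings.size() == counts.size());
  std::vector<NullableString> out;
  out.reserve(strings.size());
  for (std::size_t i = 0; i < strings.size(); ++i) {
    if (!strings[i].is_valid || !counts[i].is_valid) {
      out.push_back({"", false});  // null propagates from either argument
      continue;
    }
    assert(counts[i].value >= 0);  // sketch only: no error reporting
    std::string repeated;
    for (int64_t r = 0; r < counts[i].value; ++r) {
      repeated += strings[i].value;
    }
    out.push_back({repeated, true});
  }
  return out;
}
```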




[GitHub] [arrow] pitrou commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741135553



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeatWithScalarRepeat) {

Review comment:
       Is there a place where passing a scalar for the strings argument is tested? Is it implicit in `CheckVarArgs`?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2878,135 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1_ncodeunits * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const ArrayType2& input2) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1.total_values_length() * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const ArrayType2& input2) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  static Result<int64_t> TransformSimpleLoop(const uint8_t* input,
+                                             const int64_t input_string_ncodeunits,
+                                             const int64_t num_repeats, uint8_t* output) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static Result<int64_t> TransformDoublingString(const uint8_t* input,
+                                                 const int64_t input_string_ncodeunits,
+                                                 const int64_t num_repeats,
+                                                 uint8_t* output) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;

Review comment:
       Is there a particular reason for xoring here? I guess it's fine, but it seems like this is really a subtraction?
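
On the XOR question, a small standalone check may help. It assumes `num_repeats >= 1` and that the doubling loop leaves `i` at the largest power of two not exceeding `num_repeats`, which is what the loop bound `i <= num_repeats / 2` in the diff produces. Under those assumptions `i` is exactly the highest set bit of `num_repeats`, so clearing it with XOR and subtracting it give the same value. The function names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Same loop bound as the doubling loop in the diff: returns the largest power
// of two that is <= num_repeats. Assumes num_repeats >= 1.
int64_t LargestPowerOfTwoAtMost(int64_t num_repeats) {
  assert(num_repeats >= 1);
  int64_t i = 1;
  while (i <= num_repeats / 2) {
    i *= 2;
  }
  return i;
}

// Repetitions still missing after the doubling phase, written both ways.
int64_t RemainderXor(int64_t n) { return n ^ LargestPowerOfTwoAtMost(n); }
int64_t RemainderSub(int64_t n) { return n - LargestPowerOfTwoAtMost(n); }
```

For `num_repeats == 0` the two expressions would differ (`0 ^ 1` is 1, while `0 - 1` is -1), so the equivalence relies on that case being handled before this point.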

##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       Both solutions are fine to me.




[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706327065



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();

Review comment:
       Copy-paste side effect: the second `return` is unreachable.







[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907569782


   Regarding a replicate operation, there was a [previous discussion in Zulip chat](https://ursalabs.zulipchat.com/#narrow/stream/271283-help.2Fc.2B.2B/topic/util.20to.20copy.20arrays.20to.20an.20existing.20buffer) about a general replicate functionality, of which string repeat is a particular case.
   
   Arrow already has [`MakeArrayFromScalar`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L742) and [`RepeatedArrayFactory`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L493). Can these be used in this PR?





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706440025



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+      int64_t i = 1;
+      for (int64_t ilen = input_string_ncodeunits; i <= (nrepeats / 2);
+           i *= 2, ilen *= 2) {
+        std::memcpy(output, output_start, ilen);
+        output += ilen;
+      }
+
+      // Epilogue remainder
+      int64_t rem = (nrepeats ^ i) * input_string_ncodeunits;
+      std::memcpy(output, output_start, rem);
+      output += rem;
+    }
+    return output - output_start;
+  }
+};
+
+template <typename Type1, typename Type2>
+using StrRepeat =
+    StringBinaryTransformExec<Type1, Type2, StrRepeatTransform<Type1, Type2>>;
+
+template <template <typename...> class ExecFunctor>

Review comment:
       The `ExecFunctor` is parameterized by the generator dispatcher [`GenerateInteger`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/codegen_internal.h#L1044-L1067), which sets the types for both template parameters of `ExecFunctor`: for _str_repeat_, the first parameter is an `XStringType` and the second is an `XIntXType`.
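
       A simplified, self-contained sketch of this dispatch pattern, assuming only two integer types. `GenerateIntegerSketch`, `IntId`, `RepeatExec` and the `*Type` structs are hypothetical stand-ins, not the actual Arrow definitions. The dispatcher fixes the first template argument and picks the second from a runtime type id.

```cpp
#include <string>

// Hypothetical stand-ins for Arrow type classes.
struct StringType {
  static std::string name() { return "string"; }
};
struct Int32Type {
  static std::string name() { return "int32"; }
};
struct Int64Type {
  static std::string name() { return "int64"; }
};

enum class IntId { INT32, INT64 };

using ExecFn = std::string (*)();

// Plays the role of the ExecFunctor: (string type, integer type).
template <typename Type1, typename Type2>
struct RepeatExec {
  static std::string Exec() { return Type1::name() + "," + Type2::name(); }
};

// Chooses the integer type at runtime and instantiates the functor with it.
template <template <typename...> class ExecFunctor, typename Type0>
ExecFn GenerateIntegerSketch(IntId id) {
  switch (id) {
    case IntId::INT32:
      return ExecFunctor<Type0, Int32Type>::Exec;
    case IntId::INT64:
      return ExecFunctor<Type0, Int64Type>::Exec;
  }
  return nullptr;  // unknown id
}
```

       For example, `GenerateIntegerSketch<RepeatExec, StringType>(IntId::INT64)` returns the `Exec` of `RepeatExec<StringType, Int64Type>`.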







[GitHub] [arrow] lidavidm commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739522959



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -402,16 +401,16 @@ struct StringTransformExecBase {
     if (!input.is_valid) {
       return Status::OK();
     }
-    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
-    result->is_valid = true;
     const int64_t data_nbytes = static_cast<int64_t>(input.value->size());
-
     const int64_t output_ncodeunits_max = transform->MaxCodeunits(1, data_nbytes);
     if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
       return Status::CapacityError(
           "Result might not fit in a 32bit utf8 array, convert to large_utf8");
     }
+
     ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());

Review comment:
       We could just include the check in checked_cast?
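
       One possible reading of this suggestion, sketched outside of Arrow. `CheckedCastNonNull` and the minimal `Scalar` / `BaseBinaryScalar` stand-ins are hypothetical and only mimic the shape of the real classes. The assumption is that a failed cast or a null input should trip an assertion in debug builds, while release builds keep a plain `static_cast`.

```cpp
#include <cassert>

// Hypothetical stand-ins, not the Arrow classes.
struct Scalar {
  virtual ~Scalar() = default;
  bool is_valid = false;
};
struct BaseBinaryScalar : Scalar {};

// Hypothetical variant of a checked cast that also rejects nullptr results.
template <typename OutputPtr, typename Input>
OutputPtr CheckedCastNonNull(Input* value) {
#ifdef NDEBUG
  // Release: no runtime check, same cost as static_cast.
  return static_cast<OutputPtr>(value);
#else
  // Debug: dynamic_cast yields nullptr for a wrong type or a null input.
  OutputPtr result = dynamic_cast<OutputPtr>(value);
  assert(result != nullptr && "checked cast failed or input was null");
  return result;
#endif
}
```

       With such a helper, call sites would not need their own `nullptr` check after the cast.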







[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-955765038


   This PR is failing R dplyr tests ([see here](https://github.com/apache/arrow/runs/4053868784?check_suite_focus=true#step:8:17122)). [`strrep` is bound as a binary expression](https://github.com/apache/arrow/pull/11023/files#diff-ed2774950584af59273e99c303c02aa78aa608d982e739fd02f60145ff242e01R104) but the [test fails to find the `strrep` function](https://github.com/apache/arrow/pull/11023/files#diff-db6c692c9cea1ab0ce5ff089ae635c22182e26bdb95668bb16d64c26e8a3bbf0R475).
   Nevertheless, I am able to run these dplyr tests successfully locally.
   _Note_: `strrep` is the Arrow implementation of the base R function with the same name.
   cc @jonkeane 





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740702096



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -19,6 +19,7 @@
 #include <cctype>
 #include <iterator>
 #include <string>
+#include <typeinfo>

Review comment:
       No, I used it when trying to print the `StringTransform` type using `typeid(t).name()` but it printed more info than needed.








[GitHub] [arrow] pitrou commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-957717084


   Hmm, right now the only compute function with a name starting with `string_` is `string_is_ascii`, and it's string-only. Functions which take both binary and string are generally named `binary_something`.
   
   (not saying this is a great naming scheme, but this is what we've been doing and it might be better to remain consistent :-))





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741220267



##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       Not for this PR, so I will revert. Nevertheless, I have noticed that several files are missing includes and some probably have unneeded ones. I think this should be its own JIRA issue.







[GitHub] [arrow] pitrou commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741135553



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeatWithScalarRepeat) {

Review comment:
       Is there a place where passing a scalar for the strings argument is tested? Is it implicit in `CheckVarArgs`?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2878,135 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1_ncodeunits * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const ArrayType2& input2) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1.total_values_length() * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const ArrayType2& input2) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  static Result<int64_t> TransformSimpleLoop(const uint8_t* input,
+                                             const int64_t input_string_ncodeunits,
+                                             const int64_t num_repeats, uint8_t* output) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static Result<int64_t> TransformDoublingString(const uint8_t* input,
+                                                 const int64_t input_string_ncodeunits,
+                                                 const int64_t num_repeats,
+                                                 uint8_t* output) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;

Review comment:
       Is there a particular reason for xoring here? I guess it's fine, but it seems like this is really a subtraction?
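
       To make the point concrete, a standalone sketch of the doubling approach using subtraction for the remainder. `RepeatByDoubling` is a hypothetical helper written for this comment, not code from the PR. After the doubling loop, `copies` is the highest power of two not exceeding `num_repeats`, so XOR only clears that top bit and equals `num_repeats - copies`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: repeat `input` `num_repeats` times by repeated doubling.
std::string RepeatByDoubling(const std::string& input, int64_t num_repeats) {
  assert(num_repeats >= 0);
  if (num_repeats == 0 || input.empty()) return std::string();

  const std::size_t len = input.size();
  std::string out(len * static_cast<std::size_t>(num_repeats), '\0');
  char* start = &out[0];
  char* cursor = start;

  // First copy of the input.
  std::memcpy(cursor, input.data(), len);
  cursor += len;

  // Double the filled prefix while twice as many copies still fit.
  int64_t copies = 1;
  for (std::size_t filled = len; copies <= num_repeats / 2;
       copies *= 2, filled *= 2) {
    std::memcpy(cursor, start, filled);
    cursor += filled;
  }

  // `copies` is the top set bit of `num_repeats`, so XOR equals subtraction.
  assert((num_repeats ^ copies) == (num_repeats - copies));
  const std::size_t rem = static_cast<std::size_t>(num_repeats - copies) * len;
  std::memcpy(cursor, start, rem);
  cursor += rem;

  assert(cursor == start + out.size());
  return out;
}
```

       The source and destination ranges of each `memcpy` never overlap, because the destination always starts where the filled prefix ends.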

##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       Both solutions are fine to me.







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-958101865


   Renamed function to `binary_repeat` and will keep an eye out for naming consistency as we move forward.





[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-960061949


   Renamed [R internal function `str_dup`](https://github.com/apache/arrow/blob/master/r/R/type.R#L484) to `duplicate_string` because it was shadowing stringr's `str_dup` and the kernel binding for `binary_repeat`.





[GitHub] [arrow] ursabot edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-961085402









[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706354519



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.

Review comment:
       OK, I will compute the exact output size. On the bright side, this avoids the resize step at the end of Exec.
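
       As a minimal sketch of what an exact size computation could look like, the snippet below sums each string length multiplied by its repeat count. It uses plain `std::vector` inputs instead of Arrow arrays, and the helper name `ExactRepeatSize`, the `max_size` parameter, and the `-1` overflow convention are assumptions made for illustration, not part of the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): exact number of output bytes for an
// element-wise repeat. Assumes values and repeats have the same length.
// Negative repeat counts are clamped to zero, mirroring the std::max in the
// quoted MaxCodeunits. Returns -1 if the total would exceed max_size.
int64_t ExactRepeatSize(const std::vector<std::string>& values,
                        const std::vector<int64_t>& repeats, int64_t max_size) {
  int64_t total = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    const int64_t len = static_cast<int64_t>(values[i].size());
    const int64_t n = std::max<int64_t>(repeats[i], 0);
    // Check before multiplying so that len * n cannot overflow int64_t.
    if (len != 0 && n > (max_size - total) / len) {
      return -1;
    }
    total += len * n;
  }
  return total;
}
```

       With the exact size known up front, the values buffer can be allocated once at its final size; the cost is one extra pass over the offsets and repeat counts.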







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731389134



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       It seems we cannot use the `VisitBitBlocks` utilities here: the current implementation must set the output string offsets at both non-null and null positions, so both visitors would need to receive the `position` being visited.
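
       To illustrate why every position needs an offset, here is a minimal sketch that builds offsets for valid and null slots alike. It uses plain vectors rather than Arrow buffers; the `RepeatedStrings` struct and `RepeatWithNulls` function are hypothetical names, and the sketch assumes an output slot is null whenever either input is null.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical output layout (not an Arrow type): concatenated bytes plus
// length + 1 offsets and a per-slot validity flag.
struct RepeatedStrings {
  std::vector<int32_t> offsets;
  std::string data;
  std::vector<bool> valid;
};

// Assumes all four inputs have the same length and that the total output
// fits in int32_t; a real kernel would check capacity first.
RepeatedStrings RepeatWithNulls(const std::vector<std::string>& values,
                                const std::vector<bool>& values_valid,
                                const std::vector<int64_t>& repeats,
                                const std::vector<bool>& repeats_valid) {
  RepeatedStrings out;
  out.offsets.reserve(values.size() + 1);
  out.valid.reserve(values.size());
  out.offsets.push_back(0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    const bool is_valid = values_valid[i] && repeats_valid[i];
    if (is_valid) {
      for (int64_t r = 0; r < repeats[i]; ++r) {
        out.data += values[i];
      }
    }
    out.valid.push_back(is_valid);
    // Written for null slots too: a null entry gets a zero-length range, so
    // the offset at i + 1 must still be set to the running total.
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  }
  return out;
}
```

       Because the offset at index `i + 1` is written on both branches, a null-slot callback that does not know its position cannot fill it in directly.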







[GitHub] [arrow] pitrou commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-915892022


   @edponce Please ping when this is ready for review. Thanks!





[GitHub] [arrow] pitrou commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-908251629


   > The current `StringTransformXXX` classes do not easily support non-scalar options. In this PR, we want to be able to do the following:
   > 
   > ```python
   > str_repeat(['a', 'b', 'c'], repeats=[1,2,3])  # ['a', 'bb', 'ccc']
   > ```
   
   To me, this means that the kernel is simply a binary kernel.
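
   A minimal sketch of the binary-kernel view, assuming plain `std::vector` inputs in place of Arrow arrays: the same per-element function serves both the (Array, Array) and (Array, Scalar) forms, with the scalar simply broadcast. The names `RepeatOne`, `RepeatArrayArray`, and `RepeatArrayScalar` are hypothetical and do not appear in the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-element transform: repeat one string nrepeats times.
// Non-positive counts yield an empty string.
std::string RepeatOne(const std::string& value, int64_t nrepeats) {
  std::string out;
  if (nrepeats <= 0) {
    return out;
  }
  out.reserve(value.size() * static_cast<std::size_t>(nrepeats));
  for (int64_t i = 0; i < nrepeats; ++i) {
    out += value;
  }
  return out;
}

// (Array, Array): element-wise pairing; assumes equal lengths.
std::vector<std::string> RepeatArrayArray(const std::vector<std::string>& values,
                                          const std::vector<int64_t>& repeats) {
  std::vector<std::string> out;
  out.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    out.push_back(RepeatOne(values[i], repeats[i]));
  }
  return out;
}

// (Array, Scalar): the scalar is broadcast to every element.
std::vector<std::string> RepeatArrayScalar(const std::vector<std::string>& values,
                                           int64_t nrepeats) {
  std::vector<std::string> out;
  out.reserve(values.size());
  for (const std::string& value : values) {
    out.push_back(RepeatOne(value, nrepeats));
  }
  return out;
}
```

   For the quoted example, `RepeatArrayArray({"a", "b", "c"}, {1, 2, 3})` gives `a`, `bb`, `ccc`.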





[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-915893930


   Ready for review cc @pitrou 





[GitHub] [arrow] pitrou commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r705355207



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();

Review comment:
       This line is dead code.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       You could perhaps use `VisitTwoBitBlocksVoid` to make this slightly faster.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto scalar2 = *input2.GetScalar(i);

Review comment:
       Hmm, that will be very inefficient :-( I hope we can find a better way of doing this. Perhaps use `input2.GetView(i)`.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {

Review comment:
       This could take an `int64_t` instead of a `std::shared_ptr<Scalar>` for the second input...

##########
File path: docs/source/cpp/compute.rst
##########
@@ -694,45 +694,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | String-like            | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | String-like            | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | String-like            | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types            | Output type            | Options class                     | Notes |
++=========================+========+========================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like            | String-like            |                                   | \(1)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like            | String-like            |                                   | \(1)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like            | String-like            |                                   | \(2)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like            | String-like            |                                   | \(1)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like            | String-like            |                                   | \(1)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like            | String-like            |                                   | \(1)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like            | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like            | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like            | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+------------------------+------------------------+-----------------------------------+-------+
+| str_repeat              | Binary | String-like            | String-like            |                                   |       |

Review comment:
       The second input type should be "Integer".

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach

Review comment:
       This comment is a bit misleading (memcpy is not an O(1) operation), though I understand the underlying idea.
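
To make the distinction concrete: the doubling scheme issues roughly log2(k) `memcpy` calls, but the bytes moved still add up to k times the input length. The following is a standalone sketch, not Arrow code; `RepeatDoubling` and `RepeatStats` are hypothetical names introduced only for illustration, and `std::string` stands in for the kernel's output buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical bookkeeping, not part of Arrow: counts the memcpy calls issued
// and the total number of bytes they moved.
struct RepeatStats {
  int64_t ncalls = 0;
  int64_t nbytes = 0;
};

// Hypothetical helper: repeats `input` `nrepeats` times by doubling the
// already-written prefix of the output, then copying the remainder.
std::string RepeatDoubling(const std::string& input, int64_t nrepeats,
                           RepeatStats* stats) {
  assert(stats != nullptr);
  if (nrepeats <= 0 || input.empty()) return std::string();
  const int64_t n = static_cast<int64_t>(input.size());
  std::string out(static_cast<size_t>(n * nrepeats), '\0');
  char* start = &out[0];
  char* cursor = start;

  auto copy = [&](const char* src, int64_t len) {
    std::memcpy(cursor, src, static_cast<size_t>(len));
    cursor += len;
    stats->ncalls += 1;
    stats->nbytes += len;
  };

  // First copy of the input, then double the written prefix while it fits.
  copy(input.data(), n);
  int64_t copies = 1;
  while (copies <= nrepeats / 2) {
    copy(start, copies * n);
    copies *= 2;
  }
  // Remainder: fewer than `copies` repetitions are still missing.
  const int64_t rem = (nrepeats - copies) * n;
  if (rem > 0) copy(start, rem);
  return out;
}
```

With a one-byte input and 1000 repeats, this sketch issues 11 `memcpy` calls yet still moves 1000 bytes, so the work is linear in the output size; only the number of calls is logarithmic.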

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");

Review comment:
       "scalar" rather than "array"?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");

Review comment:
       This wouldn't be too difficult to implement, would it?
   (note: perhaps some repetition can be avoided by factoring out common pieces of code between the four `ExecXXX` variants, though I'm not sure how easy that is)
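
As a rough illustration of what the (scalar, array) case amounts to for a repeat-style transform: the same string is paired with every repeat count, and a null count yields a null (empty) slot. This is a standalone sketch under simplifying assumptions, not Arrow code; `StringColumnSketch` and `RepeatScalarArray` are hypothetical names, and plain vectors stand in for Arrow buffers and validity bitmaps.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a string array: offsets has length + 1 entries.
struct StringColumnSketch {
  std::vector<int32_t> offsets;
  std::string data;
  std::vector<bool> valid;
};

// Hypothetical helper: pairs one string with each repeat count in the array.
StringColumnSketch RepeatScalarArray(const std::string& scalar1,
                                     const std::vector<int64_t>& nrepeats,
                                     const std::vector<bool>& nrepeats_valid) {
  assert(nrepeats.size() == nrepeats_valid.size());
  StringColumnSketch out;
  out.offsets.reserve(nrepeats.size() + 1);
  out.offsets.push_back(0);
  for (size_t i = 0; i < nrepeats.size(); ++i) {
    const bool is_valid = nrepeats_valid[i];
    if (is_valid) {
      // Non-positive counts produce an empty string.
      for (int64_t k = 0; k < nrepeats[i]; ++k) out.data += scalar1;
    }
    out.valid.push_back(is_valid);
    // A null slot repeats the previous offset, i.e. it has zero length.
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  }
  return out;
}
```

The loop body is the same as in the (array, scalar) case with the roles of the two inputs swapped, which hints at where the four variants could share code.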

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;

Review comment:
       This default implementation looks arbitrary. IMHO it would be safer to make it pure virtual.
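
A minimal sketch of the pure-virtual variant, with the second input simplified to a plain `int64_t` so that it stands alone. `BinaryTransformBaseSketch` and `RepeatSketch` are hypothetical names, not the classes in this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical base class: no default size estimate, so every transform has
// to state its own upper bound explicitly.
struct BinaryTransformBaseSketch {
  virtual ~BinaryTransformBaseSketch() = default;
  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
                               int64_t nrepeats) = 0;
};

// Hypothetical transform: the output is at most input size times repeat count.
struct RepeatSketch : public BinaryTransformBaseSketch {
  int64_t MaxCodeunits(int64_t /*ninputs*/, int64_t input_ncodeunits,
                       int64_t nrepeats) override {
    return std::max(input_ncodeunits * nrepeats, int64_t(0));
  }
};

// Forgetting to override MaxCodeunits is now a compile-time error instead of
// a silently wrong buffer size.
static_assert(std::is_abstract<BinaryTransformBaseSketch>::value,
              "base class must not be instantiable");
```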

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach

Review comment:
       The irony is that for small nrepeats, this may be slower than the more straightforward approach, of course :-)
   That said, I'm not sure this kernel is really performance-critical.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.

Review comment:
       The problem is that the upper limit may end up huge if there is a single large repeat count in the array.
   
   It seems to me that traversing twice is actually better here (or you can bite the bullet and allow some resizing while building up the output, but that's not compatible with `StringBinaryTransformBase`).
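
To illustrate the gap between the two estimates, here is a standalone sketch that uses plain vectors in place of Arrow arrays. `ExactOutputSize` and `UpperBoundFromMax` are hypothetical names; nulls and integer overflow are deliberately ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: exact output size, at the cost of one extra pass over
// both inputs before the transform runs.
int64_t ExactOutputSize(const std::vector<std::string>& strings,
                        const std::vector<int64_t>& nrepeats) {
  assert(strings.size() == nrepeats.size());
  int64_t total = 0;
  for (size_t i = 0; i < strings.size(); ++i) {
    if (nrepeats[i] > 0) {
      total += static_cast<int64_t>(strings[i].size()) * nrepeats[i];
    }
  }
  return total;
}

// Hypothetical helper: the bound used in the patch, i.e. the total input size
// multiplied by the largest repeat count.
int64_t UpperBoundFromMax(const std::vector<std::string>& strings,
                          const std::vector<int64_t>& nrepeats) {
  assert(strings.size() == nrepeats.size());
  if (strings.empty()) return 0;
  int64_t total_length = 0;
  for (const auto& s : strings) total_length += static_cast<int64_t>(s.size());
  const int64_t max_nrepeats = *std::max_element(nrepeats.begin(), nrepeats.end());
  return std::max(total_length * max_nrepeats, int64_t(0));
}
```

For a ten-byte string repeated once next to a one-byte string repeated 1000 times, the exact size is 1010 bytes while the max-based bound asks for 11000.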

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+      int64_t i = 1;
+      for (int64_t ilen = input_string_ncodeunits; i <= (nrepeats / 2);
+           i *= 2, ilen *= 2) {
+        std::memcpy(output, output_start, ilen);
+        output += ilen;
+      }
+
+      // Epilogue remainder
+      int64_t rem = (nrepeats ^ i) * input_string_ncodeunits;
+      std::memcpy(output, output_start, rem);
+      output += rem;
+    }
+    return output - output_start;
+  }
+};
+
+template <typename Type1, typename Type2>
+using StrRepeat =
+    StringBinaryTransformExec<Type1, Type2, StrRepeatTransform<Type1, Type2>>;
+
+template <template <typename...> class ExecFunctor>

Review comment:
       I'm curious why `ExecFunctor` is declared as a template class here, while below `ExecFunctor` is used without parametrization.
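
For context, a small sketch of what a template template parameter of this shape allows. This is not the PR's registration code (which is not shown in this hunk); `NameOf`, `Utf8`, `LargeUtf8`, `Int64`, and `CollectKernels` are hypothetical stand-ins. The caller passes the bare template name, and the callee instantiates it once per type combination.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical type tags standing in for Arrow data types.
struct Utf8 { static std::string name() { return "utf8"; } };
struct LargeUtf8 { static std::string name() { return "large_utf8"; } };
struct Int64 { static std::string name() { return "int64"; } };

// Hypothetical functor parametrized on two types, like an exec generator.
template <typename Type1, typename Type2>
struct NameOf {
  static std::string Exec() { return Type1::name() + "," + Type2::name(); }
};

// The template template parameter accepts the unparametrized template name;
// the type arguments are supplied inside the function body.
template <template <typename...> class ExecFunctor>
std::vector<std::string> CollectKernels() {
  return {ExecFunctor<Utf8, Int64>::Exec(), ExecFunctor<LargeUtf8, Int64>::Exec()};
}
```

A call such as `CollectKernels<NameOf>()` names the template without arguments, which may be what "used without parametrization" refers to.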

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -557,6 +558,36 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StrRepeat) {
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> repeats_and_expected{{
+      {-1, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1,
+       R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])"},
+      {3,
+       R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbb", "ɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  }};
+
+  for (const auto& pair : repeats_and_expected) {
+    auto repeat = pair.first;
+    auto expected = pair.second;
+    this->CheckVarArgs("str_repeat", {values, Datum(repeat)}, this->type(), expected);

Review comment:
       I'm curious: are we sure `Datum(repeat)` instantiates an integer scalar?
   For the sake of clarity, I would call something like `MakeScalar(repeat, int64())`.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -50,13 +50,14 @@ class BaseTestStringKernels : public ::testing::Test {
     CheckScalarUnary(func_name, type(), json_input, out_ty, json_expected, options);
   }
 
-  void CheckBinaryScalar(std::string func_name, std::string json_left_input,
-                         std::string json_right_scalar, std::shared_ptr<DataType> out_ty,
-                         std::string json_expected,
-                         const FunctionOptions* options = nullptr) {
-    CheckScalarBinaryScalar(func_name, type(), json_left_input, json_right_scalar, out_ty,
-                            json_expected, options);
-  }
+  // void CheckBinaryScalar(std::string func_name, std::string json_left_input,
+  //                        std::string json_right_scalar, std::shared_ptr<DataType>
+  //                        out_ty, std::string json_expected, const FunctionOptions*
+  //                        options = nullptr) {
+  //   CheckScalarBinaryScalar(func_name, type(), json_left_input, json_right_scalar,
+  //   out_ty,
+  //                           json_expected, options);
+  // }

Review comment:
       Why is this commented out?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -557,6 +558,36 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StrRepeat) {
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> repeats_and_expected{{
+      {-1, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1,
+       R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])"},
+      {3,
+       R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbb", "ɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  }};
+
+  for (const auto& pair : repeats_and_expected) {
+    auto repeat = pair.first;
+    auto expected = pair.second;
+    this->CheckVarArgs("str_repeat", {values, Datum(repeat)}, this->type(), expected);
+  }
+}
+
+TYPED_TEST(TestStringKernels, StrRepeats) {
+  auto repeats = ArrayFromJSON(int64(), R"([1, 2, 4, 2, 0, 1, 3, 2, 3, -1])");
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow", "one"])");
+  std::string expected =
+      R"(["aAazZæÆ&", "", "bbbb", "ɑɽⱤoWɑɽⱤoW", "", "ⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow", ""])";
+  this->CheckVarArgs("str_repeat", {values, repeats}, this->type(), expected);

Review comment:
       Nulls should be tested too...







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706543733



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -50,13 +50,14 @@ class BaseTestStringKernels : public ::testing::Test {
     CheckScalarUnary(func_name, type(), json_input, out_ty, json_expected, options);
   }
 
-  void CheckBinaryScalar(std::string func_name, std::string json_left_input,
-                         std::string json_right_scalar, std::shared_ptr<DataType> out_ty,
-                         std::string json_expected,
-                         const FunctionOptions* options = nullptr) {
-    CheckScalarBinaryScalar(func_name, type(), json_left_input, json_right_scalar, out_ty,
-                            json_expected, options);
-  }
+  // void CheckBinaryScalar(std::string func_name, std::string json_left_input,
+  //                        std::string json_right_scalar, std::shared_ptr<DataType>
+  //                        out_ty, std::string json_expected, const FunctionOptions*
+  //                        options = nullptr) {
+  //   CheckScalarBinaryScalar(func_name, type(), json_left_input, json_right_scalar,
+  //   out_ty,
+  //                           json_expected, options);
+  // }

Review comment:
       This was existing code that was not used. I need to provide better support for binary kernels, so that scalar-array/array-scalar can be tested.







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-908250449


   The current `StringTransformXXX` classes do not easily support non-scalar options. In this PR, we want to be able to do the following:
   ```python
   str_repeat(['a', 'b', 'c'], repeats=[1,2,3])  # ['a', 'bb', 'ccc']
   ```
   
   *Possible solution:* Override the [`ExecArray` of `StringTransformExecBase`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L327) and specialize for kernels that require the current index of the input string. This is done by passing the string index to the [`transform->Transform(..., i)` call](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L355-L356). We need to keep in mind that these indexes are relative to the current `ExecBatch` so we need to offset accordingly.
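
The index-passing idea can be sketched with standard containers. This is a simplified model under stated assumptions, not Arrow code: `IndexedRepeatTransform`, its members, and this `ExecArray` are hypothetical, and `batch_offset` stands in for the adjustment needed because indexes are relative to the current batch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical transform that needs the element index to look up its
// per-element option (here, the repeat count).
struct IndexedRepeatTransform {
  std::vector<int64_t> repeats;  // per-element counts for the whole input
  int64_t batch_offset;          // position of this batch's first element

  std::string Transform(const std::string& input, int64_t i) const {
    // i is relative to the batch, so shift it before indexing the counts.
    const int64_t n = repeats[static_cast<size_t>(batch_offset + i)];
    std::string out;
    for (int64_t k = 0; k < n; ++k) out += input;
    return out;
  }
};

// Hypothetical exec loop: passes the batch-relative index to Transform.
std::vector<std::string> ExecArray(const std::vector<std::string>& batch,
                                   const IndexedRepeatTransform& transform) {
  std::vector<std::string> out;
  out.reserve(batch.size());
  for (size_t i = 0; i < batch.size(); ++i) {
    out.push_back(transform.Transform(batch[i], static_cast<int64_t>(i)));
  }
  return out;
}
```

Processing `{"b", "c"}` as a second batch with `batch_offset = 1` yields the same strings as processing the whole input at once, which is the property the offset is meant to preserve.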
   
   cc @pitrou @lidavidm 





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731395418



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       I will take note and maybe we can make these changes in another PR.







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740435242



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;
+
+  static int64_t TransformSimple(const uint8_t* input,
+                                 const int64_t input_string_ncodeunits,
+                                 const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static int64_t TransformDoubling(const uint8_t* input,
+                                   const int64_t input_string_ncodeunits,
+                                   const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;
+    std::memcpy(output, output_start, rem);
+    output += rem;
+    return output - output_start;
+  }
+
+  static int64_t TransformWrapper(const uint8_t* input,
+                                  const int64_t input_string_ncodeunits,
+                                  const int64_t num_repeats, uint8_t* output,
+                                  Status* st) {
+    auto transform = (num_repeats < 4) ? TransformSimple : TransformDoubling;
+    return transform(input, input_string_ncodeunits, num_repeats, output, st);
+  }
+
+  Status PreExec(KernelContext*, const ExecBatch& batch, Datum*) override {
+    // For cases with a scalar repeat count, select the best implementation once
+    // before execution. Otherwise, use TransformWrapper to select implementation
+    // when processing each value.

Review comment:
       I did not measure this, so I will run benchmarks to compare.
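
A self-contained sketch of the two strategies being compared, using `std::string` buffers instead of the kernel's raw output pointer. `RepeatSimple` and `RepeatDoubling` are hypothetical names. The remainder is computed by subtraction; the XOR in the diff gives the same value because `i` ends up as the largest power of two not exceeding the repeat count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// One memcpy per repetition.
std::string RepeatSimple(const std::string& input, int64_t num_repeats) {
  std::string out;
  if (num_repeats <= 0 || input.empty()) return out;
  const size_t len = input.size();
  const size_t n = static_cast<size_t>(num_repeats);
  out.resize(len * n);
  for (size_t i = 0; i < n; ++i) {
    std::memcpy(&out[0] + i * len, input.data(), len);
  }
  return out;
}

// Repeated doubling: copy the already-written prefix onto its own tail, so
// the number of memcpy calls grows with log2(num_repeats).
std::string RepeatDoubling(const std::string& input, int64_t num_repeats) {
  std::string out;
  if (num_repeats <= 0 || input.empty()) return out;
  const size_t len = input.size();
  const size_t n = static_cast<size_t>(num_repeats);
  out.resize(len * n);
  char* start = &out[0];
  std::memcpy(start, input.data(), len);
  size_t copies = 1;  // repetitions written so far
  while (copies <= n / 2) {
    std::memcpy(start + copies * len, start, copies * len);
    copies *= 2;
  }
  // Remainder: fewer than `copies` repetitions are left, so the prefix
  // already holds enough bytes and the ranges do not overlap.
  const size_t rem = n - copies;
  std::memcpy(start + copies * len, start, rem * len);
  return out;
}
```

Which variant is faster for small counts is exactly what the benchmarks mentioned above would have to show; the sketch only demonstrates that both produce the same bytes.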







[GitHub] [arrow] westonpace commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739539355



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -402,16 +401,16 @@ struct StringTransformExecBase {
     if (!input.is_valid) {
       return Status::OK();
     }
-    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
-    result->is_valid = true;
     const int64_t data_nbytes = static_cast<int64_t>(input.value->size());
-
     const int64_t output_ncodeunits_max = transform->MaxCodeunits(1, data_nbytes);
     if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
       return Status::CapacityError(
           "Result might not fit in a 32bit utf8 array, convert to large_utf8");
     }
+
     ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());

Review comment:
       Cases like `const auto& obs = checked_cast<const Type&>(*some_var);` will actually generate a `std::bad_cast` exception so we wouldn't have to worry about those cases.  I suspect it could be solved with specialization.  Zulip dev is probably a good place for it.  No need to tackle in this PR.
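
The pointer/reference difference can be shown with plain `dynamic_cast`. This sketch assumes `checked_cast` forwards to `dynamic_cast`; to my knowledge Arrow does that only in debug builds and uses `static_cast` otherwise, so treat this as an illustration of the language rule rather than of Arrow's behavior in every build. `Base`, `Derived`, `Other`, and the two helper functions are hypothetical.

```cpp
#include <cassert>
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Derived : Base {};
struct Other : Base {};

// Pointer form: a failed cast yields nullptr, so the caller has to check.
bool PointerCastSucceeds(Base* b) {
  return dynamic_cast<Derived*>(b) != nullptr;
}

// Reference form: a failed cast throws std::bad_cast, so there is no null
// result to check.
bool ReferenceCastThrows(Base& b) {
  try {
    (void)dynamic_cast<Derived&>(b);
    return false;
  } catch (const std::bad_cast&) {
    return true;
  }
}
```

Only the pointer form can silently hand back a null result, which is why a `nullptr` check would apply to pointer casts and not to reference casts.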







[GitHub] [arrow] pitrou commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740251713



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -19,6 +19,7 @@
 #include <cctype>
 #include <iterator>
 #include <string>
+#include <typeinfo>

Review comment:
       I don't think this is used, is it?







[GitHub] [arrow] lidavidm commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741054324



##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       nit: is this import necessary?







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741388274



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeatWithScalarRepeat) {

Review comment:
       Yes, it is implicit in `CheckVarArgs`. `CheckVarArgs` invokes [`CheckScalar`, which internally calls the function for each scalar input](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/test_util.cc#L127).







[GitHub] [arrow] ursabot edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-961085402


   Benchmark runs are scheduled for baseline = 5897217ec5ee6f4f58373362a76a70618921c128 and contender = 0ead7c906dafb73c2b2829681845fe5a808a54e9. 0ead7c906dafb73c2b2829681845fe5a808a54e9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/889ba7b7a56b485ea9df25c008235283...682c7eeb6129469c84e53494bb85219e/)
   [Failed :arrow_down:1.54% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/091cc986a04b485bbb53f8d006dd066b...0fd03cce4b804a98883de9f7b6c658c7/)
   [Finished :arrow_down:1.25% :arrow_up:0.89%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/32da2cf3bb9448869c2ef7f8003106a9...5c488ba9051d4cee94d0ec52741539fb/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   





[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-912904767


   This PR depends on https://github.com/apache/arrow/pull/11082 (ARROW-13898), which adds support for string binary compute functions.





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706327671



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");

Review comment:
       Ok, I will implement the missing case and factor out common code.







[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907569782


   Regarding a replicate operation, there was a [previous discussion in Zulip chat](https://ursalabs.zulipchat.com/#narrow/stream/271283-help.2Fc.2B.2B/topic/util.20to.20copy.20arrays.20to.20an.20existing.20buffer) about providing general replicate functionality, of which string repeat would be a particular case.
   
   Arrow already has [`MakeArrayFromScalar`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L742) and [`RepeatedArrayFactory`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L493), which use the [concatenate implementation](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/concatenate.cc) internally. Can these be used in this PR? They operate specifically on Array types, whereas the kernel's transform method uses raw pointers.





[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-960061949









[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741755299



##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       Good to know. Thanks!

##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       There is a minor difference in behavior when given a negative repeat count:
   * `binary_repeat` and `strrep` return an error
   * `str_dup` returns `NA`








[GitHub] [arrow] ursabot edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-961085402









[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907701399


   @lidavidm Those `ArrayBuilder` methods can perform this operation, but using them would mean departing from the common approach used for string kernels, which is based on the existing [`StringTransformXXX` infrastructure](https://github.com/edponce/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L314). Specifically, it would require overriding [`ExecArray()`](https://github.com/edponce/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L327) (while duplicating most of it). As things currently stand, I think using the `ArrayBuilder/MakeScalar` methods for `StrRepeat` is not preferable.
   
   The current `StrRepeat` implementation allocates the entire array for all repeated strings only once, via [`ExecArray()`](https://github.com/edponce/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L335-L343). `StrRepeat` overrides `MaxCodeunits()` to return `input_ncodeunits * n_repeats`.
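   
   To illustrate the single-allocation idea, here is a minimal sketch (not the code in this PR) of computing the output bound up front and checking it against the offset type. `MaxRepeatCodeunits` and `FitsOffsetType` are hypothetical names, treating a negative count as zero is an assumption, and `int64_t` overflow of the product is not handled:
   
   ```c++
   #include <cstdint>
   #include <limits>
   
   // Hypothetical helper: upper bound on the output size in bytes when the
   // input bytes are repeated n_repeats times. Negative counts are treated
   // as zero here (an assumption, not settled behavior).
   inline int64_t MaxRepeatCodeunits(int64_t input_ncodeunits, int64_t n_repeats) {
     if (n_repeats <= 0) return 0;
     return input_ncodeunits * n_repeats;
   }
   
   // Hypothetical helper: true if the bound can be addressed by OffsetType
   // (e.g. int32_t for utf8, int64_t for large_utf8).
   template <typename OffsetType>
   bool FitsOffsetType(int64_t ncodeunits) {
     return ncodeunits <=
            static_cast<int64_t>(std::numeric_limits<OffsetType>::max());
   }
   ```
   
   With such a bound, the values buffer can be allocated once and trimmed afterwards, which is the pattern the existing string transform infrastructure follows.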





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731389134



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       It seems we are not able to use the `VisitBitBlocks` utilities because the current implementation needs to set the output string offsets (`output_string_offsets`) when traversing both non-null and null positions, and this requires both visitors to receive the `position` being visited.
   ```c++
   offset_type output_ncodeunits = 0;
   for (i = 0...) {
     if (!input1.IsNull(i)) {
       ...
       offset_type encoded_bytes = Transform(...);
       ...
       output_ncodeunits += encoded_bytes;
     }
     // This needs to be updated for Null/NotNull visitors
     output_string_offsets[i + 1] = output_ncodeunits;
   }
   ```
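   
   The same point as a self-contained sketch (C++17, not Arrow code): `OptionalStrings` is a hypothetical stand-in for a string array with nulls, and `BuildRepeatOffsets` is a made-up name. The offset at `i + 1` is written for every slot, so the index is needed whether or not the slot is null:
   
   ```c++
   #include <cstddef>
   #include <cstdint>
   #include <optional>
   #include <string>
   #include <vector>
   
   // Hypothetical stand-in for a string array: std::nullopt marks a null slot.
   using OptionalStrings = std::vector<std::optional<std::string>>;
   
   // Computes the output offsets when each non-null string is repeated
   // num_repeats times. Null slots contribute zero bytes but still get an
   // offset entry.
   std::vector<int32_t> BuildRepeatOffsets(const OptionalStrings& input,
                                           int32_t num_repeats) {
     std::vector<int32_t> offsets(input.size() + 1);
     offsets[0] = 0;
     int32_t output_ncodeunits = 0;
     for (std::size_t i = 0; i < input.size(); ++i) {
       if (input[i].has_value()) {
         output_ncodeunits +=
             static_cast<int32_t>(input[i]->size()) * num_repeats;
       }
       // Written for null and non-null slots alike.
       offsets[i + 1] = output_ncodeunits;
     }
     return offsets;
   }
   ```
   
   For an input of `"ab"`, null, `"c"` and a repeat count of 2, the null slot simply repeats the previous offset.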







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-947079157


   I have the following questions which I am not sure how to resolve:
   1. I tried allowing integer, floating-point, and boolean types for the `num_repeats` argument. These are [cast to `Int64Type` via `DispatchBest`](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2925-R2930), but [casting floating-point values to integers triggers a truncation error](https://github.com/apache/arrow/pull/11023/files#diff-f17bc6ceaa2e16784e3ade31f1cafde21fcd6fa19800c6601dac12e28e7fa79dR798). How can this be achieved?
   2. Should an error be returned if the `num_repeats` argument is negative? Currently, a negative value is treated as zero to match Python behavior, but base R `strrep` triggers an error.
   3. I added an [R binding named `strrep`](https://github.com/apache/arrow/pull/11023/files#diff-43be4da1ac54813d9268544a10ea7cf92b2398ff682fee337bcd2a85db98ddd6R337), but the base R version is used instead. cc @jonkeane
   ```r
   Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R
   ```
   
   cc @lidavidm @bkietz 





[GitHub] [arrow] pitrou commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-957717084


   Hmm, right now the only compute function with a name starting with `string_` is `string_is_ascii`, and it's string-only. Functions which take both binary and string are generally named `binary_something`.
   
   (not saying this is a great naming scheme, but this is what we've been doing and it might be better to remain consistent :-))





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740468867



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();

Review comment:
       I was simply following the convention used for the unary string transform cases, but it definitely fits the bill here. Thanks!







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740687132



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;
+
+  static int64_t TransformSimple(const uint8_t* input,
+                                 const int64_t input_string_ncodeunits,
+                                 const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static int64_t TransformDoubling(const uint8_t* input,
+                                   const int64_t input_string_ncodeunits,
+                                   const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;
+    std::memcpy(output, output_start, rem);
+    output += rem;
+    return output - output_start;
+  }
+
+  static int64_t TransformWrapper(const uint8_t* input,
+                                  const int64_t input_string_ncodeunits,
+                                  const int64_t num_repeats, uint8_t* output,
+                                  Status* st) {
+    auto transform = (num_repeats < 4) ? TransformSimple : TransformDoubling;
+    return transform(input, input_string_ncodeunits, num_repeats, output, st);
+  }
+
+  Status PreExec(KernelContext*, const ExecBatch& batch, Datum*) override {
+    // For cases with a scalar repeat count, select the best implementation once
+    // before execution. Otherwise, use TransformWrapper to select implementation
+    // when processing each value.

Review comment:
       Using the `std::function` indirection turned out to be about 2x slower, so good call/intuition on this one. With `std::function`:
   ```
   StringRepeat_mean    622822087 ns    622817699 ns           10 bytes_per_second=25.4509M/s items_per_second=1.68498M/s
   StringRepeat_median  623393064 ns    623390528 ns           10 bytes_per_second=25.4067M/s items_per_second=1.68205M/s
   StringRepeat_stddev   18771545 ns     18770743 ns           10 bytes_per_second=787.511k/s items_per_second=50.9153k/s
   ```
   Checking `num_repeats < 4` at each iteration:
   ```
   StringRepeat_mean    313125674 ns    313123902 ns           10 bytes_per_second=50.601M/s items_per_second=3.35004M/s
   StringRepeat_median  312795031 ns    312794088 ns           10 bytes_per_second=50.6405M/s items_per_second=3.35266M/s
   StringRepeat_stddev    6484104 ns      6484645 ns           10 bytes_per_second=1068k/s items_per_second=69.0502k/s
   ```
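   
   For reference, a standalone sketch of the two copy strategies with the per-call branch (an illustration under assumptions, not the kernel code: `RepeatSimple`, `RepeatDoubling`, and `Repeat` are hypothetical names operating on `std::string`, and non-positive counts yield an empty string here):
   
   ```c++
   #include <cstddef>
   #include <cstdint>
   #include <cstring>
   #include <string>
   
   // One memcpy per repetition.
   inline void RepeatSimple(const uint8_t* input, int64_t nbytes,
                            int64_t num_repeats, uint8_t* output) {
     for (int64_t i = 0; i < num_repeats; ++i) {
       std::memcpy(output + i * nbytes, input, static_cast<std::size_t>(nbytes));
     }
   }
   
   // Repeated doubling: after the first copy, each memcpy duplicates
   // everything written so far, then a final memcpy fills the remainder.
   // Expects num_repeats >= 1.
   inline void RepeatDoubling(const uint8_t* input, int64_t nbytes,
                              int64_t num_repeats, uint8_t* output) {
     std::memcpy(output, input, static_cast<std::size_t>(nbytes));
     int64_t copies = 1;
     while (copies * 2 <= num_repeats) {
       std::memcpy(output + copies * nbytes, output,
                   static_cast<std::size_t>(copies * nbytes));
       copies *= 2;
     }
     std::memcpy(output + copies * nbytes, output,
                 static_cast<std::size_t>((num_repeats - copies) * nbytes));
   }
   
   // Picks a strategy with a plain branch on every call, rather than through
   // a std::function member. The threshold of 4 mirrors the diff above.
   inline std::string Repeat(const std::string& s, int64_t num_repeats) {
     if (num_repeats <= 0 || s.empty()) return std::string();
     std::string out(s.size() * static_cast<std::size_t>(num_repeats), '\0');
     const auto* src = reinterpret_cast<const uint8_t*>(s.data());
     auto* dst = reinterpret_cast<uint8_t*>(&out[0]);
     const auto nbytes = static_cast<int64_t>(s.size());
     if (num_repeats < 4) {
       RepeatSimple(src, nbytes, num_repeats, dst);
     } else {
       RepeatDoubling(src, nbytes, num_repeats, dst);
     }
     return out;
   }
   ```
   
   The branch is cheap and predictable when the repeat count is a scalar, which is consistent with the measurements above, although this sketch itself has not been benchmarked.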







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740688427



##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       I think solution 2 is better.







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740390414



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",
+                           " (", batch[0].ToString(), ", ", batch[1].ToString(),
+                           "). Only Array/Scalar kinds are supported.");

Review comment:
       I added it but maybe it is overkill.
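
For readers following the dispatch in the hunk above, here is a small Arrow-free model of how the four enable flags gate the shape combinations. `Shape`, `ShapeFlags` and `Dispatch` are hypothetical names introduced only for illustration; the real code inspects an `ExecBatch` and returns a `Status` rather than a string.

```cpp
#include <string>

// Hypothetical model of an operand's shape.
enum class Shape { kScalar, kArray };

// Mirrors the four Enable* toggles in StringBinaryTransformBase.
struct ShapeFlags {
  bool enable_scalar_scalar = true;
  bool enable_scalar_array = true;
  bool enable_array_scalar = true;
  bool enable_array_array = true;
};

// Returns the name of the exec path that would run, or the error text when
// the combination is disabled.
std::string Dispatch(const ShapeFlags& flags, Shape lhs, Shape rhs) {
  if (lhs == Shape::kScalar) {
    if (rhs == Shape::kScalar && flags.enable_scalar_scalar) return "ExecScalarScalar";
    if (rhs == Shape::kArray && flags.enable_scalar_array) return "ExecScalarArray";
  } else {
    if (rhs == Shape::kArray && flags.enable_array_array) return "ExecArrayArray";
    if (rhs == Shape::kScalar && flags.enable_array_scalar) return "ExecArrayScalar";
  }
  return "Invalid combination of operands for binary string transform";
}
```

A transform that turns off the two mixed flags, as in the `MyStringTransform` comment in the diff, falls through to the error for (Scalar, Array) and (Array, Scalar) inputs while the two uniform shapes keep working.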







[GitHub] [arrow] lidavidm commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907701537


   Sure, I'm talking about more general repeat methods, though I guess now I question what you might want to repeat other than binary-like types and I suppose lists.





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706352758



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto scalar2 = *input2.GetScalar(i);

Review comment:
       I agree. I thought about using `GetView()`, but it returns a `util::string_view`, which does not have the same API as a `Scalar`, so these Execs would not work if the second input is not a string. This applies to the second parameter of all the Execs. I generalized the API so that it handles `Scalar` and `ArrayData`, and it is up to each implementation to decode the parameter correctly.
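
As a rough illustration of "the implementation decodes the parameter", here is an Arrow-free sketch in which the exec layer hands the second argument over as a type-erased value and the transform recovers the concrete type itself. `BoxedValue`, `BoxedInt64` and `RepeatTransform` are hypothetical stand-ins, not Arrow classes, and `dynamic_cast` is used here only for simplicity; it is not claimed to be what the kernel does.

```cpp
#include <cstdint>
#include <string>

// Hypothetical type-erased argument, playing the role of a generic scalar.
struct BoxedValue {
  virtual ~BoxedValue() = default;
};

// Hypothetical concrete payload for an integer repeat count.
struct BoxedInt64 : BoxedValue {
  explicit BoxedInt64(int64_t v) : value(v) {}
  int64_t value;
};

struct RepeatTransform {
  // The caller passes arg2 through untouched; decoding happens here.
  // Returns false when arg2 is not a non-negative integer count.
  bool Transform(const std::string& input, const BoxedValue& arg2,
                 std::string* out) const {
    const auto* count = dynamic_cast<const BoxedInt64*>(&arg2);
    if (count == nullptr || count->value < 0) return false;
    out->clear();
    for (int64_t i = 0; i < count->value; ++i) out->append(input);
    return true;
  }
};
```

The cost of this design is one decode per call, which is the concern behind the suggestion to pass a typed view instead.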







[GitHub] [arrow] github-actions[bot] commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907510752


   https://issues.apache.org/jira/browse/ARROW-12712





[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-960061949


   Renamed the [R internal function `str_dup`](https://github.com/apache/arrow/blob/master/r/R/type.R#L484) to `duplicate_string` because it was shadowing both stringr's `str_dup` and the kernel binding for `binary_repeat`.
   Thanks to @thisisnic for identifying this subtle issue!





[GitHub] [arrow] lidavidm commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741054324



##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       nit: is this import necessary?

Review comment:
       We have a tool called include-what-you-use, though I don't think it's been run in a while. It might be good to give that a try again. (IIRC, it's a bit finicky to set up.)

Review comment:
       https://arrow.apache.org/docs/developers/cpp/development.html#cleaning-includes-with-include-what-you-use-iwyu










[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741377335



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2878,135 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1_ncodeunits * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const ArrayType2& input2) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1.total_values_length() * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const ArrayType2& input2) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  static Result<int64_t> TransformSimpleLoop(const uint8_t* input,
+                                             const int64_t input_string_ncodeunits,
+                                             const int64_t num_repeats, uint8_t* output) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static Result<int64_t> TransformDoublingString(const uint8_t* input,
+                                                 const int64_t input_string_ncodeunits,
+                                                 const int64_t num_repeats,
+                                                 uint8_t* output) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;

Review comment:
       Not really. After the doubling loop `i` is the largest power of two not exceeding `num_repeats`, so `xor` only clears that highest set bit, and subtraction gives the same result.
   Changed it to subtraction and renamed the variable to `irep` for improved readability.
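
A minimal standalone sketch of the doubling approach with the subtraction-based epilogue. It is not the PR's final code: the `RepeatByDoubling` name and the use of `std::string` are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: builds `num_repeats` copies of `input` by repeatedly
// doubling the part of the output that has already been written.
std::string RepeatByDoubling(const std::string& input, int64_t num_repeats) {
  assert(num_repeats >= 0);
  const int64_t len = static_cast<int64_t>(input.size());
  std::string output(static_cast<std::size_t>(len * num_repeats), '\0');
  if (num_repeats == 0 || len == 0) return output;

  char* out = &output[0];
  std::memcpy(out, input.data(), static_cast<std::size_t>(len));

  // irep counts the copies written so far; it is always a power of two.
  int64_t irep = 1;
  for (int64_t ilen = len; irep <= num_repeats / 2; irep *= 2, ilen *= 2) {
    std::memcpy(out + ilen, out, static_cast<std::size_t>(ilen));
  }

  // Epilogue: irep is now the highest set bit of num_repeats, so
  // (num_repeats ^ irep) and (num_repeats - irep) are the same value.
  const int64_t rem = (num_repeats - irep) * len;
  std::memcpy(out + irep * len, out, static_cast<std::size_t>(rem));
  return output;
}
```

The remainder is always smaller than what has already been written, so the final copy never overlaps its source.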







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739611163



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+    const auto& binary_scalar1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    const auto input_string = binary_scalar1.value->data();
+    const auto input_ncodeunits = binary_scalar1.value->size();
+    const auto value2 = UnboxScalar<Type2>::Unbox(*scalar2);
+
+    // Calculate max number of output codeunits
+    const auto max_output_ncodeunits = transform->MaxCodeunits(input_ncodeunits, value2);

Review comment:
       The output size depends on the transform and the input encoding (binary/ASCII/UTF8). Also, `MaxCodeunits()` does not need to calculate the exact output size because [a resizing operation is performed at the end of the kernel exec](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L422), but it must never return less than the actual output size, so that enough space is allocated.
   
   Binary/ASCII transforms that do not change the size (uppercase, title, capitalize, etc.) [use the default `MaxCodeunits()`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L330). On the other hand, the [default `MaxCodeunits()` for UTF8 transforms](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L555) allocates more space for the output.
   
   Some transforms will have different estimates for the output size (as is the case in this PR). This is the first "binary string transform" implemented as such, so I decided to generalize the machinery to support others.
   
   Most importantly, note that [many string transforms implement their own `kernel exec`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1010) and do not use `MaxCodeunits()`. Hopefully, as the variety of patterns in string transforms stabilizes, we can use consistent `kernel execs` that perform similarly.
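
The "allocate an upper bound, transform, then shrink" pattern described above can be sketched without Arrow as follows. Everything here is an assumption for illustration: `MaxCodeunits`, `DropSpaces`, and `AllocateTransformShrink` are hypothetical stand-ins, and a `std::vector` replaces the Arrow buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical upper bound: a transform that only removes bytes can never
// produce more output than it received.
int64_t MaxCodeunits(int64_t input_ncodeunits) { return input_ncodeunits; }

// Hypothetical transform: copies every byte except ASCII spaces and returns
// the number of bytes actually written.
int64_t DropSpaces(const uint8_t* input, int64_t input_ncodeunits,
                   uint8_t* output) {
  uint8_t* out = output;
  for (int64_t i = 0; i < input_ncodeunits; ++i) {
    if (input[i] != ' ') *out++ = input[i];
  }
  return out - output;
}

std::vector<uint8_t> AllocateTransformShrink(const std::string& input) {
  const int64_t ncodeunits = static_cast<int64_t>(input.size());
  const int64_t max_ncodeunits = MaxCodeunits(ncodeunits);

  // 1. Allocate for the worst case.
  std::vector<uint8_t> buffer(static_cast<std::size_t>(max_ncodeunits));

  // 2. Transform; the result may be smaller than the estimate.
  const int64_t written = DropSpaces(
      reinterpret_cast<const uint8_t*>(input.data()), ncodeunits, buffer.data());
  assert(written <= max_ncodeunits);

  // 3. Trim the buffer to the size actually used.
  buffer.resize(static_cast<std::size_t>(written));
  buffer.shrink_to_fit();
  return buffer;
}
```

An estimate that is too large only costs temporary memory, while one that is too small would write past the end of the buffer, which is why the bound has to err on the high side.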







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-955765038


   This PR is failing R dplyr tests ([see here](https://github.com/apache/arrow/runs/4053868784?check_suite_focus=true#step:8:17122)). [`strrep` is bound as a binary expression](https://github.com/apache/arrow/pull/11023/files#diff-ed2774950584af59273e99c303c02aa78aa608d982e739fd02f60145ff242e01R104) but the [test fails to find the `strrep` function](https://github.com/apache/arrow/pull/11023/files#diff-db6c692c9cea1ab0ce5ff089ae635c22182e26bdb95668bb16d64c26e8a3bbf0R475).
   _Note_: `strrep` is the Arrow implementation of the base R function with the same name.
   cc @jonkeane 





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740466490



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// A ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;

Review comment:
       It needs to be named `Transform` because the string binary exec expects each `StringTransform` to provide a `Transform` method with the given signature.
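
To illustrate why the name matters, here is a reduced sketch, assuming a simplified signature without the `Status*` argument. `SimpleLoop`, `MethodTransform`, `CallableTransform`, and `Exec` are hypothetical names. A templated exec that calls `transform->Transform(...)` compiles against a regular method and against a `std::function` data member alike, as long as the member is called `Transform`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Copy loop shared by both hypothetical transforms below.
inline int64_t SimpleLoop(const uint8_t* input, int64_t input_ncodeunits,
                          int64_t num_repeats, uint8_t* output) {
  assert(num_repeats >= 0);
  uint8_t* output_start = output;
  for (int64_t i = 0; i < num_repeats; ++i) {
    std::memcpy(output, input, static_cast<std::size_t>(input_ncodeunits));
    output += input_ncodeunits;
  }
  return output - output_start;
}

// Variant 1: Transform is an ordinary member function.
struct MethodTransform {
  int64_t Transform(const uint8_t* input, int64_t input_ncodeunits,
                    int64_t num_repeats, uint8_t* output) {
    return SimpleLoop(input, input_ncodeunits, num_repeats, output);
  }
};

// Variant 2: Transform is a callable data member, so the implementation can
// be selected at run time (for example in a PreExec-style hook).
struct CallableTransform {
  std::function<int64_t(const uint8_t*, int64_t, int64_t, uint8_t*)> Transform =
      SimpleLoop;
};

// The exec only relies on the expression `transform->Transform(...)`.
template <typename StringTransform>
int64_t Exec(StringTransform* transform, const uint8_t* input,
             int64_t input_ncodeunits, int64_t num_repeats, uint8_t* output) {
  return transform->Transform(input, input_ncodeunits, num_repeats, output);
}
```

Renaming the member in either variant would break `Exec` at compile time, which is the constraint the comment refers to.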







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731389134



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       It seems we cannot use the `VisitBitBlocks` utilities here: when processing an `Array`, the current `StringBinaryTransformExecBase` implementation must set the output string offsets (`output_string_offsets`) at both non-null and null positions, which requires both visitors to know the `position` being visited.
   ```c++
   offset_type output_ncodeunits = 0;
   for (i = 0...) {
     if (!input1.IsNull(i)) {
       ...
       offset_type encoded_bytes = Transform(...);
       ...
       output_ncodeunits += encoded_bytes;
     }
     // This needs to be updated for Null/NotNull visitors
     output_string_offsets[i + 1] = output_ncodeunits;
   }
   ```
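
The loop above can be expanded into a self-contained sketch, assuming plain standard-library containers in place of Arrow arrays. `RepeatedStrings` and `RepeatAll` are hypothetical names, and validity is modeled with a parallel `std::vector<bool>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical output layout mirroring a string array: one offsets entry per
// slot plus one, and a single contiguous values buffer.
struct RepeatedStrings {
  std::vector<int32_t> offsets;
  std::string values;
};

RepeatedStrings RepeatAll(const std::vector<std::string>& inputs,
                          const std::vector<bool>& is_valid,
                          int64_t num_repeats) {
  assert(inputs.size() == is_valid.size());
  RepeatedStrings out;
  out.offsets.reserve(inputs.size() + 1);
  out.offsets.push_back(0);
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    if (is_valid[i]) {
      for (int64_t r = 0; r < num_repeats; ++r) out.values += inputs[i];
    }
    // Written for null and non-null slots alike: a null slot simply repeats
    // the previous offset, giving it zero length.
    out.offsets.push_back(static_cast<int32_t>(out.values.size()));
  }
  return out;
}
```

Because the offset for slot `i + 1` is written on every iteration, a visitor that only reports non-null values without their position would leave the offsets of null slots unset.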







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731465028



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       I tried using [`ArrayDataInlineVisitor`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/visitor_inline.h#L194-L228), but it is implemented for reading offsets, not modifying them.







[GitHub] [arrow] lidavidm commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-947098044


   For 1: why support floating point or boolean arguments in the first place? It seems quite odd. I think those should be explicit casts if the user wants them, and they can choose safe/unsafe cast as appropriate. 
   
   For 2: I don't think there's a clear argument either way. If the difference in behavior is critical in a particular application, the data could always be checked/massaged either way beforehand. Otherwise I would lean towards explicitly erroring. (You could argue that in Python, you're explicitly calling into Arrow and hence it's clear there may be a difference, while in R the user is likely using dplyr and so a difference may not be top-of-mind.)





[GitHub] [arrow] westonpace commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r732311839



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -751,6 +746,108 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeat) {
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  {
+    std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+        {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+        {1,
+         R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])"},
+        {4,
+         R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb", "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+    }};
+
+    for (const auto& pair : nrepeats_and_expected) {
+      auto num_repeat = pair.first;
+      auto expected = pair.second;
+      for (const auto& ty : NumericTypes()) {
+        this->CheckVarArgs("string_repeat",
+                           {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                           this->type(), expected);
+      }
+    }
+  }
+  {
+    // Negative repeat count
+    std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+        {-1, R"(["", null, "", "", "", "", "", "", "", ""])"},
+        {-4, R"(["", null, "", "", "", "", "", "", "", ""])"},
+    }};
+
+    for (const auto& pair : nrepeats_and_expected) {
+      auto num_repeat = pair.first;
+      auto expected = pair.second;
+      for (const auto& ty : SignedIntTypes()) {
+        this->CheckVarArgs("string_repeat",
+                           {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                           this->type(), expected);
+      }
+    }
+  }
+  // {
+  //   // Truncated floating point repeat count
+  //   std::vector<std::pair<double, std::string>> nrepeats_and_expected{{
+  //       {0.9, R"(["", null, "", "", "", "", "", "", "", ""])"},
+  //       {1.8,
+  //        R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$.
+  //        A3", "!ɑⱤⱤow"])"},
+  //       {4.4,
+  //        R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb",
+  //        "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO,
+  //        WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3$. A3",
+  //        "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  //   }};
+  //
+  //   for (const auto& pair : nrepeats_and_expected) {
+  //     auto num_repeat = pair.first;
+  //     auto expected = pair.second;
+  //     for (const auto& ty : FloatingPointTypes()) {
+  //       this->CheckVarArgs("string_repeat",
+  //                          {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+  //                          this->type(), expected);
+  //     }
+  //   }
+  // }

Review comment:
       Should this block stay commented out?







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-947272294


   [Validation of the repeat count occurs here](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2903-R2908), but only when it is a Scalar value. If it is an Array, no validation occurs and the error is delegated to [output allocation](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R776).
   
   I understand we should be consistent, so should we validate repeat count for Arrays as well and accept the performance hit, or should we not validate at all and let the output allocation error out?
   
   It is difficult to error out from inside the transform because it [does not output a `Status`, simply the number of transformed/encoded bytes](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R793), in contrast to the arithmetic kernels, which have a `Status` parameter.
   
   Based on this reasoning, I am leaning towards either doing no validation at all or adding a `Status` parameter to the string transform.
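
To make the second option concrete, here is a minimal sketch of a transform that reports a negative repeat count through a status value rather than through the returned byte count. Everything in it is an assumption for illustration: `SketchStatus`, `RepeatChecked`, `RepeatAll`, and the error message are hypothetical and are not Arrow APIs or code from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a status type; not arrow::Status.
struct SketchStatus {
  bool ok;
  std::string message;
};

SketchStatus StatusOK() { return SketchStatus{true, ""}; }
SketchStatus StatusInvalid(const std::string& msg) { return SketchStatus{false, msg}; }

// Appends `nrepeats` copies of `input` to `output`. The error path is carried
// by the returned status instead of by the number of bytes written.
SketchStatus RepeatChecked(const std::string& input, int64_t nrepeats,
                           std::string* output) {
  if (nrepeats < 0) {
    return StatusInvalid("Repeat count must be a non-negative integer");
  }
  for (int64_t i = 0; i < nrepeats; ++i) {
    output->append(input);
  }
  return StatusOK();
}

// Array-Array shape: each repeat count is validated inside the main loop,
// so no separate validation pass over the repeat counts is needed.
SketchStatus RepeatAll(const std::vector<std::string>& strings,
                       const std::vector<int64_t>& nrepeats,
                       std::vector<std::string>* out) {
  assert(strings.size() == nrepeats.size());
  out->clear();
  for (size_t i = 0; i < strings.size(); ++i) {
    std::string repeated;
    SketchStatus st = RepeatChecked(strings[i], nrepeats[i], &repeated);
    if (!st.ok) return st;  // stop at the first invalid repeat count
    out->push_back(std::move(repeated));
  }
  return StatusOK();
}
```

With this shape the Scalar and Array cases share the same validation, at the cost of one branch per element.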





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706358970



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach

Review comment:
       Ok, will fix the complexity to `O(log2(nrepeats))`
   
   In terms of performance, isolated benchmarks I ran comparing several *copy* approaches showed that the log2 approach is faster for all cases where `nrepeats >= 4`, and for `nrepeats < 4` it was not significantly slower than direct copies. In my initial PR, I had an `if-else` to handle this, but concluded that checking the condition for every value, in addition to maintaining two approaches, was not better.
   
   This circles back to some of my previous comments/ideas, that the Exec methods should provide a mechanism for selecting kernel `Transform/Call` variants based on these higher-level options. More on this very soon.
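
For readers unfamiliar with the log2 approach, the following is a minimal sketch of the doubling-copy idea: seed the output with one copy, then repeatedly copy the already-filled prefix onto itself so that the number of `memcpy` calls grows as `O(log2(nrepeats))`. The helper names `RepeatByDoubling` and `RepeatString` are hypothetical, and treating a non-positive repeat count as an empty result is an assumption of this sketch, not necessarily the PR's final behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper (not the PR's code): writes `nrepeats` copies of
// input[0..len) into `output` and returns the number of bytes written.
// `output` must have room for len * nrepeats bytes.
int64_t RepeatByDoubling(const uint8_t* input, int64_t len, int64_t nrepeats,
                         uint8_t* output) {
  if (nrepeats <= 0 || len <= 0) return 0;
  // Seed the output with a single copy of the input.
  std::memcpy(output, input, static_cast<size_t>(len));
  int64_t copies = 1;
  // Double the filled region while a full doubling still fits.
  while (copies * 2 <= nrepeats) {
    std::memcpy(output + copies * len, output, static_cast<size_t>(copies * len));
    copies *= 2;
  }
  // The remainder is always smaller than the filled prefix, so one more
  // non-overlapping copy from the start of the output finishes the job.
  const int64_t remaining = nrepeats - copies;
  std::memcpy(output + copies * len, output, static_cast<size_t>(remaining * len));
  return nrepeats * len;
}

// Convenience wrapper for experimenting with std::string inputs.
std::string RepeatString(const std::string& s, int64_t nrepeats) {
  if (nrepeats <= 0 || s.empty()) return std::string();
  std::string out(s.size() * static_cast<size_t>(nrepeats), '\0');
  const int64_t written = RepeatByDoubling(
      reinterpret_cast<const uint8_t*>(s.data()), static_cast<int64_t>(s.size()),
      nrepeats, reinterpret_cast<uint8_t*>(&out[0]));
  assert(written == static_cast<int64_t>(out.size()));
  (void)written;  // silence unused-variable warnings when NDEBUG is defined
  return out;
}
```

For example, `RepeatString("ab", 5)` performs the seed copy, two doubling copies (reaching four repetitions), and one remainder copy.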







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-908253348


   I agree that this is a binary kernel because the number of repeats is required.





[GitHub] [arrow] ursabot edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-961085402


   Benchmark runs are scheduled for baseline = 5897217ec5ee6f4f58373362a76a70618921c128 and contender = 0ead7c906dafb73c2b2829681845fe5a808a54e9. 0ead7c906dafb73c2b2829681845fe5a808a54e9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/889ba7b7a56b485ea9df25c008235283...682c7eeb6129469c84e53494bb85219e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/091cc986a04b485bbb53f8d006dd066b...0fd03cce4b804a98883de9f7b6c658c7/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/32da2cf3bb9448869c2ef7f8003106a9...5c488ba9051d4cee94d0ec52741539fb/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   





[GitHub] [arrow] ursabot edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-961085402


   Benchmark runs are scheduled for baseline = 5897217ec5ee6f4f58373362a76a70618921c128 and contender = 0ead7c906dafb73c2b2829681845fe5a808a54e9. 0ead7c906dafb73c2b2829681845fe5a808a54e9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/889ba7b7a56b485ea9df25c008235283...682c7eeb6129469c84e53494bb85219e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/091cc986a04b485bbb53f8d006dd066b...0fd03cce4b804a98883de9f7b6c658c7/)
   [Finished :arrow_down:1.25% :arrow_up:0.89%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/32da2cf3bb9448869c2ef7f8003106a9...5c488ba9051d4cee94d0ec52741539fb/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r742160225



##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       There is a minor difference in behavior when given a negative repeat count:
   * `binary_repeat` and `strrep` return an error
   * `str_dup` returns `NA`







[GitHub] [arrow] westonpace commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r738333645



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -402,16 +401,16 @@ struct StringTransformExecBase {
     if (!input.is_valid) {
       return Status::OK();
     }
-    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
-    result->is_valid = true;
     const int64_t data_nbytes = static_cast<int64_t>(input.value->size());
-
     const int64_t output_ncodeunits_max = transform->MaxCodeunits(1, data_nbytes);
     if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
       return Status::CapacityError(
           "Result might not fit in a 32bit utf8 array, convert to large_utf8");
     }
+
     ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());

Review comment:
       Nit: You don't really have to, and I don't know how this could possibly fail, but I'm in the habit of adding `DCHECK_NE(result, nullptr);` after any `checked_cast`.
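
The habit can be illustrated with a small stand-alone sketch. It assumes a debug build in which the checked downcast behaves like `dynamic_cast` and therefore yields `nullptr` for a mismatched pointer type. The types and the `DebugCheckedCast` helper below are hypothetical stand-ins, not Arrow's classes, and the explicit null test plays the role that `DCHECK_NE(result, nullptr)` would play.

```cpp
#include <cassert>
#include <string>

// Hypothetical scalar hierarchy, only for this sketch.
struct Scalar {
  virtual ~Scalar() = default;
  bool is_valid = false;
};
struct BaseBinaryScalar : Scalar {
  std::string value;
};
struct Int64Scalar : Scalar {
  long long value = 0;
};

// Stand-in for a checked downcast as it would behave in a debug build:
// a dynamic_cast that returns nullptr when the dynamic type does not match.
template <typename To, typename From>
To* DebugCheckedCast(From* from) {
  return dynamic_cast<To*>(from);
}

// Marks the output scalar valid, guarding against a failed downcast first.
bool MarkValid(Scalar* out) {
  auto* result = DebugCheckedCast<BaseBinaryScalar>(out);
  if (result == nullptr) {
    return false;  // the spot where a debug-only null check would fire
  }
  result->is_valid = true;
  return true;
}
```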

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);

Review comment:
       Ah, never mind, I see now that it is overridden if the kernel is lengthening the string.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);

Review comment:
       This may just be my ignorance of these kinds of generators, but it isn't obvious to me what the return value represents.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+    const auto& binary_scalar1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    const auto input_string = binary_scalar1.value->data();
+    const auto input_ncodeunits = binary_scalar1.value->size();
+    const auto value2 = UnboxScalar<Type2>::Unbox(*scalar2);
+
+    // Calculate max number of output codeunits
+    const auto max_output_ncodeunits = transform->MaxCodeunits(input_ncodeunits, value2);

Review comment:
       This is more of a learning question for me, but how can you calculate the output size based on the input? Don't some kernels output strings that are longer than the input?
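
One possible reading, sketched under assumptions: the base class returns the input size as a default upper bound, and a kernel that lengthens strings overrides `MaxCodeunits` to return a larger bound, which is then compared against the offset type's limit before allocating. The free functions below are hypothetical and only mirror the arithmetic seen in the draft `StrRepeatTransform`. They do not guard against `int64_t` overflow in the multiplication.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Upper bound on the output size in bytes when a string of
// `input_ncodeunits` bytes is repeated `nrepeats` times.
// Negative repeat counts clamp to zero, as in the draft kernel.
int64_t MaxRepeatCodeunits(int64_t input_ncodeunits, int64_t nrepeats) {
  return std::max(input_ncodeunits * nrepeats, int64_t(0));
}

// Returns true when the bound fits in the offset type of the output array
// (int32_t for utf8/binary, int64_t for large_utf8/large_binary).
template <typename OffsetType>
bool FitsInOffsets(int64_t max_output_ncodeunits) {
  return max_output_ncodeunits <=
         static_cast<int64_t>(std::numeric_limits<OffsetType>::max());
}
```

A caller would compute the bound first and return a capacity error instead of allocating when `FitsInOffsets` is false.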

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }

Review comment:
       Minor nit: Maybe name this `InvalidUtf8Sequence()`

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");

Review comment:
       This probably won't happen too often, but could you include the two "kinds" that led to this? For example, it would be much clearer for the user to see "Invalid combination of operands for binary string transform fn-name (array, scalar)".
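   
   A minimal standalone sketch of a message builder along these lines. The `OperandKind` enum and both helper names are hypothetical stand-ins for illustration, not Arrow APIs; a real kernel would derive the kinds from the `ExecBatch` values.
   
   ```cpp
   #include <string>
   
   // Hypothetical stand-in for the two operand shapes discussed in this thread.
   enum class OperandKind { kScalar, kArray };
   
   inline const char* KindName(OperandKind kind) {
     return kind == OperandKind::kScalar ? "scalar" : "array";
   }
   
   // Builds a message that names the function and the rejected combination.
   inline std::string InvalidOperandsMessage(const std::string& func_name,
                                             OperandKind first, OperandKind second) {
     return "Invalid combination of operands for binary string transform " + func_name +
            " (" + KindName(first) + ", " + KindName(second) + ")";
   }
   ```
   
   The point is only that the function name and both kinds end up in the text, so the user can see which combination was rejected.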







[GitHub] [arrow] westonpace commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739537974



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }

Review comment:
       If it is an existing pattern, then please do not fix it in this PR.  My naive thought would be that `st` should be used for this purpose.







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739611163



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +536,341 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  //
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables argument shapes with
+  // mixed scalar/array.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+
+  // Tracks status of transform in StringBinaryTransformExecBase.
+  // The purpose of this transform status is to provide a means to report/detect
+  // errors in functions that do not provide a mechanism to return a Status
+  // value but can still detect errors. This status is checked automatically
+  // after MaxCodeunits() and Transform() operations.
+  Status st = Status::OK();
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output);
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+    const auto& binary_scalar1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    const auto input_string = binary_scalar1.value->data();
+    const auto input_ncodeunits = binary_scalar1.value->size();
+    const auto value2 = UnboxScalar<Type2>::Unbox(*scalar2);
+
+    // Calculate max number of output codeunits
+    const auto max_output_ncodeunits = transform->MaxCodeunits(input_ncodeunits, value2);

Review comment:
       The output size depends on the transform and the input encoding (binary/ASCII/UTF8). Also, `MaxCodeunits()` does not need to calculate the exact output size, because [a resizing operation is performed at the end of the kernel exec](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L422); it only needs to guarantee that enough space is allocated, never less.
   
   Binary/ASCII transforms that do not change the size (uppercase, title, capitalize, etc.) [use the default `MaxCodeunits()`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L330). On the other hand, the [default `MaxCodeunits()` for UTF8 transforms](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L555) allocates more for the output.
   
   Some transforms will have different estimates for the output size (as is the case in this PR, so `MaxCodeunits()` is overridden). This is the first "binary string transform" implemented as such, so I decided to generalize the machinery in order to support other ones.
   
   Most importantly, note that [many string transforms implement their own `kernel exec`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1010) and do not use `MaxCodeunits()`. Hopefully, as the variety of patterns in string transforms stabilizes, we can use consistent `kernel execs` without incurring performance penalties.
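   
   To illustrate the "allocate an upper bound, transform, then shrink" flow in isolation, here is a minimal sketch that uses `std::string` instead of Arrow buffers. The transform (dropping ASCII spaces) and all function names are hypothetical, chosen only because the bound over-estimates the real output size.
   
   ```cpp
   #include <cstdint>
   #include <string>
   
   // Hypothetical transform: copies `input` to `output` without ASCII spaces
   // and returns the number of bytes written.
   inline int64_t TransformDropSpaces(const char* input, int64_t input_nbytes,
                                      char* output) {
     char* out = output;
     for (int64_t i = 0; i < input_nbytes; ++i) {
       if (input[i] != ' ') *out++ = input[i];
     }
     return out - output;
   }
   
   // Upper bound: dropping bytes can never grow the string, so the input size
   // is always enough (it may be more than needed, but never less).
   inline int64_t MaxCodeunitsDropSpaces(int64_t input_nbytes) { return input_nbytes; }
   
   inline std::string DropSpaces(const std::string& input) {
     const auto input_nbytes = static_cast<int64_t>(input.size());
     // 1. Allocate for the worst case.
     std::string output(static_cast<size_t>(MaxCodeunitsDropSpaces(input_nbytes)), '\0');
     // 2. Run the transform and learn the actual output size.
     const int64_t written = TransformDropSpaces(input.data(), input_nbytes, &output[0]);
     // 3. Shrink to the bytes actually written.
     output.resize(static_cast<size_t>(written));
     return output;
   }
   ```
   
   An over-estimate only costs temporary memory; an under-estimate would make the transform write past the end of the allocation.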







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740351649



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",
+                           " (", batch[0].ToString(), ", ", batch[1].ToString(),
+                           "). Only Array/Scalar kinds are supported.");

Review comment:
       Since it is not known a priori which combinations are valid, I decided to remove the second statement from the error message.







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740687132



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;
+
+  static int64_t TransformSimple(const uint8_t* input,
+                                 const int64_t input_string_ncodeunits,
+                                 const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static int64_t TransformDoubling(const uint8_t* input,
+                                   const int64_t input_string_ncodeunits,
+                                   const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;
+    std::memcpy(output, output_start, rem);
+    output += rem;
+    return output - output_start;
+  }
+
+  static int64_t TransformWrapper(const uint8_t* input,
+                                  const int64_t input_string_ncodeunits,
+                                  const int64_t num_repeats, uint8_t* output,
+                                  Status* st) {
+    auto transform = (num_repeats < 4) ? TransformSimple : TransformDoubling;
+    return transform(input, input_string_ncodeunits, num_repeats, output, st);
+  }
+
+  Status PreExec(KernelContext*, const ExecBatch& batch, Datum*) override {
+    // For cases with a scalar repeat count, select the best implementation once
+    // before execution. Otherwise, use TransformWrapper to select implementation
+    // when processing each value.

Review comment:
       Using `std::function` indirection made the kernel 2x slower, so good call/intuition on this one.
   ```
   StringRepeat_mean    622822087 ns    622817699 ns           10 bytes_per_second=25.4509M/s items_per_second=1.68498M/s
   StringRepeat_median  623393064 ns    623390528 ns           10 bytes_per_second=25.4067M/s items_per_second=1.68205M/s
   StringRepeat_stddev   18771545 ns     18770743 ns           10 bytes_per_second=787.511k/s items_per_second=50.9153k/s
   ```
   Checking `num_repeats < 4` at each iteration
   ```
   StringRepeat_mean    313125674 ns    313123902 ns           10 bytes_per_second=50.601M/s items_per_second=3.35004M/s
   StringRepeat_median  312795031 ns    312794088 ns           10 bytes_per_second=50.6405M/s items_per_second=3.35266M/s
   StringRepeat_stddev    6484104 ns      6484645 ns           10 bytes_per_second=1068k/s items_per_second=69.0502k/s
   ```
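   
   For readers following along, the two dispatch styles compared above look roughly like this in isolation. This is a sketch, not the PR's code: the two implementations are hypothetical stand-ins that produce the same result, and the sketch makes no performance claim of its own.
   
   ```cpp
   #include <cstdint>
   #include <functional>
   #include <string>
   
   // Two hypothetical implementations with the same signature and result.
   inline std::string RepeatAppend(const std::string& s, int64_t n) {
     std::string out;
     for (int64_t i = 0; i < n; ++i) out += s;
     return out;
   }
   
   inline std::string RepeatReserve(const std::string& s, int64_t n) {
     std::string out;
     if (n > 0) out.reserve(s.size() * static_cast<size_t>(n));
     for (int64_t i = 0; i < n; ++i) out += s;
     return out;
   }
   
   using RepeatFn = std::function<std::string(const std::string&, int64_t)>;
   
   // Style A: pick the implementation once and store it in a type-erased
   // std::function; every later call goes through that indirection.
   inline RepeatFn SelectOnce(int64_t n) {
     if (n < 4) return RepeatAppend;
     return RepeatReserve;
   }
   
   // Style B: branch on the repeat count at every call and call directly.
   inline std::string SelectPerCall(const std::string& s, int64_t n) {
     return (n < 4) ? RepeatAppend(s, n) : RepeatReserve(s, n);
   }
   ```
   
   Style A corresponds to choosing the implementation before execution; style B corresponds to checking `num_repeats < 4` for each value.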

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -19,6 +19,7 @@
 #include <cctype>
 #include <iterator>
 #include <string>
+#include <typeinfo>

Review comment:
       No, I used it when trying to print the `StringTransform` type using `typeid(t).name()` but it printed more info than needed.

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       Not for this PR, so I will revert. Nevertheless, I have noticed that several files are missing includes and probably carry some unneeded ones. I think this should be its own JIRA issue.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2878,135 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1_ncodeunits * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const ArrayType2& input2) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1.total_values_length() * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const ArrayType2& input2) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  static Result<int64_t> TransformSimpleLoop(const uint8_t* input,
+                                             const int64_t input_string_ncodeunits,
+                                             const int64_t num_repeats, uint8_t* output) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static Result<int64_t> TransformDoublingString(const uint8_t* input,
+                                                 const int64_t input_string_ncodeunits,
+                                                 const int64_t num_repeats,
+                                                 uint8_t* output) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;

Review comment:
       Not really, `xor` is just representing `mod 2` but in this case subtraction is also valid.
   Changed it to subtraction and renamed variable to `irep` for improved readability.
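   
   A standalone sketch of the doubling copy with the subtraction-based remainder, assuming non-overlapping input and output buffers. The function names and the `std::string` wrapper are hypothetical. When the loop exits, `irep` is the largest power of two not exceeding `num_repeats`, which is why `num_repeats - irep` and `num_repeats ^ irep` give the same value here.
   
   ```cpp
   #include <cstdint>
   #include <cstring>
   #include <string>
   
   // Writes `num_repeats` copies of `input` into `output` and returns the number
   // of bytes written. `output` must hold input_nbytes * num_repeats bytes.
   inline int64_t RepeatByDoubling(const uint8_t* input, int64_t input_nbytes,
                                   int64_t num_repeats, uint8_t* output) {
     if (num_repeats <= 0 || input_nbytes <= 0) return 0;
     uint8_t* output_start = output;
     // Seed the output with a single copy.
     std::memcpy(output, input, static_cast<size_t>(input_nbytes));
     output += input_nbytes;
     // Double what has been written so far: 1 -> 2 -> 4 -> ... copies.
     int64_t irep = 1;  // number of copies currently in the output
     for (int64_t ilen = input_nbytes; irep <= num_repeats / 2; irep *= 2, ilen *= 2) {
       std::memcpy(output, output_start, static_cast<size_t>(ilen));
       output += ilen;
     }
     // Epilogue: the remaining copies (fewer than irep) come from the output prefix.
     const int64_t rem = (num_repeats - irep) * input_nbytes;
     std::memcpy(output, output_start, static_cast<size_t>(rem));
     output += rem;
     return output - output_start;
   }
   
   // Convenience wrapper for experimenting with std::string.
   inline std::string RepeatString(const std::string& s, int64_t n) {
     if (n <= 0 || s.empty()) return std::string();
     std::string out(s.size() * static_cast<size_t>(n), '\0');
     const int64_t written = RepeatByDoubling(
         reinterpret_cast<const uint8_t*>(s.data()), static_cast<int64_t>(s.size()), n,
         reinterpret_cast<uint8_t*>(&out[0]));
     out.resize(static_cast<size_t>(written));
     return out;
   }
   ```
   
   For `num_repeats = 7` the loop leaves `irep = 4`, and the epilogue copies the remaining 3 repetitions from the start of the output.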

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeatWithScalarRepeat) {

Review comment:
       Yes, it is implicit in `CheckVarArgs`. `CheckVarArgs` invokes [`CheckScalar`, which internally calls the function for each scalar input](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/test_util.cc#L127).







[GitHub] [arrow] ianmcook commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741576370



##########
File path: r/tests/testthat/test-dplyr-funcs-string.R
##########
@@ -467,6 +467,18 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("strrep", {
+  df <- tibble(x = c("foo1", " \tB a R\n", "!apACHe aRroW!"))
+  for (times in 0:8L) {
+    compare_dplyr_binding(
+      .input %>%
+        mutate(x = strrep(x, times)) %>%
+        collect(),
+      df
+    )
+  }
+})
+

Review comment:
       Adds a test for the `str_dup()` binding I suggested above. Also FYI you don't need the `L` after `8` because the `:` operator in R always creates integer vectors when its operands are whole numbers.
   ```suggestion
   test_that("strrep, str_dup", {
     df <- tibble(x = c("foo1", " \tB a R\n", "!apACHe aRroW!"))
     for (times in 0:8) {
       compare_dplyr_binding(
         .input %>%
           mutate(x = strrep(x, times)) %>%
           collect(),
         df
       )
       compare_dplyr_binding(
         .input %>%
           mutate(x = str_dup(x, times)) %>%
           collect(),
         df
       )
     }
   })
   ```







[GitHub] [arrow] ianmcook commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741575436



##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       `strrep()` is a function in base R. There is also a function [`str_dup()`](https://stringr.tidyverse.org/reference/str_dup.html) in the popular R package **stringr** that does exactly the same thing. In the R bindings we often like to add these **stringr** variants of the functions too:
   ```suggestion
     "strrep" = "binary_repeat",
     "str_dup" = "binary_repeat"
   ```







[GitHub] [arrow] lidavidm commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741223669



##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       We have a tool called include-what-you-use, though I don't think it's been run in a while. It might be good to give that a try again. (IIRC, it's a bit finicky to set up.)







[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907569782


   Related to a replicate operation, there was a [previous discussion in Zulip chat](https://ursalabs.zulipchat.com/#narrow/stream/271283-help.2Fc.2B.2B/topic/util.20to.20copy.20arrays.20to.20an.20existing.20buffer) of having a general replicate functionality where string repeat is a particular case.
   
   Arrow already has [`MakeArrayFromScalar`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L742) and [`RepeatedArrayFactory`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc#L493), which use the [concatenate implementation](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/concatenate.cc) internally. Can these be used in this PR?





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731491178



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       OK, so I got `ArrayDataInlineVisitor` working.







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706345457



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");

Review comment:
       I copied the same message as in the "unary" `StringTransformExecBase`. Note that the term _array_ in this context refers to the buffer holding the string value, which is allocated with the size returned by `MaxCodeunits()`.
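
A minimal sketch of the capacity check discussed here, assuming the offsets are 32-bit signed integers for the regular string type and 64-bit for the large one. `FitsInOffsets` is a hypothetical helper written for illustration; it is not an Arrow function.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Arrow): returns true when an upper bound on
// the output size, in bytes, can still be addressed by offsets of OffsetType.
template <typename OffsetType>
bool FitsInOffsets(int64_t output_ncodeunits_max) {
  return output_ncodeunits_max >= 0 &&
         output_ncodeunits_max <=
             static_cast<int64_t>(std::numeric_limits<OffsetType>::max());
}
```

With `int32_t` offsets the largest addressable buffer is 2147483647 bytes, which is why the message suggests converting to `large_utf8` when the bound is exceeded.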







[GitHub] [arrow] pitrou commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-920976700


   Feel free to undraft when this is ready @edponce .





[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907569782


   Related to a replicate operation, there was a [previous discussion in Zulip chat](https://ursalabs.zulipchat.com/#narrow/stream/271283-help.2Fc.2B.2B/topic/util.20to.20copy.20arrays.20to.20an.20existing.20buffer) of having a general replicate functionality where string repeat is a particular case.





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706358970



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach

Review comment:
       Ok, will fix the complexity to `O(log2(nrepeats))`
   
   In terms of performance, based on isolated benchmarks I ran comparing several *copy* approaches, the log2 approach is faster for all cases where `nrepeats >= 4`, and for `nrepeats < 4` it was not significantly slower than direct copies. [In my initial PR, I had an `if-else` to handle this](https://github.com/apache/arrow/pull/11023/commits/a0e327d2751137a8b7d47ad524c848eef65066ff#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2596-R2612), but I thought that having the condition check for all values, in addition to having two approaches, was not better.
   
   This circles back to some of my previous comments/ideas, that the Exec methods should provide a mechanism for selecting kernel `Transform/Call` variants based on these higher-level options. More on this very soon.
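
To make the log2 approach concrete, here is a standalone sketch; it is not the code from this PR. It repeats a buffer by copying the already-written prefix onto the unwritten tail, so the number of copy calls grows roughly with log2(nrepeats) while the total number of bytes written stays `input.size() * nrepeats`. The name `RepeatByDoubling` and the use of `std::string` are assumptions made for illustration; returning an empty string for non-positive counts mirrors the `nrepeats > 0` guard in the diff above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Illustrative only: repeat `input` `nrepeats` times by doubling the region
// that has already been written.
std::string RepeatByDoubling(const std::string& input, int64_t nrepeats) {
  if (nrepeats <= 0 || input.empty()) return std::string();
  const size_t total = input.size() * static_cast<size_t>(nrepeats);
  std::string output(total, '\0');
  // Seed the output with a single copy of the input.
  std::memcpy(&output[0], input.data(), input.size());
  size_t written = input.size();
  while (written < total) {
    // Copy as much of the existing prefix as still fits; source and
    // destination never overlap because chunk <= written.
    const size_t chunk = std::min(written, total - written);
    std::memcpy(&output[written], output.data(), chunk);
    written += chunk;
  }
  return output;
}
```

For `nrepeats = 5` this issues one seed copy plus three doubling copies (2, 4 and 2 bytes for a 2-byte input) instead of five fixed-size copies.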







[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-907701399


   @lidavidm Those `ArrayBuilder` methods can perform this operation, but using them would mean departing from the common approach used for string kernels, which is based on the existing [`StringTransformXXX` infrastructure](https://github.com/edponce/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L314). Specifically, it would require overriding [`ExecArray()`](https://github.com/edponce/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L327) (while duplicating most of it). As things currently stand, I think using the `ArrayBuilder/MakeScalar` methods for `StrRepeat` is not preferable.
   
   Also, note that the current `StrRepeat` implementation allocates the entire array for all repeated strings only once, via [`ExecArray()`](https://github.com/edponce/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L335-L343). `StrRepeat` overrides `MaxCodeunits()` to return `input_ncodeunits * n_repeats`.
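
The single-allocation idea can be sketched without Arrow as follows. `StringColumn` and `RepeatAll` are hypothetical names used only for this illustration: the values buffer is reserved once from an upper bound analogous to `MaxCodeunits()`, and the offsets are filled in while the strings are written.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a string array: one contiguous values buffer
// plus offsets, where string i spans [offsets[i], offsets[i + 1]).
struct StringColumn {
  std::string values;
  std::vector<int32_t> offsets{0};
};

StringColumn RepeatAll(const std::vector<std::string>& inputs, int64_t nrepeats) {
  int64_t input_ncodeunits = 0;
  for (const auto& s : inputs) input_ncodeunits += static_cast<int64_t>(s.size());
  // Upper bound on the output size: total input bytes times the repeat count.
  const int64_t max_ncodeunits = std::max<int64_t>(input_ncodeunits * nrepeats, 0);

  StringColumn out;
  out.values.reserve(static_cast<size_t>(max_ncodeunits));  // the only allocation
  out.offsets.reserve(inputs.size() + 1);
  for (const auto& s : inputs) {
    for (int64_t r = 0; r < nrepeats; ++r) out.values += s;
    out.offsets.push_back(static_cast<int32_t>(out.values.size()));
  }
  return out;
}
```

The sketch omits the check that the bound fits in 32-bit offsets. With a single scalar repeat count the bound is exact, so no trimming is needed afterwards.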





[GitHub] [arrow] jorisvandenbossche commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-947510017


   BTW, in Python too, "repeating" a string with a float raises an error:
   
   ```
   In [14]: "a" * 2.5
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   <ipython-input-14-0a372606199b> in <module>
   ----> 1 "a" * 2.5
   ```
   
   (For a negative integer it indeed returns an empty string, but I agree we should not necessarily follow that, and I would also expect an error.)
   
   In pandas' `Series.str.repeat`, if you use a float it actually results in missing values. But I would say that's a bug in pandas and that it should raise an error instead (it happens because, in the default implementation of applying a function to each string, we catch errors and return a missing value for that string).





[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-953373581


   The new changes capture invalid repeat counts early on (in [`PreExec` when Scalar](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2963-R2967) and in [`MaxCodeunits` when Array](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2888)). Note that to capture these errors and propagate them up to the `ExecXXX` class/methods [here](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R663) and [here](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R676), [a `Status` data member was added to `StringBinaryTransformBase`](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R604). The alternative would be to add a `Status` parameter to `MaxCodeunits()` and `Transform()`, which, now that I think of it, makes more sense.
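
A rough, Arrow-free sketch of validating repeat counts before any output is produced. It assumes that "invalid" means negative, which this comment does not state explicitly. `SimpleStatus` and `ValidateRepeatCounts` are hypothetical names; Arrow's real `Status` class is richer than this stand-in.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, minimal stand-in for a status object.
struct SimpleStatus {
  bool ok;
  std::string message;
};

// Check every repeat count up front so that an invalid value is reported
// once, before any buffer is allocated or any string is transformed.
SimpleStatus ValidateRepeatCounts(const std::vector<int64_t>& nrepeats) {
  for (size_t i = 0; i < nrepeats.size(); ++i) {
    if (nrepeats[i] < 0) {  // assumption: negative counts are the invalid case
      return {false, "Repeat count must be a non-negative integer, got " +
                         std::to_string(nrepeats[i]) + " at index " +
                         std::to_string(i)};
    }
  }
  return {true, ""};
}
```

Returning the status from the validating function, rather than storing it in a data member, corresponds to the alternative of threading a `Status` through the calls.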





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741755299



##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       Good to know. Thanks!







[GitHub] [arrow] pitrou commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-957717084


   Hmm, right now the only compute function with a name starting with `string_` is `string_is_ascii`, and it's string-only. Functions which take both binary and string are generally named `binary_something`.
   
   (not saying this is a great naming scheme, but this is what we've been doing and it might be better to remain consistent :-))





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706434101



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {

Review comment:
       No, because the `ExecXXX` methods in `StringBinaryTransformExecBase` cannot deduce the type of the second parameter of the compute function, only whether it is a Scalar or an Array.







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706434526



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach

Review comment:
       I added a faster implementation for short repeats (which may be the common case).
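
One way such a short-repeat fast path could look, as a standalone sketch rather than the code added to the PR. The threshold of 4 is an assumption borrowed from the benchmark remark earlier in this thread, and `RepeatHybrid` and `kShortRepeatThreshold` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Assumed cut-over point; a real value should come from benchmarks.
constexpr int64_t kShortRepeatThreshold = 4;

std::string RepeatHybrid(const std::string& input, int64_t nrepeats) {
  if (nrepeats <= 0 || input.empty()) return std::string();
  const size_t len = input.size();
  const size_t total = len * static_cast<size_t>(nrepeats);
  std::string output(total, '\0');
  if (nrepeats < kShortRepeatThreshold) {
    // Short repeats: one direct copy of the input per repetition.
    for (int64_t r = 0; r < nrepeats; ++r) {
      std::memcpy(&output[static_cast<size_t>(r) * len], input.data(), len);
    }
  } else {
    // Longer repeats: seed one copy, then keep doubling the written prefix.
    std::memcpy(&output[0], input.data(), len);
    size_t written = len;
    while (written < total) {
      const size_t chunk = std::min(written, total - written);
      std::memcpy(&output[written], output.data(), chunk);
      written += chunk;
    }
  }
  return output;
}
```

When the repeat count is a scalar, the branch could be decided once per batch rather than once per string.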







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-952357402


   @jorisvandenbossche Thanks for the observation on `float` types. Currently, the C++ string repeat kernels only accept integer values for the repeat count, so if a `float` is provided then it will fail to find a kernel with such support.





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r739524798



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -402,16 +401,16 @@ struct StringTransformExecBase {
     if (!input.is_valid) {
       return Status::OK();
     }
-    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
-    result->is_valid = true;
     const int64_t data_nbytes = static_cast<int64_t>(input.value->size());
-
     const int64_t output_ncodeunits_max = transform->MaxCodeunits(1, data_nbytes);
     if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
       return Status::CapacityError(
           "Result might not fit in a 32bit utf8 array, convert to large_utf8");
     }
+
     ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto* result = checked_cast<BaseBinaryScalar*>(out->scalar().get());

Review comment:
       After a bit more thought, it is not that easy to enforce because in many cases the pointer is dereferenced beforehand:
   ```c++
   const auto& obs = checked_cast<const Type&>(*some_var);
   ```
   So the more general question is: when should pointers be checked for nullity? Should we check everywhere a raw pointer is accessed?
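
To illustrate the difference between the two cast forms, here is a small sketch using plain `dynamic_cast`. It assumes `checked_cast` behaves like `dynamic_cast` in debug builds, which is not confirmed in this thread; `Base`, `Derived` and `Other` are hypothetical types.

```cpp
#include <typeinfo>

// Hypothetical class hierarchy used only for this illustration.
struct Base {
  virtual ~Base() = default;
};
struct Derived : Base {
  int value = 42;
};
struct Other : Base {};

// Pointer form: a failed cast yields nullptr, which the caller can test.
bool PointerCastSucceeds(const Base& object) {
  const Derived* typed = dynamic_cast<const Derived*>(&object);
  return typed != nullptr;
}

// Reference form: there is no null reference, so a failed cast throws
// std::bad_cast instead of returning a value that could be checked.
bool ReferenceCastSucceeds(const Base& object) {
  try {
    const Derived& typed = dynamic_cast<const Derived&>(object);
    (void)typed;
    return true;
  } catch (const std::bad_cast&) {
    return false;
  }
}
```

In the reference form the pointer has already been dereferenced before the cast runs, so a `nullptr` check after the cast cannot catch a null input; any such check would have to come first.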







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r732325339



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -751,6 +746,108 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeat) {
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  {
+    std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+        {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+        {1,
+         R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])"},
+        {4,
+         R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb", "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+    }};
+
+    for (const auto& pair : nrepeats_and_expected) {
+      auto num_repeat = pair.first;
+      auto expected = pair.second;
+      for (const auto& ty : NumericTypes()) {
+        this->CheckVarArgs("string_repeat",
+                           {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                           this->type(), expected);
+      }
+    }
+  }
+  {
+    // Negative repeat count
+    std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+        {-1, R"(["", null, "", "", "", "", "", "", "", ""])"},
+        {-4, R"(["", null, "", "", "", "", "", "", "", ""])"},
+    }};
+
+    for (const auto& pair : nrepeats_and_expected) {
+      auto num_repeat = pair.first;
+      auto expected = pair.second;
+      for (const auto& ty : SignedIntTypes()) {
+        this->CheckVarArgs("string_repeat",
+                           {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                           this->type(), expected);
+      }
+    }
+  }
+  // {
+  //   // Truncated floating point repeat count
+  //   std::vector<std::pair<double, std::string>> nrepeats_and_expected{{
+  //       {0.9, R"(["", null, "", "", "", "", "", "", "", ""])"},
+  //       {1.8,
+  //        R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$.
+  //        A3", "!ɑⱤⱤow"])"},
+  //       {4.4,
+  //        R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb",
+  //        "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO,
+  //        WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3$. A3",
+  //        "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  //   }};
+  //
+  //   for (const auto& pair : nrepeats_and_expected) {
+  //     auto num_repeat = pair.first;
+  //     auto expected = pair.second;
+  //     for (const auto& ty : FloatingPointTypes()) {
+  //       this->CheckVarArgs("string_repeat",
+  //                          {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+  //                          this->type(), expected);
+  //     }
+  //   }
+  // }

Review comment:
       These are tests for floating-point repeat counts, a functionality that is not yet supported. Since we are no longer going to support such implicit casts, I removed this code block in my current local version.







[GitHub] [arrow] lidavidm commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741054324



##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       nit: is this include necessary?

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       We have a tool called include-what-you-use, though I don't think it's been run in a while. It might be good to give that a try again. (IIRC, it's a bit finicky to set up.)

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       https://arrow.apache.org/docs/developers/cpp/development.html#cleaning-includes-with-include-what-you-use-iwyu







[GitHub] [arrow] pitrou commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740238298



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,69 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeat) {
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1,
+       R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])"},
+      {4,
+       R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb", "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},

Review comment:
       Can you perhaps wrap this long line?
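
One possible way to wrap such a literal, sketched under the assumption that splitting it is acceptable for the test: adjacent string literals, raw or ordinary, are concatenated by the compiler, so the JSON can be broken between two elements. The function name `WrappedExpected` and the shortened JSON are illustrative only.

```c++
#include <string>

// Adjacent string literals are concatenated at compile time, so a long JSON
// literal can be split between any two elements without changing its value.
inline std::string WrappedExpected() {
  return R"(["bbbb", null, "", )"
         R"("$. A3$. A3"])";
}
```

The returned value is the single string `["bbbb", null, "", "$. A3$. A3"]`; the trailing space of the first piece must be kept if the separator is expected in the output.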

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();

Review comment:
       Hmm, why wouldn't `MaxCodeunits` return a `Result<int64_t>`?
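
A rough sketch of the difference being asked about. The `Status` and `Result` classes below are deliberately tiny, hypothetical stand-ins for `arrow::Status` and `arrow::Result<T>`, and the two free functions are simplified from the scalar-array overload in the hunk (a `std::vector<int64_t>` replaces the array type). With the out-parameter, the returned integer is meaningless on error and the caller has to remember to inspect `st`; with a `Result`, value and error travel in one return type.

```c++
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal stand-in for arrow::Status (empty message means OK).
struct Status {
  std::string message;
  bool ok() const { return message.empty(); }
  static Status OK() { return Status{}; }
  static Status Invalid(std::string msg) { return Status{std::move(msg)}; }
};

// Hypothetical, minimal stand-in for arrow::Result<T>.
template <typename T>
class Result {
 public:
  Result(T value) : value_(std::move(value)) {}          // success
  Result(Status status) : status_(std::move(status)) {}  // failure
  bool ok() const { return status_.ok(); }
  const Status& status() const { return status_; }
  const T& ValueOrDie() const { return value_; }

 private:
  T value_{};
  Status status_;
};

// Out-parameter style, as in the hunk: on error the return value carries no
// useful size and `*st` must be checked separately.
inline int64_t MaxCodeunitsOutParam(int64_t input_ncodeunits,
                                    const std::vector<int64_t>& repeats, Status* st) {
  int64_t total = 0;
  for (int64_t n : repeats) {
    if (n < 0) {
      *st = Status::Invalid("Repeat count must be a non-negative integer");
      return n;
    }
    total += n;
  }
  return input_ncodeunits * total;
}

// Result style: either a size or an error, never both.
inline Result<int64_t> MaxCodeunitsResult(int64_t input_ncodeunits,
                                          const std::vector<int64_t>& repeats) {
  int64_t total = 0;
  for (int64_t n : repeats) {
    if (n < 0) return Status::Invalid("Repeat count must be a non-negative integer");
    total += n;
  }
  return input_ncodeunits * total;
}
```

A caller of the second form would test `ok()` before reading the value, e.g. `auto r = MaxCodeunitsResult(3, {1, 2});` followed by `r.ValueOrDie()` only when `r.ok()` holds.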

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;
+
+  static int64_t TransformSimple(const uint8_t* input,
+                                 const int64_t input_string_ncodeunits,
+                                 const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static int64_t TransformDoubling(const uint8_t* input,
+                                   const int64_t input_string_ncodeunits,
+                                   const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;
+    std::memcpy(output, output_start, rem);
+    output += rem;
+    return output - output_start;
+  }
+
+  static int64_t TransformWrapper(const uint8_t* input,
+                                  const int64_t input_string_ncodeunits,
+                                  const int64_t num_repeats, uint8_t* output,
+                                  Status* st) {
+    auto transform = (num_repeats < 4) ? TransformSimple : TransformDoubling;
+    return transform(input, input_string_ncodeunits, num_repeats, output, st);
+  }
+
+  Status PreExec(KernelContext*, const ExecBatch& batch, Datum*) override {
+    // For cases with a scalar repeat count, select the best implementation once
+    // before execution. Otherwise, use TransformWrapper to select implementation
+    // when processing each value.

Review comment:
       I'm curious: is it really better to go through the `std::function` indirection than to compute `num_repeats < 4` at each iteration?
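
For reference, the two selection strategies can be sketched side by side. This is a simplified illustration, not the PR's code: the names `RepeatSimple`, `RepeatDoubling`, `RepeatPerValue`, `SelectOnce`, and `Repeat` are hypothetical, the `Status*` parameter is dropped, and repeat counts are assumed to be validated as non-negative by the caller. The remainder is computed as `num_repeats - copies` rather than the hunk's `num_repeats ^ i`; the two agree here because `copies` ends up as the largest power of two not exceeding `num_repeats`. The sketch says nothing about which option is faster; that would need a benchmark.

```c++
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>

// One memcpy per repeat.
inline int64_t RepeatSimple(const uint8_t* input, int64_t len, int64_t num_repeats,
                            uint8_t* output) {
  uint8_t* start = output;
  for (int64_t i = 0; i < num_repeats; ++i) {
    std::memcpy(output, input, len);
    output += len;
  }
  return output - start;
}

// Repeated doubling of the already written output; assumes num_repeats >= 1.
inline int64_t RepeatDoubling(const uint8_t* input, int64_t len, int64_t num_repeats,
                              uint8_t* output) {
  uint8_t* start = output;
  std::memcpy(output, input, len);
  output += len;
  int64_t copies = 1;
  for (int64_t ilen = len; copies <= num_repeats / 2; copies *= 2, ilen *= 2) {
    std::memcpy(output, start, ilen);
    output += ilen;
  }
  // Epilogue: copy the remaining (num_repeats - copies) repeats in one go.
  const int64_t rem = (num_repeats - copies) * len;
  std::memcpy(output, start, rem);
  output += rem;
  return output - start;
}

using RepeatFn = std::function<int64_t(const uint8_t*, int64_t, int64_t, uint8_t*)>;

// Option A: branch on `n < 4` for every value.
inline int64_t RepeatPerValue(const uint8_t* in, int64_t len, int64_t n, uint8_t* out) {
  return (n < 4) ? RepeatSimple(in, len, n, out) : RepeatDoubling(in, len, n, out);
}

// Option B: pick the implementation once for a scalar repeat count and call it
// through std::function afterwards.
inline RepeatFn SelectOnce(int64_t n) {
  return (n < 4) ? RepeatFn(RepeatSimple) : RepeatFn(RepeatDoubling);
}

// Convenience wrapper for trying either option on a std::string; requires n >= 0.
inline std::string Repeat(const RepeatFn& fn, const std::string& s, int64_t n) {
  std::string out(s.size() * static_cast<std::size_t>(n), '\0');
  fn(reinterpret_cast<const uint8_t*>(s.data()), static_cast<int64_t>(s.size()), n,
     reinterpret_cast<uint8_t*>(&out[0]));
  return out;
}
```

Both `Repeat(RepeatPerValue, s, n)` and `Repeat(SelectOnce(n), s, n)` produce the same bytes; the difference is only where the `n < 4` decision is made.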

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,69 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeat) {
+  auto values = ArrayFromJSON(
+      this->type(),
+      R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1,
+       R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])"},
+      {4,
+       R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb", "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ", "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  }};
+
+  for (const auto& pair : nrepeats_and_expected) {
+    auto num_repeat = pair.first;
+    auto expected = pair.second;
+    for (const auto& ty : IntTypes()) {
+      this->CheckVarArgs("string_repeat",
+                         {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                         this->type(), expected);
+    }
+  }
+
+  // Negative repeat count
+  for (auto num_repeat_ : {-1, -2, -5}) {
+    auto num_repeat = *arrow::MakeScalar(int64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        Invalid, ::testing::HasSubstr("Repeat count must be a non-negative integer"),
+        CallFunction("string_repeat", {values, num_repeat}));
+  }
+
+  // Floating-point repeat count
+  for (auto num_repeat_ : {0.0, 1.2, -1.3}) {
+    auto num_repeat = *arrow::MakeScalar(float64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        NotImplemented, ::testing::HasSubstr("has no kernel matching input types"),
+        CallFunction("string_repeat", {values, num_repeat}));
+  }
+}
+
+TYPED_TEST(TestStringKernels, StringRepeats) {

Review comment:
       "StringRepeat" vs "StringRepeats" is slightly confusing. Perhaps make naming more explicit?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",
+                           " (", batch[0].ToString(), ", ", batch[1].ToString(),
+                           "). Only Array/Scalar kinds are supported.");

Review comment:
       "Only Array/Scalar kinds are supported" can be misleading if not all array/scalar combinations are supported.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;

Review comment:
       `transform_`?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",

Review comment:
       Make this `TypeError` perhaps?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;

Review comment:
       According to the style conventions, `enable_array_array_` perhaps?

##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       Since there's already "binary_replace_slice", should it be called "binary_repeat" instead?







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740417952



##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       Well, [from a previous discussion](https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Stringlike.20kernels.20on.20binary.20data), I am following the pattern that a name with the `string` prefix expects/supports both binary and string-encoded data, while the `binary` prefix expects only non-encoded binary data and the `ascii`/`utf8` prefixes are for encoding-specific functions.
   There are two ways to make the functions that have either a `binary` or `string` prefix consistent:
   1. Change them all to `binary`
       * `string_repeat` --> `binary_repeat`
       * `string_is_ascii` --> `binary_is_ascii`
   2. Change them all to `string` as they seem to support both binary/string types
       * `binary_length` --> `string_length`
       * `binary_replace_slice` --> `string_replace_slice`
       * `binary_join` --> `string_join`
       * `binary_join_element_wise` --> `string_join_element_wise`







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-958101865


   Renamed function to `binary_repeat` and will keep an eye out for naming consistency as we move forward.





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741755299



##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       Good to know. Thanks!

##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       There is a minor difference in behavior when given a negative repeat count:
   * `binary_repeat` and `strrep` return an error
   * `str_dup` returns `NA`
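
   The difference can be sketched in standalone C++. This is only an illustration, not the Arrow kernel or the R code: `RepeatOrThrow` and `RepeatOrNull` are hypothetical helpers, and `std::nullopt` stands in for R's `NA`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: repeat `value` `count` times and raise on a negative
// count (the behavior described for `binary_repeat` and `strrep`).
std::string RepeatOrThrow(const std::string& value, int64_t count) {
  if (count < 0) {
    throw std::invalid_argument("Repeat count must be a non-negative integer");
  }
  std::string out;
  out.reserve(value.size() * static_cast<std::size_t>(count));
  for (int64_t i = 0; i < count; ++i) {
    out += value;
  }
  return out;
}

// Hypothetical helper: same operation, but a negative count yields a missing
// value instead of an error (the behavior described for `str_dup`).
std::optional<std::string> RepeatOrNull(const std::string& value, int64_t count) {
  if (count < 0) {
    return std::nullopt;
  }
  return RepeatOrThrow(value, count);
}
```

   A binding that wants `str_dup` semantics on top of an erroring kernel would need to map negative counts to null before calling it, which is what the second helper does.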







[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-960061949


   Renamed the [R internal function `str_dup`](https://github.com/apache/arrow/blob/master/r/R/type.R#L484) to `duplicate_string` because it was shadowing stringr's `str_dup` and the kernel binding for `binary_repeat`.
   Thanks to @thisisnic for identifying this subtle issue!





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741811378



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, BinaryRepeatWithScalarRepeat) {
+  auto values = ArrayFromJSON(this->type(),
+                              R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI",
+                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1, R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!",
+              "$. A3", "!ɑⱤⱤow"])"},
+      {4, R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb",
+              "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ",
+              "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!",
+              "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  }};
+
+  for (const auto& pair : nrepeats_and_expected) {
+    auto num_repeat = pair.first;
+    auto expected = pair.second;
+    for (const auto& ty : IntTypes()) {
+      this->CheckVarArgs("binary_repeat",
+                         {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                         this->type(), expected);
+    }
+  }
+
+  // Negative repeat count
+  for (auto num_repeat_ : {-1, -2, -5}) {
+    auto num_repeat = *arrow::MakeScalar(int64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        Invalid, ::testing::HasSubstr("Repeat count must be a non-negative integer"),
+        CallFunction("binary_repeat", {values, num_repeat}));
+  }
+
+  // Floating-point repeat count
+  for (auto num_repeat_ : {0.0, 1.2, -1.3}) {
+    auto num_repeat = *arrow::MakeScalar(float64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        NotImplemented, ::testing::HasSubstr("has no kernel matching input types"),
+        CallFunction("binary_repeat", {values, num_repeat}));
+  }
+}
+
+TYPED_TEST(TestStringKernels, BinaryRepeatWithArrayRepeat) {
+  auto values = ArrayFromJSON(this->type(),
+                              R"([null, "aAazZæÆ&", "", "b", "ɑɽⱤoW", "ıI",
+                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  for (const auto& ty : IntTypes()) {
+    auto num_repeats = ArrayFromJSON(ty, R"([100, 1, 2, 5, 2, 0, 1, 3, 2, 3])");

Review comment:
       Maybe also add a null in the num_repeats? (as that is allowed and will give a null in the result)
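
   The expected null handling can be sketched outside of Arrow as follows. This is a minimal illustration under stated assumptions: `RepeatElementWise` is a hypothetical helper, `std::optional` stands in for Arrow's validity bitmap, and non-null counts are assumed to be non-negative.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using NullableString = std::optional<std::string>;
using NullableCount = std::optional<int64_t>;

// Hypothetical element-wise repeat over two equal-length columns.
// An output slot is null when either the value or the repeat count is null.
std::vector<NullableString> RepeatElementWise(
    const std::vector<NullableString>& values,
    const std::vector<NullableCount>& counts) {
  std::vector<NullableString> out(values.size());
  for (std::size_t i = 0; i < values.size() && i < counts.size(); ++i) {
    // Both inputs must be valid to produce a value; otherwise the slot
    // keeps its default state, std::nullopt.
    if (values[i].has_value() && counts[i].has_value()) {
      std::string repeated;
      for (int64_t r = 0; r < *counts[i]; ++r) {
        repeated += *values[i];
      }
      out[i] = std::move(repeated);
    }
  }
  return out;
}
```

   With values `["a", null, "bc"]` and counts `[2, 1, null]`, only the first slot is non-null, which is the kind of case a null in `num_repeats` would exercise.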







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731394018



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       Should we add the `position` argument to the null visitors? That would require updating all of their use cases (~10) and marking the `position` value with `ARROW_UNUSED(position)` wherever it is not needed.
   
   This is probably why these visitors are not used in the unary [`StringTransformExecBase`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L363-L376).
   @pitrou Please advise.
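
For illustration only (not code from the PR): a minimal sketch of what passing `position` to the null visitor would enable. `VisitWithPosition` and `BuildOffsets` are hypothetical helpers, and a plain `std::vector<bool>` stands in for Arrow's validity bitmap.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): calls `valid_func(position)` for
// each valid slot and `null_func(position)` for each null slot.
template <typename ValidFunc, typename NullFunc>
void VisitWithPosition(const std::vector<bool>& is_valid, ValidFunc&& valid_func,
                       NullFunc&& null_func) {
  const int64_t length = static_cast<int64_t>(is_valid.size());
  for (int64_t position = 0; position < length; ++position) {
    if (is_valid[position]) {
      valid_func(position);
    } else {
      null_func(position);
    }
  }
}

// Builds string offsets. A null slot contributes no bytes but still needs
// its offset written, which is why the null visitor needs the position.
std::vector<int32_t> BuildOffsets(const std::vector<std::string>& values,
                                  const std::vector<bool>& is_valid) {
  assert(values.size() == is_valid.size());
  std::vector<int32_t> offsets(values.size() + 1, 0);
  int32_t total = 0;
  VisitWithPosition(
      is_valid,
      [&](int64_t position) {
        total += static_cast<int32_t>(values[position].size());
        offsets[position + 1] = total;
      },
      [&](int64_t position) { offsets[position + 1] = total; });
  return offsets;
}
```

With values `{"ab", "xyz", "c"}` and the middle slot null, the null callback repeats the previous offset, so the null slot has zero length in the output.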







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731394018



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       Should we add the `position` argument to the null visitors? That would require updating all of their use cases (~10) and marking the `position` value with `ARROW_UNUSED(position)` wherever it is not needed.







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r731389134



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {

Review comment:
       It seems we cannot use the `VisitBitBlocks` utilities here: the current implementation needs to set the output string offsets (`output_string_offsets`) at both non-null and null positions, so both visitors need the `position` being visited.
   ```c++
   offset_type output_ncodeunits = 0;
   for (i = 0...) {
     if (!input1.IsNull(i)) {
       ...
       offset_type encoded_bytes = Transform(...);
       ...
       output_ncodeunits += encoded_bytes;
     }
     output_string_offsets[i + 1] = output_ncodeunits;
   }
   ```
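
The loop above, fleshed out as a self-contained sketch for illustration. Standard containers stand in for Arrow arrays and buffers; `StringColumn`, `UpperTransform`, and `TransformAll` are hypothetical names, and the upper-casing transform is only a placeholder for a real `Transform()`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct StringColumn {
  std::string data;              // concatenated bytes of all strings
  std::vector<int32_t> offsets;  // offsets[i]..offsets[i + 1] delimit slot i
};

// Placeholder for Transform(): writes the upper-cased input to `output` and
// returns the number of bytes written.
int32_t UpperTransform(const std::string& input, char* output) {
  for (std::size_t j = 0; j < input.size(); ++j) {
    output[j] = static_cast<char>(std::toupper(static_cast<unsigned char>(input[j])));
  }
  return static_cast<int32_t>(input.size());
}

StringColumn TransformAll(const std::vector<std::string>& values,
                          const std::vector<bool>& is_valid) {
  assert(values.size() == is_valid.size());
  // Upper bound on the output size: sum of the lengths of valid strings.
  std::size_t max_bytes = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (is_valid[i]) max_bytes += values[i].size();
  }
  StringColumn out;
  out.data.resize(max_bytes);
  out.offsets.assign(values.size() + 1, 0);

  int32_t output_ncodeunits = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (is_valid[i]) {
      output_ncodeunits += UpperTransform(values[i], &out.data[0] + output_ncodeunits);
    }
    // Written for null and non-null slots alike.
    out.offsets[i + 1] = output_ncodeunits;
  }
  // Trim the values buffer to the bytes actually written.
  out.data.resize(static_cast<std::size_t>(output_ncodeunits));
  return out;
}
```

The offset write sits outside the validity check, which is the part a position-less null visitor cannot express.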







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706358970



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    uint8_t* output_start = output;
+    if (nrepeats > 0) {
+      // log2(k) approach

Review comment:
       OK, I will fix the stated complexity to `l * log2(k)`, where `l` is the length of the input string.
   
   In terms of performance, in isolated benchmarks I ran comparing several *copy* approaches, the log2 approach was faster in all cases where `nrepeats >= 4`, and for `nrepeats < 4` it was not significantly slower than direct copies. My initial PR had an `if-else` to handle this, but I concluded that checking the condition for every value, on top of maintaining two approaches, was not an improvement.
   
   This circles back to some of my previous comments/ideas: the Exec methods should provide a mechanism for selecting kernel `Transform/Call` variants based on these higher-level options. More on this very soon.
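
For illustration only (not the code from the PR): a minimal sketch of a doubling copy of the kind the "log2 approach" refers to. `RepeatByDoubling` and `RepeatString` are hypothetical names. Each pass copies the already-written prefix onto the end of the output, so the number of `memcpy` calls grows roughly with `log2(nrepeats)` while the total bytes written remain `input_len * nrepeats`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Writes `nrepeats` copies of `input` to `output`, which must have room for
// input_len * nrepeats bytes. Returns the number of bytes written.
int64_t RepeatByDoubling(const uint8_t* input, int64_t input_len, int64_t nrepeats,
                         uint8_t* output) {
  if (nrepeats <= 0 || input_len <= 0) return 0;
  const int64_t total = input_len * nrepeats;
  // Seed the output with one copy of the input.
  std::memcpy(output, input, static_cast<std::size_t>(input_len));
  int64_t filled = input_len;
  // Copy the written prefix onto the end; the last pass may be partial.
  // Source and destination never overlap because chunk <= filled.
  while (filled < total) {
    const int64_t chunk = std::min(filled, total - filled);
    std::memcpy(output + filled, output, static_cast<std::size_t>(chunk));
    filled += chunk;
  }
  return total;
}

// Convenience wrapper over std::string, for illustration.
std::string RepeatString(const std::string& s, int64_t nrepeats) {
  if (nrepeats <= 0 || s.empty()) return std::string();
  std::string out(s.size() * static_cast<std::size_t>(nrepeats), '\0');
  const int64_t written =
      RepeatByDoubling(reinterpret_cast<const uint8_t*>(s.data()),
                       static_cast<int64_t>(s.size()), nrepeats,
                       reinterpret_cast<uint8_t*>(&out[0]));
  assert(written == static_cast<int64_t>(out.size()));
  (void)written;
  return out;
}
```

For `RepeatString("ab", 5)` the copies are 2, 2, 4, and 2 bytes: four `memcpy` calls instead of five direct copies.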







[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-953373581


   The new changes catch invalid repeat counts early on (in [`PreExec` when Scalar](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2963-R2967) and in [`MaxCodeunits` when Array](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2888)). To catch these errors and propagate them up to the `ExecXXX` class/methods [here](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R663) and [here](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R676), [a `Status` data member was added to `StringBinaryTransformBase`](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R604). The alternative would be to add a `Status` parameter to `MaxCodeunits()` and `Transform()`.
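
For illustration only: a minimal sketch contrasting the two designs mentioned above. `ToyStatus` is a toy stand-in for `arrow::Status`, and the struct names, function names, and error message are all hypothetical rather than taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for arrow::Status, for illustration only.
struct ToyStatus {
  bool ok;
  std::string message;
};
inline ToyStatus ToyOK() { return ToyStatus{true, ""}; }
inline ToyStatus ToyInvalid(const std::string& msg) { return ToyStatus{false, msg}; }

// Option 1: the transform records the error in a data member, and the
// caller inspects that member after the call.
struct RepeatWithMember {
  ToyStatus st{true, ""};
  int64_t MaxCodeunits(int64_t input_ncodeunits, int64_t nrepeats) {
    if (nrepeats < 0) {
      st = ToyInvalid("Repeat count must be a non-negative integer");
      return -1;
    }
    return input_ncodeunits * nrepeats;
  }
};

// Option 2: the error is reported through an out-parameter, so the
// transform itself carries no error state between calls.
struct RepeatWithParam {
  int64_t MaxCodeunits(int64_t input_ncodeunits, int64_t nrepeats, ToyStatus* st) {
    if (nrepeats < 0) {
      *st = ToyInvalid("Repeat count must be a non-negative integer");
      return -1;
    }
    return input_ncodeunits * nrepeats;
  }
};

// Callers that surface the error to the layer above.
ToyStatus ExecWithMember(RepeatWithMember* transform, int64_t input_ncodeunits,
                         int64_t nrepeats, int64_t* out_size) {
  *out_size = transform->MaxCodeunits(input_ncodeunits, nrepeats);
  return transform->st;
}

ToyStatus ExecWithParam(RepeatWithParam* transform, int64_t input_ncodeunits,
                        int64_t nrepeats, int64_t* out_size) {
  ToyStatus st = ToyOK();
  *out_size = transform->MaxCodeunits(input_ncodeunits, nrepeats, &st);
  return st;
}
```

One trade-off visible in the sketch: with the data member, a stale error stays on the transform object unless it is reset, whereas the out-parameter starts fresh on every call at the cost of a wider signature.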





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740351649



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",
+                           " (", batch[0].ToString(), ", ", batch[1].ToString(),
+                           "). Only Array/Scalar kinds are supported.");

Review comment:
       Since the valid combinations are not known a priori, I decided to remove the second sentence from the error message.
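
For illustration only: a minimal sketch of a shape dispatch whose error message names the offending operand kinds without claiming which combinations are valid. `Kind`, `ShapeFlags`, `IsEnabled`, and `CheckOperands` are hypothetical; the real code reports `batch[i].ToString()` and returns a `Status`.

```cpp
#include <cassert>
#include <string>

enum class Kind { kScalar, kArray };

// Hypothetical mirror of the Enable* flags on the transform.
struct ShapeFlags {
  bool enable_scalar_scalar = true;
  bool enable_scalar_array = true;
  bool enable_array_scalar = true;
  bool enable_array_array = true;
};

std::string KindName(Kind kind) { return kind == Kind::kScalar ? "Scalar" : "Array"; }

bool IsEnabled(const ShapeFlags& flags, Kind first, Kind second) {
  if (first == Kind::kScalar) {
    return second == Kind::kScalar ? flags.enable_scalar_scalar
                                   : flags.enable_scalar_array;
  }
  return second == Kind::kScalar ? flags.enable_array_scalar
                                 : flags.enable_array_array;
}

// Returns an empty string on success, or an error message that only
// describes the operands actually received.
std::string CheckOperands(const ShapeFlags& flags, Kind first, Kind second) {
  if (IsEnabled(flags, first, second)) return "";
  return "Invalid combination of operands for binary string transform (" +
         KindName(first) + ", " + KindName(second) + ")";
}
```

Because each transform decides which flags to disable, the message cannot list the supported shapes up front; it can only echo what was passed in.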

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",
+                           " (", batch[0].ToString(), ", ", batch[1].ToString(),
+                           "). Only Array/Scalar kinds are supported.");

Review comment:
       The previous statement is false. We actually know the valid combinations, so we can print them as part of a helpful error message.
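
   A minimal sketch of what such a message could look like, assuming the enabled shapes are available as plain booleans. `ShapeFlags` and `InvalidShapeMessage` are hypothetical names used only for illustration; they stand in for the `Enable*` members of the transform and are not part of the PR:

   ```cpp
   #include <cassert>
   #include <string>
   #include <vector>

   // Hypothetical stand-in for the Enable* flags of the transform.
   struct ShapeFlags {
     bool scalar_scalar = true;
     bool scalar_array = true;
     bool array_scalar = true;
     bool array_array = true;
   };

   // Build an error message that lists the operand shapes that are enabled.
   inline std::string InvalidShapeMessage(const ShapeFlags& flags, const std::string& lhs,
                                          const std::string& rhs) {
     std::vector<std::string> allowed;
     if (flags.scalar_scalar) allowed.push_back("(Scalar, Scalar)");
     if (flags.scalar_array) allowed.push_back("(Scalar, Array)");
     if (flags.array_scalar) allowed.push_back("(Array, Scalar)");
     if (flags.array_array) allowed.push_back("(Array, Array)");

     std::string msg = "Invalid combination of operands for binary string transform (" +
                       lhs + ", " + rhs + "). Supported combinations: ";
     if (allowed.empty()) {
       msg += "none";
     } else {
       for (size_t i = 0; i < allowed.size(); ++i) {
         if (i > 0) msg += ", ";
         msg += allowed[i];
       }
     }
     return msg;
   }
   ```

   In the kernel itself the resulting string would be passed to `Status::Invalid`, as in the hunk above.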

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",
+                           " (", batch[0].ToString(), ", ", batch[1].ToString(),
+                           "). Only Array/Scalar kinds are supported.");

Review comment:
       I added it but maybe it is overkill.

##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       Well, [from a previous discussion](https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Stringlike.20kernels.20on.20binary.20data), I am following the pattern that a name with a `string` prefix expects/supports both binary and string-encoded data, while the `binary` prefix expects only non-encoded binary data and `ascii`/`utf8` are for encoding-specific functions.
   There are two ways to make the functions that have either a `binary` or `string` prefix consistent:
   1. Change them all to `binary`
       * `string_repeat` --> `binary_repeat`
       * `string_is_ascii` --> `binary_is_ascii`
   2. Change them all to `string` as they seem to support both binary/string types
       * `binary_length` --> `string_length`
       * `binary_replace_slice` --> `string_replace_slice`
       * `binary_join` --> `string_join`
       * `binary_join_element_wise` --> `string_join_element_wise`

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;
+
+  static int64_t TransformSimple(const uint8_t* input,
+                                 const int64_t input_string_ncodeunits,
+                                 const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static int64_t TransformDoubling(const uint8_t* input,
+                                   const int64_t input_string_ncodeunits,
+                                   const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;
+    std::memcpy(output, output_start, rem);
+    output += rem;
+    return output - output_start;
+  }
+
+  static int64_t TransformWrapper(const uint8_t* input,
+                                  const int64_t input_string_ncodeunits,
+                                  const int64_t num_repeats, uint8_t* output,
+                                  Status* st) {
+    auto transform = (num_repeats < 4) ? TransformSimple : TransformDoubling;
+    return transform(input, input_string_ncodeunits, num_repeats, output, st);
+  }
+
+  Status PreExec(KernelContext*, const ExecBatch& batch, Datum*) override {
+    // For cases with a scalar repeat count, select the best implementation once
+    // before execution. Otherwise, use TransformWrapper to select implementation
+    // when processing each value.

Review comment:
       I did not measure this, so I will run benchmarks to compare.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;

Review comment:
       It needs to be named `Transform` because the string binary exec expects each `StringTransform` to provide a `Transform` method with the given signature.
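
   To illustrate why the name matters, here is a simplified sketch. `RunTransform` is a hypothetical, stripped-down stand-in for the exec (the `Status*` parameter is omitted); the only thing it requires is that `transform->Transform(...)` is callable, which holds for a member function as well as for a callable data member such as a `std::function`:

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <cstring>
   #include <functional>

   // Hypothetical, simplified exec: only requires transform->Transform(...) to be callable.
   template <typename StringTransform>
   int64_t RunTransform(StringTransform* transform, const uint8_t* input, int64_t ncodeunits,
                        int64_t value2, uint8_t* output) {
     return transform->Transform(input, ncodeunits, value2, output);
   }

   // Variant 1: Transform is a regular member function.
   struct MemberFunctionTransform {
     int64_t Transform(const uint8_t* input, int64_t ncodeunits, int64_t, uint8_t* output) {
       std::memcpy(output, input, static_cast<size_t>(ncodeunits));
       return ncodeunits;
     }
   };

   // Variant 2: Transform is a data member holding a callable.
   struct CallableMemberTransform {
     std::function<int64_t(const uint8_t*, int64_t, int64_t, uint8_t*)> Transform =
         [](const uint8_t* input, int64_t ncodeunits, int64_t, uint8_t* output) -> int64_t {
       std::memcpy(output, input, static_cast<size_t>(ncodeunits));
       return ncodeunits;
     };
   };
   ```

   Both variants compile against the same exec template; renaming the member would break the call inside the exec.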

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();

Review comment:
       I was simply following the convention used for the unary string transform cases, but it definitely fits the bill here. Thanks!

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2877,159 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// An ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const int64_t num_repeats,
+                       Status*) override {
+    return input1_ncodeunits * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const int64_t num_repeats,
+                       Status*) override {
+    return input1.total_values_length() * num_repeats;
+  }
+
+  int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2& input2,
+                       Status* st) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      if (num_repeats < 0) {
+        *st = InvalidRepeatCount();
+        return num_repeats;
+      }
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  std::function<int64_t(const uint8_t*, const int64_t, const int64_t, uint8_t*, Status*)>
+      Transform;
+
+  static int64_t TransformSimple(const uint8_t* input,
+                                 const int64_t input_string_ncodeunits,
+                                 const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static int64_t TransformDoubling(const uint8_t* input,
+                                   const int64_t input_string_ncodeunits,
+                                   const int64_t num_repeats, uint8_t* output, Status*) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;
+    std::memcpy(output, output_start, rem);
+    output += rem;
+    return output - output_start;
+  }
+
+  static int64_t TransformWrapper(const uint8_t* input,
+                                  const int64_t input_string_ncodeunits,
+                                  const int64_t num_repeats, uint8_t* output,
+                                  Status* st) {
+    auto transform = (num_repeats < 4) ? TransformSimple : TransformDoubling;
+    return transform(input, input_string_ncodeunits, num_repeats, output, st);
+  }
+
+  Status PreExec(KernelContext*, const ExecBatch& batch, Datum*) override {
+    // For cases with a scalar repeat count, select the best implementation once
+    // before execution. Otherwise, use TransformWrapper to select implementation
+    // when processing each value.

Review comment:
       Using the `std::function` indirection turned out to be about 2x slower, so good call/intuition on this one.
   ```
   StringRepeat_mean    622822087 ns    622817699 ns           10 bytes_per_second=25.4509M/s items_per_second=1.68498M/s
   StringRepeat_median  623393064 ns    623390528 ns           10 bytes_per_second=25.4067M/s items_per_second=1.68205M/s
   StringRepeat_stddev   18771545 ns     18770743 ns           10 bytes_per_second=787.511k/s items_per_second=50.9153k/s
   ```
   Checking `num_repeats < 4` at each iteration
   ```
   StringRepeat_mean    313125674 ns    313123902 ns           10 bytes_per_second=50.601M/s items_per_second=3.35004M/s
   StringRepeat_median  312795031 ns    312794088 ns           10 bytes_per_second=50.6405M/s items_per_second=3.35266M/s
   StringRepeat_stddev    6484104 ns      6484645 ns           10 bytes_per_second=1068k/s items_per_second=69.0502k/s
   ```
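
   For reference, a minimal sketch of the two dispatch strategies compared above. It works on plain `std::string` buffers rather than the kernel's types, and `RepeatSimple`, `RepeatDoubling`, `RepeatPickOnce`, and `RepeatBranching` are hypothetical names chosen for illustration, not the PR's code. It only shows the shape of each approach and makes no performance claim of its own.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <cstring>
   #include <functional>
   #include <string>

   // Hypothetical stand-in for the simple transform: one memcpy per repetition.
   inline int64_t RepeatSimple(const char* in, int64_t n, int64_t reps, char* out) {
     for (int64_t i = 0; i < reps; ++i) std::memcpy(out + i * n, in, n);
     return n * reps;
   }

   // Hypothetical stand-in for the doubling transform (expects reps >= 1).
   inline int64_t RepeatDoubling(const char* in, int64_t n, int64_t reps, char* out) {
     assert(reps >= 1);
     std::memcpy(out, in, n);
     int64_t done = 1;  // number of copies written so far
     for (; done <= reps / 2; done *= 2) {
       std::memcpy(out + done * n, out, done * n);  // double what is already there
     }
     std::memcpy(out + done * n, out, (reps - done) * n);  // remaining copies
     return n * reps;
   }

   using RepeatFn = std::function<int64_t(const char*, int64_t, int64_t, char*)>;

   // Strategy 1: choose the implementation once, then call through std::function.
   inline std::string RepeatPickOnce(const std::string& s, int64_t reps) {
     assert(reps >= 0);
     RepeatFn fn = (reps < 4) ? RepeatFn(RepeatSimple) : RepeatFn(RepeatDoubling);
     std::string out(s.size() * static_cast<std::size_t>(reps), '\0');
     fn(s.data(), static_cast<int64_t>(s.size()), reps, &out[0]);
     return out;
   }

   // Strategy 2: test `reps < 4` on every call and invoke the function directly.
   inline std::string RepeatBranching(const std::string& s, int64_t reps) {
     assert(reps >= 0);
     std::string out(s.size() * static_cast<std::size_t>(reps), '\0');
     const int64_t n = static_cast<int64_t>(s.size());
     if (reps < 4) {
       RepeatSimple(s.data(), n, reps, &out[0]);
     } else {
       RepeatDoubling(s.data(), n, reps, &out[0]);
     }
     return out;
   }
   ```

   Both helpers produce the same bytes; they differ only in whether the call goes through a type-erased `std::function` or a plain branch.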

##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       I think solution 2 is better.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -19,6 +19,7 @@
 #include <cctype>
 #include <iterator>
 #include <string>
+#include <typeinfo>

Review comment:
       No, I used it when trying to print the `StringTransform` type using `typeid(t).name()` but it printed more info than needed.

##########
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##########
@@ -20,6 +20,7 @@
 #include <memory>
 #include <string>
 #include <utility>
+#include <vector>

Review comment:
       Not for this PR, so I will revert. Nevertheless, I have noticed that several files are missing includes and probably have some unneeded ones. I think this should be its own JIRA issue.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741811378



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, BinaryRepeatWithScalarRepeat) {
+  auto values = ArrayFromJSON(this->type(),
+                              R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI",
+                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
+      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
+      {1, R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!",
+              "$. A3", "!ɑⱤⱤow"])"},
+      {4, R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb",
+              "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ",
+              "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!",
+              "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
+  }};
+
+  for (const auto& pair : nrepeats_and_expected) {
+    auto num_repeat = pair.first;
+    auto expected = pair.second;
+    for (const auto& ty : IntTypes()) {
+      this->CheckVarArgs("binary_repeat",
+                         {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
+                         this->type(), expected);
+    }
+  }
+
+  // Negative repeat count
+  for (auto num_repeat_ : {-1, -2, -5}) {
+    auto num_repeat = *arrow::MakeScalar(int64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        Invalid, ::testing::HasSubstr("Repeat count must be a non-negative integer"),
+        CallFunction("binary_repeat", {values, num_repeat}));
+  }
+
+  // Floating-point repeat count
+  for (auto num_repeat_ : {0.0, 1.2, -1.3}) {
+    auto num_repeat = *arrow::MakeScalar(float64(), num_repeat_);
+    EXPECT_RAISES_WITH_MESSAGE_THAT(
+        NotImplemented, ::testing::HasSubstr("has no kernel matching input types"),
+        CallFunction("binary_repeat", {values, num_repeat}));
+  }
+}
+
+TYPED_TEST(TestStringKernels, BinaryRepeatWithArrayRepeat) {
+  auto values = ArrayFromJSON(this->type(),
+                              R"([null, "aAazZæÆ&", "", "b", "ɑɽⱤoW", "ıI",
+                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
+  for (const auto& ty : IntTypes()) {
+    auto num_repeats = ArrayFromJSON(ty, R"([100, 1, 2, 5, 2, 0, 1, 3, 2, 3])");

Review comment:
       Maybe also add a null to `num_repeats`? (That is allowed and will give a null in the result.)
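
   To make the expected semantics concrete, here is a small standalone sketch of element-wise repeat with null propagation. It does not use the Arrow API: `Slot` and `RepeatWithNulls` are hypothetical names, and validity is modeled with a plain `bool` instead of Arrow's validity bitmap.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <string>
   #include <vector>

   // Hypothetical nullable value; `valid == false` stands for a null entry.
   template <typename T>
   struct Slot {
     bool valid;
     T value;
   };

   // Repeats strings[i] repeats[i] times; a null on either side gives a null.
   inline std::vector<Slot<std::string>> RepeatWithNulls(
       const std::vector<Slot<std::string>>& strings,
       const std::vector<Slot<int64_t>>& repeats) {
     assert(strings.size() == repeats.size());
     std::vector<Slot<std::string>> out;
     out.reserve(strings.size());
     for (std::size_t i = 0; i < strings.size(); ++i) {
       if (!strings[i].valid || !repeats[i].valid) {
         out.push_back({false, std::string()});  // null in, null out
         continue;
       }
       assert(repeats[i].value >= 0);  // negative counts are rejected by the kernel
       std::string repeated;
       for (int64_t r = 0; r < repeats[i].value; ++r) repeated += strings[i].value;
       out.push_back({true, repeated});
     }
     return out;
   }
   ```

   A test case along these lines would pair a valid string with a null repeat count and check that the corresponding output slot is null.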







[GitHub] [arrow] pitrou commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741135553



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
       R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo   Bar;Héhé0Zop", "!%$^.,;"])");
 }
 
+TYPED_TEST(TestStringKernels, StringRepeatWithScalarRepeat) {

Review comment:
       Is there a place where passing a scalar for the strings argument is tested? Is it implicit in `CheckVarArgs`?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2513,6 +2878,135 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+/// A ScalarFunction that promotes integer arguments to Int64.
+struct ScalarCTypeToInt64Function : public ScalarFunction {
+  using ScalarFunction::ScalarFunction;
+
+  Result<const Kernel*> DispatchBest(std::vector<ValueDescr>* values) const override {
+    RETURN_NOT_OK(CheckArity(*values));
+
+    using arrow::compute::detail::DispatchExactImpl;
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+
+    EnsureDictionaryDecoded(values);
+
+    for (auto& descr : *values) {
+      if (is_integer(descr.type->id())) {
+        descr.type = int64();
+      }
+    }
+
+    if (auto kernel = DispatchExactImpl(this, *values)) return kernel;
+    return arrow::compute::detail::NoMatchingKernel(this, *values);
+  }
+};
+
+template <typename Type1, typename Type2>
+struct StringRepeatTransform : public StringBinaryTransformBase<Type1, Type2> {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1_ncodeunits * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const int64_t input1_ncodeunits,
+                               const ArrayType2& input2) override {
+    int64_t total_num_repeats = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_num_repeats += num_repeats;
+    }
+    return input1_ncodeunits * total_num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const int64_t num_repeats) override {
+    ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+    return input1.total_values_length() * num_repeats;
+  }
+
+  Result<int64_t> MaxCodeunits(const ArrayType1& input1,
+                               const ArrayType2& input2) override {
+    int64_t total_codeunits = 0;
+    for (int64_t i = 0; i < input2.length(); ++i) {
+      auto num_repeats = input2.GetView(i);
+      ARROW_RETURN_NOT_OK(ValidateRepeatCount(num_repeats));
+      total_codeunits += input1.GetView(i).length() * num_repeats;
+    }
+    return total_codeunits;
+  }
+
+  static Result<int64_t> TransformSimpleLoop(const uint8_t* input,
+                                             const int64_t input_string_ncodeunits,
+                                             const int64_t num_repeats, uint8_t* output) {
+    uint8_t* output_start = output;
+    for (int64_t i = 0; i < num_repeats; ++i) {
+      std::memcpy(output, input, input_string_ncodeunits);
+      output += input_string_ncodeunits;
+    }
+    return output - output_start;
+  }
+
+  static Result<int64_t> TransformDoublingString(const uint8_t* input,
+                                                 const int64_t input_string_ncodeunits,
+                                                 const int64_t num_repeats,
+                                                 uint8_t* output) {
+    uint8_t* output_start = output;
+    // Repeated doubling of string
+    std::memcpy(output, input, input_string_ncodeunits);
+    output += input_string_ncodeunits;
+    int64_t i = 1;
+    for (int64_t ilen = input_string_ncodeunits; i <= (num_repeats / 2);
+         i *= 2, ilen *= 2) {
+      std::memcpy(output, output_start, ilen);
+      output += ilen;
+    }
+
+    // Epilogue remainder
+    int64_t rem = (num_repeats ^ i) * input_string_ncodeunits;

Review comment:
       Is there a particular reason for xoring here? I guess it's fine, but it seems like this is really a subtraction?
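
   For what it is worth, the two are equivalent here: when the doubling loop exits (with `num_repeats >= 1`), `i` is the largest power of two not exceeding `num_repeats`, so its single set bit is also set in `num_repeats`, and clearing that bit with XOR gives the same value as subtracting `i`. A quick standalone check, with `DoublingCount` and `XorMatchesSubtraction` as hypothetical helper names:

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Advances i exactly as the doubling loop does and returns its final value,
   // i.e. the largest power of two that is <= n (requires n >= 1).
   inline int64_t DoublingCount(int64_t n) {
     assert(n >= 1);
     int64_t i = 1;
     while (i <= n / 2) i *= 2;
     return i;
   }

   // True when the XOR form of the remainder equals the subtraction form.
   inline bool XorMatchesSubtraction(int64_t n) {
     const int64_t i = DoublingCount(n);
     return (n ^ i) == (n - i);
   }
   ```

   Subtraction states the intent more directly, which is presumably why it reads more naturally in the epilogue.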

##########
File path: docs/source/cpp/compute.rst
##########
@@ -812,45 +812,47 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| Function name           | Arity | Input types            | Output type            | Options class                     | Notes |
-+=========================+=======+========================+========================+===================================+=======+
-| ascii_capitalize        | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_lower             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_reverse           | Unary | String-like            | String-like            |                                   | \(2)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_swapcase          | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_title             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| ascii_upper             | Unary | String-like            | String-like            |                                   | \(1)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_length           | Unary | Binary- or String-like | Int32 or Int64         |                                   | \(3)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| binary_replace_slice    | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring       | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(5)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| replace_substring_regex | Unary | Binary- or String-like | Binary- or String-like | :struct:`ReplaceSubstringOptions` | \(6)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_capitalize         | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_length             | Unary | String-like            | Int32 or Int64         |                                   | \(7)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_lower              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_replace_slice      | Unary | String-like            | String-like            | :struct:`ReplaceSliceOptions`     | \(4)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_reverse            | Unary | String-like            | String-like            |                                   | \(9)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_swapcase           | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_title              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
-| utf8_upper              | Unary | String-like            | String-like            |                                   | \(8)  |
-+-------------------------+-------+------------------------+------------------------+-----------------------------------+-------+
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| Function name           | Arity  | Input types                             | Output type            | Options class                     | Notes |
++=========================+========+=========================================+========================+===================================+=======+
+| ascii_capitalize        | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_lower             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_reverse           | Unary  | String-like                             | String-like            |                                   | \(2)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_swapcase          | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_title             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| ascii_upper             | Unary  | String-like                             | String-like            |                                   | \(1)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_length           | Unary  | Binary- or String-like                  | Int32 or Int64         |                                   | \(3)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| binary_replace_slice    | Unary  | String-like                             | Binary- or String-like | :struct:`ReplaceSliceOptions`     | \(4)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring       | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(5)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| replace_substring_regex | Unary  | String-like                             | String-like            | :struct:`ReplaceSubstringOptions` | \(6)  |
++-------------------------+--------+-----------------------------------------+------------------------+-----------------------------------+-------+
+| string_repeat           | Binary | Binary/String (Arg 0); Integral (Arg 1) | Binary- or String-like |                                   | \(7)  |

Review comment:
       Both solutions are fine with me.







[GitHub] [arrow] ursabot commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-961085402


   Benchmark runs are scheduled for baseline = 5897217ec5ee6f4f58373362a76a70618921c128 and contender = 0ead7c906dafb73c2b2829681845fe5a808a54e9. 0ead7c906dafb73c2b2829681845fe5a808a54e9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/889ba7b7a56b485ea9df25c008235283...682c7eeb6129469c84e53494bb85219e/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/091cc986a04b485bbb53f8d006dd066b...0fd03cce4b804a98883de9f7b6c658c7/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/32da2cf3bb9448869c2ef7f8003106a9...5c488ba9051d4cee94d0ec52741539fb/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   





[GitHub] [arrow] lidavidm closed pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #11023:
URL: https://github.com/apache/arrow/pull/11023


   





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r740353081



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -537,6 +549,358 @@ struct FixedSizeBinaryTransformExecWithState
   }
 };
 
+template <typename Type1, typename Type2>
+struct StringBinaryTransformBase {
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  virtual ~StringBinaryTransformBase() = default;
+
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  virtual Status InvalidInputSequence() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics for different input shapes.
+  // The Status parameter should only be set if an error needs to be signaled.
+
+  // Scalar-Scalar
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ViewType2,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Scalar-Array
+  virtual int64_t MaxCodeunits(const int64_t input1_ncodeunits, const ArrayType2&,
+                               Status*) {
+    return input1_ncodeunits;
+  }
+
+  // Array-Scalar
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ViewType2, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Array-Array
+  virtual int64_t MaxCodeunits(const ArrayType1& input1, const ArrayType2&, Status*) {
+    return input1.total_values_length();
+  }
+
+  // Not all combinations of input shapes are meaningful to string binary
+  // transforms, so these flags serve as control toggles for enabling/disabling
+  // the corresponding ones. These flags should be set in the PreExec() method.
+  //
+  // This is an example of a StringTransform that disables support for arguments
+  // with mixed Scalar/Array shapes.
+  //
+  // template <typename Type1, typename Type2>
+  // struct MyStringTransform : public StringBinaryTransformBase<Type1, Type2> {
+  //   Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) override {
+  //     EnableScalarArray = false;
+  //     EnableArrayScalar = false;
+  //     return StringBinaryTransformBase::PreExec(ctx, batch, out);
+  //   }
+  //   ...
+  // };
+  bool EnableScalarScalar = true;
+  bool EnableScalarArray = true;
+  bool EnableArrayScalar = true;
+  bool EnableArrayArray = true;
+};
+
+/// Kernel exec generator for binary (two parameters) string transforms.
+/// The first parameter is expected to always be a Binary/StringType while the
+/// second parameter is generic. Types of template parameter StringTransform
+/// need to define a transform method with the following signature:
+///
+/// int64_t Transform(const uint8_t* input, const int64_t input_string_ncodeunits,
+///                   const ViewType2 value2, uint8_t* output, Status* st);
+///
+/// where
+///   * `input` - input sequence (binary or string)
+///   * `input_string_ncodeunits` - length of input sequence in codeunits
+///   * `value2` - second argument to the string transform
+///   * `output` - output sequence (binary or string)
+///   * `st` - Status code, only set if transform needs to signal an error
+///
+/// and returns the number of codeunits of the `output` sequence or a negative
+/// value if an invalid input sequence is detected.
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ViewType2 = typename GetViewType<Type2>::T;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch[0].is_scalar()) {
+      if (batch[1].is_scalar()) {
+        if (transform->EnableScalarScalar) {
+          return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                  out);
+        }
+      } else if (batch[1].is_array()) {
+        if (transform->EnableScalarArray) {
+          return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(),
+                                 out);
+        }
+      }
+    } else if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        if (transform->EnableArrayArray) {
+          return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+        }
+      } else if (batch[1].is_scalar()) {
+        if (transform->EnableArrayScalar) {
+          return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(),
+                                 out);
+        }
+      }
+    }
+    return Status::Invalid("Invalid combination of operands for binary string transform",
+                           " (", batch[0].ToString(), ", ", batch[1].ToString(),
+                           "). Only Array/Scalar kinds are supported.");

Review comment:
       The previous statement is false: we do know the valid combinations, so we can list them in the error message to make it more helpful.
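
       A minimal sketch of how such a message could be built, assuming the four `Enable*` flags are the only source of truth for which shapes are valid. `ShapeFlags` and `InvalidShapeMessage` are hypothetical names used only for illustration, and plain `std::string` stands in for `Status::Invalid`:

   ```cpp
   #include <cstddef>
   #include <string>
   #include <vector>

   // Hypothetical stand-in for the Enable* flags of StringBinaryTransformBase.
   struct ShapeFlags {
     bool EnableScalarScalar = true;
     bool EnableScalarArray = true;
     bool EnableArrayScalar = true;
     bool EnableArrayArray = true;
   };

   // Hypothetical helper: builds the error text, listing only the shape
   // combinations that the transform has left enabled.
   std::string InvalidShapeMessage(const ShapeFlags& flags, const std::string& kind1,
                                   const std::string& kind2) {
     std::vector<std::string> supported;
     if (flags.EnableScalarScalar) supported.push_back("(Scalar, Scalar)");
     if (flags.EnableScalarArray) supported.push_back("(Scalar, Array)");
     if (flags.EnableArrayScalar) supported.push_back("(Array, Scalar)");
     if (flags.EnableArrayArray) supported.push_back("(Array, Array)");

     std::string msg = "Invalid combination of operands for binary string transform (" +
                       kind1 + ", " + kind2 + "). Supported combinations: ";
     if (supported.empty()) msg += "none";
     for (std::size_t i = 0; i < supported.size(); ++i) {
       if (i > 0) msg += ", ";
       msg += supported[i];
     }
     return msg;
   }
   ```

       In the kernel itself the resulting string would presumably be passed to `Status::Invalid(...)` at the end of `Execute()`.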







[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-960061949









[GitHub] [arrow] ianmcook commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r741575436



##########
File path: r/R/expression.R
##########
@@ -100,7 +100,8 @@
   # use `%/%` above.
   "%%" = "divide_checked",
   "^" = "power_checked",
-  "%in%" = "is_in_meta_binary"
+  "%in%" = "is_in_meta_binary",
+  "strrep" = "binary_repeat"

Review comment:
       `strrep()` is a function in base R. There is also a function [`str_dup()`](https://stringr.tidyverse.org/reference/str_dup.html) in the popular R package **stringr** that does exactly the same thing. In the R bindings we often like to add these **stringr** variants of the functions too:
   ```suggestion
     "strrep" = "binary_repeat",
     "str_dup" = "binary_repeat"
   ```

##########
File path: r/tests/testthat/test-dplyr-funcs-string.R
##########
@@ -467,6 +467,18 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("strrep", {
+  df <- tibble(x = c("foo1", " \tB a R\n", "!apACHe aRroW!"))
+  for (times in 0:8L) {
+    compare_dplyr_binding(
+      .input %>%
+        mutate(x = strrep(x, times)) %>%
+        collect(),
+      df
+    )
+  }
+})
+

Review comment:
       Adds a test for the `str_dup()` binding I suggested above. Also FYI you don't need the `L` after `8` because the `:` operator in R always creates integer vectors when its operands are whole numbers.
   ```suggestion
   test_that("strrep, str_dup", {
     df <- tibble(x = c("foo1", " \tB a R\n", "!apACHe aRroW!"))
     for (times in 0:8) {
       compare_dplyr_binding(
         .input %>%
           mutate(x = strrep(x, times)) %>%
           collect(),
         df
       )
       compare_dplyr_binding(
         .input %>%
           mutate(x = str_dup(x, times)) %>%
           collect(),
         df
       )
     }
   })
   ```







[GitHub] [arrow] jonkeane commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-947108791


   I agree with David about erroring: it seems a bit odd to me that negative values are silently assumed to be 0.
   
   In the first message, you have the following error:
   
   ```
   Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R
   ```
   
   That seems odd; I would expect `3` to be fine. Is there a missing `-` somewhere?





[GitHub] [arrow] edponce commented on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-947176289


   > For 1: why support floating point or boolean arguments in the first place? It seems quite odd. I think those should be explicit casts if the user wants them, and they can choose safe/unsafe cast as appropriate.
   
   I was trying to mimic Python's behavior and implicit casts, but I agree that this should be deferred to explicit casts if needed. Also, I think returning an error for an invalid repeat value is more reasonable than silently changing the repeat count.
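   
   A minimal sketch of the "error instead of clamping" behavior, assuming an integer repeat count only. `RepeatResult` and `RepeatString` are hypothetical names for illustration; the real kernel would report the error through `arrow::Status` rather than this small struct:

   ```cpp
   #include <cstdint>
   #include <string>

   // Hypothetical minimal result type standing in for arrow::Status/Result.
   struct RepeatResult {
     bool ok;
     std::string value;  // repeated string when ok, error text otherwise
   };

   // Repeats `input` `num_repeats` times; a negative count is reported as an
   // error instead of being silently treated as 0.
   RepeatResult RepeatString(const std::string& input, int64_t num_repeats) {
     if (num_repeats < 0) {
       return {false, "Repeat count must be a non-negative integer"};
     }
     std::string out;
     for (int64_t i = 0; i < num_repeats; ++i) {
       out += input;
     }
     return {true, out};
   }
   ```

   With this shape, a count of 0 still yields an empty string, while a negative count surfaces to the caller instead of being hidden.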
   
   





[GitHub] [arrow] edponce edited a comment on pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce edited a comment on pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#issuecomment-947272294


   [Validation of the repeat count occurs here](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R2903-R2908), but only when it is a Scalar value. If it is an Array, no validation occurs and any error is deferred to [output allocation](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R776).
   
   I understand we should be consistent, so should we validate the repeat count for Arrays as well and accept the performance hit, or should we not validate at all and let the output allocation error out?
   
   It is difficult to error out from inside the transform because it [does not output a `Status`, simply the number of transformed/encoded bytes](https://github.com/apache/arrow/pull/11023/files#diff-eb8300bc4dea7d1c46b2576b7dbd8e42b927ab7d42c031f4aecae892a72ee244R793), in contrast to the arithmetic kernels, which have a `Status` parameter.
   
   Based on this thought process, I am leaning towards either performing no validation at all or adding a `Status` parameter to the string transform and performing validation during processing.
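   
   A rough sketch of the second option, where the transform takes a status out-parameter and validates each repeat count as it is processed. `SketchStatus` is a hypothetical stand-in for `arrow::Status`, and `RepeatTransform` is an illustrative free function, not the kernel's actual method:

   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <cstring>
   #include <string>

   // Hypothetical stand-in for arrow::Status, just enough for this sketch.
   struct SketchStatus {
     bool ok = true;
     std::string message;
   };

   // Writes `num_repeats` copies of `input` into `output` and returns the number
   // of bytes written. `*st` is only modified when an error needs to be signaled.
   // The caller must have allocated at least input_ncodeunits * num_repeats bytes.
   int64_t RepeatTransform(const uint8_t* input, int64_t input_ncodeunits,
                           int64_t num_repeats, uint8_t* output, SketchStatus* st) {
     if (num_repeats < 0) {
       st->ok = false;
       st->message = "Repeat count must be a non-negative integer";
       return 0;
     }
     uint8_t* cursor = output;
     for (int64_t i = 0; i < num_repeats; ++i) {
       std::memcpy(cursor, input, static_cast<std::size_t>(input_ncodeunits));
       cursor += input_ncodeunits;
     }
     return input_ncodeunits * num_repeats;
   }
   ```

   Because validation happens per value inside the loop that already visits each element, the Scalar and Array cases would behave the same without a separate pass over the repeat counts.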





[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706925612



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -2357,6 +2584,79 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+template <typename Type1, typename Type2>
+struct StrRepeatTransform : public StringBinaryTransformBase {
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<Scalar>& input2) override {
+    auto nrepeats = static_cast<int64_t>(UnboxScalar<Type2>::Unbox(*input2));
+    return std::max(input_ncodeunits * nrepeats, int64_t(0));
+  }
+
+  int64_t MaxCodeunits(int64_t inputs, int64_t input_ncodeunits,
+                       const std::shared_ptr<ArrayData>& data2) override {
+    ArrayType2 array2(data2);
+    // Ideally, we would like to calculate the exact output size by iterating over
+    // all strings offsets and summing each length multiplied by the corresponding repeat
+    // value, but this requires traversing the data twice (now and during transform).
+    // The upper limit is to assume that all strings are repeated the max number of
+    // times knowing that a resize operation is performed at end of execution.
+    auto max_nrepeats =
+        static_cast<int64_t>(**std::max_element(array2.begin(), array2.end()));
+    return std::max(input_ncodeunits * max_nrepeats, int64_t(0));
+  }
+
+  int64_t Transform(const uint8_t* input, int64_t input_string_ncodeunits,
+                    const std::shared_ptr<Scalar>& input2, uint8_t* output) {

Review comment:
       Now that we use `GetView()`, we do have the raw data type, so it works. Thanks!







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706345457



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");

Review comment:
       I copied the same message used in the "unary" `StringTransformExecBase`. Note that the term _array_ here refers to the buffer holding the string value, which is allocated with the size returned by `MaxCodeunits()`.
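
       For illustration, the capacity check can be sketched in isolation as below. `FitsInOffsets` is a hypothetical helper name; the assumption is that `offset_type` is `int32_t` for the regular binary/string types and `int64_t` for the large variants:

   ```cpp
   #include <cstdint>
   #include <limits>

   // Hypothetical helper: returns true when a value buffer of `max_codeunits`
   // bytes can still be addressed by offsets of type `OffsetType`.
   template <typename OffsetType>
   bool FitsInOffsets(int64_t max_codeunits) {
     return max_codeunits <=
            static_cast<int64_t>(std::numeric_limits<OffsetType>::max());
   }
   ```

       Since the check is made against the upper bound from `MaxCodeunits()` rather than the exact output size, it can reject inputs whose actual result would have fit, which is why the message says the result "might not fit".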







[GitHub] [arrow] edponce commented on a change in pull request #11023: ARROW-12712: [C++] String repeat kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #11023:
URL: https://github.com/apache/arrow/pull/11023#discussion_r706353661



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -417,6 +419,231 @@ struct StringTransformExecWithState
   }
 };
 
+struct StringBinaryTransformBase {
+  virtual Status PreExec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return Status::OK();
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<Scalar>& input2) {
+    return input_ncodeunits;
+  }
+
+  // Return the maximum total size of the output in codeunits (i.e. bytes)
+  // given input characteristics.
+  virtual int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits,
+                               const std::shared_ptr<ArrayData>& data2) {
+    return input_ncodeunits;
+  }
+
+  virtual Status InvalidStatus() {
+    return Status::Invalid("Invalid UTF8 sequence in input");
+  }
+};
+
+/// Kernel exec generator for binary string transforms.
+/// The first parameter is expected to always be a string type while the second parameter
+/// is generic. It supports executions of the form:
+///   * Scalar, Scalar
+///   * Array, Scalar - scalar is broadcasted and paired with all values of array
+///   * Array, Array - arrays are processed element-wise
+///   * Scalar, Array - not supported by default
+template <typename Type1, typename Type2, typename StringTransform>
+struct StringBinaryTransformExecBase {
+  using offset_type = typename Type1::offset_type;
+  using ArrayType1 = typename TypeTraits<Type1>::ArrayType;
+  using ArrayType2 = typename TypeTraits<Type2>::ArrayType;
+
+  static Status Execute(KernelContext* ctx, StringTransform* transform,
+                        const ExecBatch& batch, Datum* out) {
+    if (batch.num_values() != 2) {
+      return Status::Invalid("Invalid arity for binary string transform");
+    }
+
+    if (batch[0].is_array()) {
+      if (batch[1].is_array()) {
+        return ExecArrayArray(ctx, transform, batch[0].array(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecArrayScalar(ctx, transform, batch[0].array(), batch[1].scalar(), out);
+      }
+    } else if (batch[0].is_scalar()) {
+      if (batch[1].is_array()) {
+        return ExecScalarArray(ctx, transform, batch[0].scalar(), batch[1].array(), out);
+      } else if (batch[1].is_scalar()) {
+        return ExecScalarScalar(ctx, transform, batch[0].scalar(), batch[1].scalar(),
+                                out);
+      }
+    }
+    return Status::Invalid("Invalid ExecBatch kind for binary string transform");
+  }
+
+  static Status ExecScalarScalar(KernelContext* ctx, StringTransform* transform,
+                                 const std::shared_ptr<Scalar>& scalar1,
+                                 const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar1->is_valid || !scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    const auto& input1 = checked_cast<const BaseBinaryScalar&>(*scalar1);
+    auto input_ncodeunits = input1.value->size();
+    auto input_nstrings = 1;
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input_nstrings, input_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ARROW_ASSIGN_OR_RAISE(auto value_buffer, ctx->Allocate(output_ncodeunits_max));
+    auto result = checked_cast<BaseBinaryScalar*>(out->scalar().get());
+    result->is_valid = true;
+    result->value = value_buffer;
+    auto output_str = value_buffer->mutable_data();
+
+    auto input1_string = input1.value->data();
+    auto encoded_nbytes = static_cast<offset_type>(
+        transform->Transform(input1_string, input_ncodeunits, scalar2, output_str));
+    if (encoded_nbytes < 0) {
+      return transform->InvalidStatus();
+    }
+    DCHECK_LE(encoded_nbytes, output_ncodeunits_max);
+    return value_buffer->Resize(encoded_nbytes, /*shrink_to_fit=*/true);
+  }
+
+  static Status ExecArrayScalar(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<ArrayData>& data1,
+                                const std::shared_ptr<Scalar>& scalar2, Datum* out) {
+    if (!scalar2->is_valid) {
+      return Status::OK();
+    }
+
+    ArrayType1 input1(data1);
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, scalar2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto encoded_nbytes = static_cast<offset_type>(
+            transform->Transform(input1_string, input1_string_ncodeunits, scalar2,
+                                 output_str + output_ncodeunits));
+        if (encoded_nbytes < 0) {
+          return transform->InvalidStatus();
+        }
+        output_ncodeunits += encoded_nbytes;
+      }
+      output_string_offsets[i + 1] = output_ncodeunits;
+    }
+    DCHECK_LE(output_ncodeunits, output_ncodeunits_max);
+
+    // Trim the codepoint buffer, since we allocated too much
+    return values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true);
+    return Status::OK();
+  }
+
+  static Status ExecScalarArray(KernelContext* ctx, StringTransform* transform,
+                                const std::shared_ptr<Scalar>& scalar1,
+                                const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    return Status::NotImplemented(
+        "Binary string transforms with (scalar, array) inputs are not supported for the "
+        "general case");
+  }
+
+  static Status ExecArrayArray(KernelContext* ctx, StringTransform* transform,
+                               const std::shared_ptr<ArrayData>& data1,
+                               const std::shared_ptr<ArrayData>& data2, Datum* out) {
+    ArrayType1 input1(data1);
+    ArrayType2 input2(data2);
+
+    auto input1_ncodeunits = input1.total_values_length();
+    auto input1_nstrings = input1.length();
+    auto output_ncodeunits_max =
+        transform->MaxCodeunits(input1_nstrings, input1_ncodeunits, data2);
+    if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+      return Status::CapacityError(
+          "Result might not fit in a 32bit utf8 array, convert to large_utf8");
+    }
+
+    ArrayData* output = out->mutable_array();
+    ARROW_ASSIGN_OR_RAISE(auto values_buffer, ctx->Allocate(output_ncodeunits_max));
+    output->buffers[2] = values_buffer;
+
+    // String offsets are preallocated
+    auto output_string_offsets = output->GetMutableValues<offset_type>(1);
+    auto output_str = output->buffers[2]->mutable_data();
+    output_string_offsets[0] = 0;
+
+    offset_type output_ncodeunits = 0;
+    for (int64_t i = 0; i < input1_nstrings; ++i) {
+      if (!input1.IsNull(i) || !input2.IsNull(i)) {
+        offset_type input1_string_ncodeunits;
+        auto input1_string = input1.GetValue(i, &input1_string_ncodeunits);
+        auto scalar2 = *input2.GetScalar(i);

Review comment:
       We agree that the implementation and its performance can be improved, but at least this first pass makes it easier to add binary kernels.



