You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/15 10:04:50 UTC

[GitHub] [arrow] maartenbreddels opened a new pull request #8468: ARROW-10306: [C++] Add string replacement kernel

maartenbreddels opened a new pull request #8468:
URL: https://github.com/apache/arrow/pull/8468


   Two new kernels
    * replace_substring like Python's str.replace
    * replace_substring_re2  like Python's re.sub
   
   TODO:
    * [ ] Rebase after #8459 is not, and the re2 installation/linking is solved
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-732844513


   @maartenbreddels would it be practical to split this into two PRs? (one for plain replace, other for re2-based replace) Or would you prefer first to have the re2 dependency issue resolved? (ARROW-10541)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8468:
URL: https://github.com/apache/arrow/pull/8468


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r593128772



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment:
       This pattern occurs more often in the file, I didn't realize this lead to slower compilation and probably larger binary sizes. I think it requires a refactor that is larger than this PR. Also, I won't have the time currently to do this. Can we merge this as is, and I'll open a Jira issue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-712295030


   You may want to remove the regex variant of this if you want to move this forward without depending on resolving the re2 dependency issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-733870297


   @pitrou this is ready for review, failure seems unrelated (minio on windows).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-762872492


   @pitrou Yes, apart from an unanswered question this is ready for review 👍 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r560240161



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment:
       Well, I don't understand why `offset_type` is being used here. `ValueDataBuilder` is basically a `TypedBufferBuilder<uint8_t>`, it's used for building the string data, it doesn't deal with string offsets.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r545637801



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment:
       Maybe I misunderstand, but via `offset_type` we are not independent of `Type` right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-807541121


   Haven't seen that before, and we've been building with re2 for months now. Maybe this is the first time we're building something that actually uses re2? It's possible that re2 needs a "backport" library built with the rtools3.5 toolchain; an immediate workaround could be to build with `-DRE2_SOURCE=BUNDLED` here, or to make `ARROW_WITH_RE2` conditional on the toolchain (like we do for `ARROW_S3` already). Given that we won't have to support the rtools3.5 toolchain after April or May, I'll try just turning it off here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-762854225


   @maartenbreddels Is it ready for review again? Feel free to ping me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-807607431


   @github-actions crossbow submit -g r


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-733803300


   I'd rather keep this 1 PR, looks like #8756 is working


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r560262935



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment:
       Ok, yes, I see it now. I'm using `offset_type` in this class as well, which I shouldn't, I think that's what led me to this. This requires a bit of refactoring.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-806722446


   (gentle ping here, would really like to see those PRs merged for 4.0 in April!)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-807095671


   I'll rebase and update this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r531163812



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment:
       `ReplaceString` below is basically independent from `Type`, but using this idiom may compile it twice. Can you find another way to parametrize the kernel?
   (hint: perhaps use composition rather than inheritance)

##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -68,6 +68,18 @@ struct ARROW_EXPORT SplitPatternOptions : public SplitOptions {
   std::string pattern;
 };
 
+struct ARROW_EXPORT ReplaceSubstringOptions : public FunctionOptions {
+  explicit ReplaceSubstringOptions(std::string pattern, std::string replacement,
+                                   int64_t max_replacements = -1)
+      : pattern(pattern), replacement(replacement), max_replacements(max_replacements) {}
+
+  /// Literal pattern, or regular expression depending on is_regex

Review comment:
       Hmm... I don't see `is_regex` here?

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {
+  using Base = ReplaceSubStringBase<Type, ReplaceSubString<Type>>;
+  using ValueDataBuilder = typename Base::ValueDataBuilder;
+  using offset_type = typename Base::offset_type;
+
+  ReplaceSubstringOptions options;
+  explicit ReplaceSubString(KernelContext* ctx, ReplaceSubstringOptions options)
+      : options(options) {}
+
+  Status ReplaceString(util::string_view s, ValueDataBuilder* builder) {
+    const char* i = s.begin();
+    const char* end = s.end();
+    int64_t max_replacements = options.max_replacements;
+    while ((i < end) && (max_replacements != 0)) {
+      const char* pos =
+          std::search(i, end, options.pattern.begin(), options.pattern.end());
+      if (pos == end) {
+        RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                      static_cast<offset_type>(end - i)));
+        i = end;
+      } else {
+        // the string before the pattern
+        RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                      static_cast<offset_type>(pos - i)));
+        // the replacement
+        RETURN_NOT_OK(
+            builder->Append(reinterpret_cast<const uint8_t*>(options.replacement.data()),
+                            options.replacement.length()));
+        // skip pattern
+        i = pos + options.pattern.length();
+        max_replacements--;
+      }
+    }
+    // if we exited early due to max_replacements, add the trailing part
+    RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                  static_cast<offset_type>(end - i)));
+    return Status::OK();
+  }
+};
+
+const FunctionDoc replace_substring_doc(
+    "Replace non-overlapping substrings that match pattern by replacement",
+    ("For each string in `strings`, replace non-overlapping substrings that match\n"
+     "`pattern` by `replacement`. If `max_replacements != -1`, it determines the\n"
+     "maximum amount of replacements made, counting from the left. Null values emit\n"
+     "null."),
+    {"strings"}, "ReplaceSubstringOptions");
+
+#ifdef ARROW_WITH_RE2
+template <typename Type>
+struct ReplaceSubStringRE2 : ReplaceSubStringBase<Type, ReplaceSubStringRE2<Type>> {
+  using Base = ReplaceSubStringBase<Type, ReplaceSubStringRE2<Type>>;
+  using ValueDataBuilder = typename Base::ValueDataBuilder;
+  using offset_type = typename Base::offset_type;
+
+  ReplaceSubstringOptions options;
+  RE2 regex_find;
+  RE2 regex_replacement;
+  explicit ReplaceSubStringRE2(KernelContext* ctx, ReplaceSubstringOptions options)
+      : options(options),
+        regex_find("(" + options.pattern + ")"),
+        regex_replacement(options.pattern) {
+    // Using RE2::FindAndConsume we can only find the pattern if it is a group, therefore
+    // we have 2 regex, one with () around it, one without.
+    if (!(regex_find.ok() && regex_replacement.ok())) {
+      ctx->SetStatus(Status::Invalid("Regular expression error"));
+      return;
+    }
+  }
+  Status ReplaceString(util::string_view s, ValueDataBuilder* builder) {
+    re2::StringPiece replacement(options.replacement);
+    if (options.max_replacements == -1) {
+      std::string s_copy(s.to_string());
+      re2::RE2::GlobalReplace(&s_copy, regex_replacement, replacement);
+      RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(s_copy.data()),
+                                    s_copy.length()));
+      return Status::OK();
+    }
+    // Since RE2 does not have the concept of max_replacements, we have to do some work
+    // ourselves.

Review comment:
       Note that the `GlobalReplace` loop works a bit differently, it calls `Match` then `Rewrite`, avoiding the duplicate matching calls. Not sure it's worth optimizing this, though:
   https://github.com/google/re2/blob/master/re2/re2.cc#L427

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {
+  using Base = ReplaceSubStringBase<Type, ReplaceSubString<Type>>;
+  using ValueDataBuilder = typename Base::ValueDataBuilder;
+  using offset_type = typename Base::offset_type;
+
+  ReplaceSubstringOptions options;
+  explicit ReplaceSubString(KernelContext* ctx, ReplaceSubstringOptions options)
+      : options(options) {}
+
+  Status ReplaceString(util::string_view s, ValueDataBuilder* builder) {
+    const char* i = s.begin();
+    const char* end = s.end();
+    int64_t max_replacements = options.max_replacements;
+    while ((i < end) && (max_replacements != 0)) {
+      const char* pos =
+          std::search(i, end, options.pattern.begin(), options.pattern.end());
+      if (pos == end) {
+        RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                      static_cast<offset_type>(end - i)));
+        i = end;
+      } else {
+        // the string before the pattern
+        RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                      static_cast<offset_type>(pos - i)));
+        // the replacement
+        RETURN_NOT_OK(
+            builder->Append(reinterpret_cast<const uint8_t*>(options.replacement.data()),
+                            options.replacement.length()));
+        // skip pattern
+        i = pos + options.pattern.length();
+        max_replacements--;
+      }
+    }
+    // if we exited early due to max_replacements, add the trailing part
+    RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                  static_cast<offset_type>(end - i)));
+    return Status::OK();
+  }
+};
+
+const FunctionDoc replace_substring_doc(
+    "Replace non-overlapping substrings that match pattern by replacement",
+    ("For each string in `strings`, replace non-overlapping substrings that match\n"
+     "`pattern` by `replacement`. If `max_replacements != -1`, it determines the\n"
+     "maximum amount of replacements made, counting from the left. Null values emit\n"
+     "null."),
+    {"strings"}, "ReplaceSubstringOptions");
+
+#ifdef ARROW_WITH_RE2
+template <typename Type>
+struct ReplaceSubStringRE2 : ReplaceSubStringBase<Type, ReplaceSubStringRE2<Type>> {

Review comment:
       Similarly as above, this looks basically independent from `Type`.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -416,6 +424,28 @@ TYPED_TEST(TestStringKernels, SplitWhitespaceUTF8Reverse) {
                    &options_max);
 }
 
+#ifdef ARROW_WITH_RE2
+TYPED_TEST(TestStringKernels, ReplaceSubstringNormal) {
+  ReplaceSubstringOptions options{"foo", "bazz"};
+  this->CheckUnary("replace_substring", R"(["foo", "this foo that foo", null])",
+                   this->type(), R"(["bazz", "this bazz that bazz", null])", &options);
+  ReplaceSubstringOptions options_regex{"(fo+)\\s*", "\\1-bazz", -1};
+  this->CheckUnary("replace_substring_re2", R"(["foo ", "this foo   that foo", null])",
+                   this->type(), R"(["foo-bazz", "this foo-bazzthat foo-bazz", null])",
+                   &options_regex);

Review comment:
       Can you add a test with potential tricky cases? For example `text="aaaaaa", match="(a.a)", replacement="ab\1"`.

##########
File path: docs/source/cpp/compute.rst
##########
@@ -355,19 +355,23 @@ The third set of functions examines string elements on a byte-per-byte basis:
 String transforms
 ~~~~~~~~~~~~~~~~~
 
-+--------------------------+------------+-------------------------+---------------------+---------+
-| Function name            | Arity      | Input types             | Output type         | Notes   |
-+==========================+============+=========================+=====================+=========+
-| ascii_lower              | Unary      | String-like             | String-like         | \(1)    |
-+--------------------------+------------+-------------------------+---------------------+---------+
-| ascii_upper              | Unary      | String-like             | String-like         | \(1)    |
-+--------------------------+------------+-------------------------+---------------------+---------+
-| binary_length            | Unary      | Binary- or String-like  | Int32 or Int64      | \(2)    |
-+--------------------------+------------+-------------------------+---------------------+---------+
-| utf8_lower               | Unary      | String-like             | String-like         | \(3)    |
-+--------------------------+------------+-------------------------+---------------------+---------+
-| utf8_upper               | Unary      | String-like             | String-like         | \(3)    |
-+--------------------------+------------+-------------------------+---------------------+---------+
++--------------------------+------------+-------------------------+---------------------+-------------------------------------------------+
+| Function name            | Arity      | Input types             | Output type         | Notes   | Options class                         |
++==========================+============+=========================+=====================+=========+=======================================+
+| ascii_lower              | Unary      | String-like             | String-like         | \(1)    |                                       |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| ascii_upper              | Unary      | String-like             | String-like         | \(1)    |                                       |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| binary_length            | Unary      | Binary- or String-like  | Int32 or Int64      | \(2)    |                                       |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| replace_substring        | Unary      | String-like             | String-like         | \(3)    | :struct:`ReplaceSubstringOptions`     |
++--------------------------+------------+-------------------------+---------------------+---------+---------------------------------------+
+| replace_substring_re2    | Unary      | String-like             | String-like         | \(4)    | :struct:`ReplaceSubstringOptions`     |

Review comment:
       Please don't put "re2" in any of the public names or APIs. Using the re2 library is just an implementation detail.
   "replace_regex" sounds just as good.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r596070474



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {

Review comment:
       Well, you don't need to refactor other kernels for now, but I suppose this one could easily be adapted, no? :-)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-807336839


   Hmm, there's a re2 link error on RTools 3.5:
   https://github.com/pitrou/arrow/runs/2195976671?check_suite_focus=true#step:12:178
   
   @nealrichardson 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#discussion_r545869942



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -1194,6 +1198,197 @@ void AddSplit(FunctionRegistry* registry) {
 #endif
 }
 
+// ----------------------------------------------------------------------
+// replace substring
+
+template <typename Type, typename Derived>
+struct ReplaceSubStringBase {
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using ScalarType = typename TypeTraits<Type>::ScalarType;
+  using BuilderType = typename TypeTraits<Type>::BuilderType;
+  using offset_type = typename Type::offset_type;
+  using ValueDataBuilder = TypedBufferBuilder<uint8_t>;
+  using OffsetBuilder = TypedBufferBuilder<offset_type>;
+  using State = OptionsWrapper<ReplaceSubstringOptions>;
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    Derived derived(ctx, State::Get(ctx));
+    if (ctx->status().ok()) {
+      derived.Replace(ctx, batch, out);
+    }
+  }
+  void Replace(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    std::shared_ptr<ValueDataBuilder> value_data_builder =
+        std::make_shared<ValueDataBuilder>();
+    std::shared_ptr<OffsetBuilder> offset_builder = std::make_shared<OffsetBuilder>();
+
+    if (batch[0].kind() == Datum::ARRAY) {
+      // We already know how many strings we have, so we can use Reserve/UnsafeAppend
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Reserve(batch[0].array()->length));
+
+      const ArrayData& input = *batch[0].array();
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Append(0));  // offsets start at 0
+      KERNEL_RETURN_IF_ERROR(
+          ctx, VisitArrayDataInline<Type>(
+                   input,
+                   [&](util::string_view s) {
+                     RETURN_NOT_OK(static_cast<Derived&>(*this).ReplaceString(
+                         s, value_data_builder.get()));
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   },
+                   [&]() {
+                     // offset for null value
+                     offset_builder->UnsafeAppend(
+                         static_cast<offset_type>(value_data_builder->length()));
+                     return Status::OK();
+                   }));
+      ArrayData* output = out->mutable_array();
+      KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&output->buffers[2]));
+      KERNEL_RETURN_IF_ERROR(ctx, offset_builder->Finish(&output->buffers[1]));
+    } else {
+      const auto& input = checked_cast<const ScalarType&>(*batch[0].scalar());
+      auto result = std::make_shared<ScalarType>();
+      if (input.is_valid) {
+        util::string_view s = static_cast<util::string_view>(*input.value);
+        KERNEL_RETURN_IF_ERROR(
+            ctx, static_cast<Derived&>(*this).ReplaceString(s, value_data_builder.get()));
+        KERNEL_RETURN_IF_ERROR(ctx, value_data_builder->Finish(&result->value));
+        result->is_valid = true;
+      }
+      out->value = result;
+    }
+  }
+};
+
+template <typename Type>
+struct ReplaceSubString : ReplaceSubStringBase<Type, ReplaceSubString<Type>> {
+  using Base = ReplaceSubStringBase<Type, ReplaceSubString<Type>>;
+  using ValueDataBuilder = typename Base::ValueDataBuilder;
+  using offset_type = typename Base::offset_type;
+
+  ReplaceSubstringOptions options;
+  explicit ReplaceSubString(KernelContext* ctx, ReplaceSubstringOptions options)
+      : options(options) {}
+
+  Status ReplaceString(util::string_view s, ValueDataBuilder* builder) {
+    const char* i = s.begin();
+    const char* end = s.end();
+    int64_t max_replacements = options.max_replacements;
+    while ((i < end) && (max_replacements != 0)) {
+      const char* pos =
+          std::search(i, end, options.pattern.begin(), options.pattern.end());
+      if (pos == end) {
+        RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                      static_cast<offset_type>(end - i)));
+        i = end;
+      } else {
+        // the string before the pattern
+        RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                      static_cast<offset_type>(pos - i)));
+        // the replacement
+        RETURN_NOT_OK(
+            builder->Append(reinterpret_cast<const uint8_t*>(options.replacement.data()),
+                            options.replacement.length()));
+        // skip pattern
+        i = pos + options.pattern.length();
+        max_replacements--;
+      }
+    }
+    // if we exited early due to max_replacements, add the trailing part
+    RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(i),
+                                  static_cast<offset_type>(end - i)));
+    return Status::OK();
+  }
+};
+
+const FunctionDoc replace_substring_doc(
+    "Replace non-overlapping substrings that match pattern by replacement",
+    ("For each string in `strings`, replace non-overlapping substrings that match\n"
+     "`pattern` by `replacement`. If `max_replacements != -1`, it determines the\n"
+     "maximum amount of replacements made, counting from the left. Null values emit\n"
+     "null."),
+    {"strings"}, "ReplaceSubstringOptions");
+
+#ifdef ARROW_WITH_RE2
+template <typename Type>
+struct ReplaceSubStringRE2 : ReplaceSubStringBase<Type, ReplaceSubStringRE2<Type>> {
+  using Base = ReplaceSubStringBase<Type, ReplaceSubStringRE2<Type>>;
+  using ValueDataBuilder = typename Base::ValueDataBuilder;
+  using offset_type = typename Base::offset_type;
+
+  ReplaceSubstringOptions options;
+  RE2 regex_find;
+  RE2 regex_replacement;
+  explicit ReplaceSubStringRE2(KernelContext* ctx, ReplaceSubstringOptions options)
+      : options(options),
+        regex_find("(" + options.pattern + ")"),
+        regex_replacement(options.pattern) {
+    // Using RE2::FindAndConsume we can only find the pattern if it is a group, therefore
+    // we have 2 regex, one with () around it, one without.
+    if (!(regex_find.ok() && regex_replacement.ok())) {
+      ctx->SetStatus(Status::Invalid("Regular expression error"));
+      return;
+    }
+  }
+  Status ReplaceString(util::string_view s, ValueDataBuilder* builder) {
+    re2::StringPiece replacement(options.replacement);
+    if (options.max_replacements == -1) {
+      std::string s_copy(s.to_string());
+      re2::RE2::GlobalReplace(&s_copy, regex_replacement, replacement);
+      RETURN_NOT_OK(builder->Append(reinterpret_cast<const uint8_t*>(s_copy.data()),
+                                    s_copy.length()));
+      return Status::OK();
+    }
+    // Since RE2 does not have the concept of max_replacements, we have to do some work
+    // ourselves.

Review comment:
       Good idea, I prefer to keep it as it is, I left a comment in the code so this doesn't get lost.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-709073664


   https://issues.apache.org/jira/browse/ARROW-10306


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-807881416


   I made ARROW-12094 for fixing that build. Merging now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-797553590


   @pitrou this is ready for review (assuming you agree with the above plan of doing a refactor later on)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-807614057


   Revision: 82bb60d7c097bcb89aed458b3a59145cb9133538
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-233](https://github.com/ursacomputing/crossbow/branches/all?query=actions-233)
   
   |Task|Status|
   |----|------|
   |conda-linux-gcc-py36-cpu-r36|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-conda-linux-gcc-py36-cpu-r36)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-conda-linux-gcc-py36-cpu-r36)|
   |conda-linux-gcc-py37-cpu-r40|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-conda-linux-gcc-py37-cpu-r40)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-conda-linux-gcc-py37-cpu-r40)|
   |conda-osx-clang-py36-r36|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-conda-osx-clang-py36-r36)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-conda-osx-clang-py36-r36)|
   |conda-osx-clang-py37-r40|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-conda-osx-clang-py37-r40)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-conda-osx-clang-py37-r40)|
   |conda-win-vs2017-py36-r36|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-conda-win-vs2017-py36-r36)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-conda-win-vs2017-py36-r36)|
   |conda-win-vs2017-py37-r40|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-conda-win-vs2017-py37-r40)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-conda-win-vs2017-py37-r40)|
   |homebrew-r-autobrew|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-233-github-homebrew-r-autobrew)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-233-github-homebrew-r-autobrew)|
   |test-r-install-local|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-233-github-test-r-install-local)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-233-github-test-r-install-local)|
   |test-r-linux-as-cran|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-233-github-test-r-linux-as-cran)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-233-github-test-r-linux-as-cran)|
   |test-r-minimal-build|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-minimal-build)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-minimal-build)|
   |test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-rhub-ubuntu-gcc-release)|
   |test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-rocker-r-base-latest)|
   |test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-rstudio-r-base-3.6-bionic)|
   |test-r-rstudio-r-base-3.6-centos7-devtoolset-8|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-rstudio-r-base-3.6-centos7-devtoolset-8)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-rstudio-r-base-3.6-centos7-devtoolset-8)|
   |test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-rstudio-r-base-3.6-centos8)|
   |test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   |test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-r-rstudio-r-base-3.6-opensuse42)|
   |test-r-version-compatibility|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-233-github-test-r-version-compatibility)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-233-github-test-r-version-compatibility)|
   |test-r-versions|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-233-github-test-r-versions)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-233-github-test-r-versions)|
   |test-ubuntu-18.04-r-sanitizer|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-233-azure-test-ubuntu-18.04-r-sanitizer)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-233-azure-test-ubuntu-18.04-r-sanitizer)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8468: ARROW-10306: [C++] Add string replacement kernel

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8468:
URL: https://github.com/apache/arrow/pull/8468#issuecomment-807807284


   The as-cran failure is legit: https://github.com/ursacomputing/crossbow/runs/2197449739?check_suite_focus=true#step:7:210
   
   It's a weird build setup that uses `clang` and `-stdlib=libc++`; unfortunately we have to support it because it's one of CRAN's checks. But I'll deal with this in a followup.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org