You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/12 13:13:12 UTC

[GitHub] [arrow] wesm opened a new pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

wesm opened a new pull request #7418:
URL: https://github.com/apache/arrow/pull/7418


   Following on discussion in #7357. I added a simple benchmark also.
   
   ```
   --------------------------------------------------
   Benchmark           Time           CPU Iterations
   --------------------------------------------------
   AsciiLower    4826815 ns    4826783 ns        143   207.177M items/s
   AsciiUpper    4846000 ns    4845994 ns        136   206.356M items/s
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439451091



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -37,26 +39,108 @@ struct AsciiLength {
   }
 };
 
-struct AsciiUpper {
-  // XXX: the Scalar codegen path passes template arguments that are unused
-  template <typename... Ignored>
-  static std::string Call(KernelContext*, const util::string_view& val) {
-    std::string result = val.to_string();
-    std::transform(result.begin(), result.end(), result.begin(),
-                   [](unsigned char c) { return std::toupper(c); });
-    return result;
+using TransformFunc = std::function<void(const uint8_t*, int64_t, uint8_t*)>;
+
+void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,
+                         TransformFunc transform, Datum* out) {
+  if (batch[0].kind() == Datum::ARRAY) {
+    const ArrayData& input = *batch[0].array();
+    ArrayData* out_arr = out->mutable_array();
+    // Reuse offsets from input
+    out_arr->buffers[1] = input.buffers[1];

Review comment:
       https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#buffer-listing-for-each-layout




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439448175



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -37,26 +39,108 @@ struct AsciiLength {
   }
 };
 
-struct AsciiUpper {
-  // XXX: the Scalar codegen path passes template arguments that are unused
-  template <typename... Ignored>
-  static std::string Call(KernelContext*, const util::string_view& val) {
-    std::string result = val.to_string();
-    std::transform(result.begin(), result.end(), result.begin(),
-                   [](unsigned char c) { return std::toupper(c); });
-    return result;
+using TransformFunc = std::function<void(const uint8_t*, int64_t, uint8_t*)>;
+
+void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,
+                         TransformFunc transform, Datum* out) {
+  if (batch[0].kind() == Datum::ARRAY) {
+    const ArrayData& input = *batch[0].array();
+    ArrayData* out_arr = out->mutable_array();
+    // Reuse offsets from input
+    out_arr->buffers[1] = input.buffers[1];

Review comment:
       These buffers[1], buffers[2] are mysterious to me. Any hint to figure it out? Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pierrebelzile commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
pierrebelzile commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439754427



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -37,26 +39,108 @@ struct AsciiLength {
   }
 };
 
-struct AsciiUpper {
-  // XXX: the Scalar codegen path passes template arguments that are unused
-  template <typename... Ignored>
-  static std::string Call(KernelContext*, const util::string_view& val) {
-    std::string result = val.to_string();
-    std::transform(result.begin(), result.end(), result.begin(),
-                   [](unsigned char c) { return std::toupper(c); });
-    return result;
+using TransformFunc = std::function<void(const uint8_t*, int64_t, uint8_t*)>;
+
+void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,
+                         TransformFunc transform, Datum* out) {
+  if (batch[0].kind() == Datum::ARRAY) {
+    const ArrayData& input = *batch[0].array();
+    ArrayData* out_arr = out->mutable_array();
+    // Reuse offsets from input
+    out_arr->buffers[1] = input.buffers[1];
+    int64_t data_nbytes = input.buffers[2]->size();
+    KERNEL_RETURN_IF_ERROR(ctx, ctx->Allocate(data_nbytes).Value(&out_arr->buffers[2]));
+    transform(input.buffers[2]->data(), data_nbytes, out_arr->buffers[2]->mutable_data());

Review comment:
       Isn’t it necessary to consider offset of slice?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#issuecomment-643268699


   I get similar numbers here. It seems to be a 15x speedup over git master.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439755593



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -37,26 +39,108 @@ struct AsciiLength {
   }
 };
 
-struct AsciiUpper {
-  // XXX: the Scalar codegen path passes template arguments that are unused
-  template <typename... Ignored>
-  static std::string Call(KernelContext*, const util::string_view& val) {
-    std::string result = val.to_string();
-    std::transform(result.begin(), result.end(), result.begin(),
-                   [](unsigned char c) { return std::toupper(c); });
-    return result;
+using TransformFunc = std::function<void(const uint8_t*, int64_t, uint8_t*)>;
+
+void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,
+                         TransformFunc transform, Datum* out) {
+  if (batch[0].kind() == Datum::ARRAY) {
+    const ArrayData& input = *batch[0].array();
+    ArrayData* out_arr = out->mutable_array();
+    // Reuse offsets from input
+    out_arr->buffers[1] = input.buffers[1];
+    int64_t data_nbytes = input.buffers[2]->size();
+    KERNEL_RETURN_IF_ERROR(ctx, ctx->Allocate(data_nbytes).Value(&out_arr->buffers[2]));
+    transform(input.buffers[2]->data(), data_nbytes, out_arr->buffers[2]->mutable_data());

Review comment:
       Good point. It's not accounted for in the unit tests either, can you open a JIRA issue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#issuecomment-643265306


   https://issues.apache.org/jira/browse/ARROW-9115


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#issuecomment-643441495


   Excellent, I was planning to look at this next week. But this probably saved me quite some time, and gives me some more examples, thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm edited a comment on pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#issuecomment-643461172


   Sounds good. I'll probably implement some more example kernels soon that have to process each value individually and be more efficient than the prior examples (by not copying anything if it isn't necessary, etc). Thinking strip/rstrip/lstrip


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#issuecomment-643263350


   cc @maartenbreddels @pitrou 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7418:
URL: https://github.com/apache/arrow/pull/7418


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#issuecomment-643273901


   Before:
   ```
   AsciiLower   76218768 ns     76206752 ns           28 bytes_per_second=207.921M/s items_per_second=13.7596M/s
   AsciiUpper   83254436 ns     83232143 ns           26 bytes_per_second=190.371M/s items_per_second=12.5982M/s
   ```
   After:
   ```
   AsciiLower    4512754 ns      4510290 ns          483 bytes_per_second=3.43073G/s items_per_second=232.485M/s
   AsciiUpper    4536864 ns      4534349 ns          462 bytes_per_second=3.41253G/s items_per_second=231.252M/s
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439765677



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -37,26 +39,108 @@ struct AsciiLength {
   }
 };
 
-struct AsciiUpper {
-  // XXX: the Scalar codegen path passes template arguments that are unused
-  template <typename... Ignored>
-  static std::string Call(KernelContext*, const util::string_view& val) {
-    std::string result = val.to_string();
-    std::transform(result.begin(), result.end(), result.begin(),
-                   [](unsigned char c) { return std::toupper(c); });
-    return result;
+using TransformFunc = std::function<void(const uint8_t*, int64_t, uint8_t*)>;
+
+void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,
+                         TransformFunc transform, Datum* out) {
+  if (batch[0].kind() == Datum::ARRAY) {
+    const ArrayData& input = *batch[0].array();
+    ArrayData* out_arr = out->mutable_array();
+    // Reuse offsets from input
+    out_arr->buffers[1] = input.buffers[1];
+    int64_t data_nbytes = input.buffers[2]->size();
+    KERNEL_RETURN_IF_ERROR(ctx, ctx->Allocate(data_nbytes).Value(&out_arr->buffers[2]));
+    transform(input.buffers[2]->data(), data_nbytes, out_arr->buffers[2]->mutable_data());

Review comment:
       https://issues.apache.org/jira/browse/ARROW-9122




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439415981



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -37,26 +37,108 @@ struct AsciiLength {
   }
 };
 
-struct AsciiUpper {
-  // XXX: the Scalar codegen path passes template arguments that are unused
-  template <typename... Ignored>
-  static std::string Call(KernelContext*, const util::string_view& val) {
-    std::string result = val.to_string();
-    std::transform(result.begin(), result.end(), result.begin(),
-                   [](unsigned char c) { return std::toupper(c); });
-    return result;
+using TransformFunc = std::function<void(const uint8_t*, int64_t, uint8_t*)>;
+
+void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,

Review comment:
       Most everything should be under the anonymous namespace in this file.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439418114



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -37,26 +37,108 @@ struct AsciiLength {
   }
 };
 
-struct AsciiUpper {
-  // XXX: the Scalar codegen path passes template arguments that are unused
-  template <typename... Ignored>
-  static std::string Call(KernelContext*, const util::string_view& val) {
-    std::string result = val.to_string();
-    std::transform(result.begin(), result.end(), result.begin(),
-                   [](unsigned char c) { return std::toupper(c); });
-    return result;
+using TransformFunc = std::function<void(const uint8_t*, int64_t, uint8_t*)>;
+
+void StringDataTransform(KernelContext* ctx, const ExecBatch& batch,

Review comment:
       OK

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/benchmark_util.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+
+namespace arrow {
+namespace compute {
+
+constexpr auto kSeed = 0x94378165;
+
+static void UnaryStringBenchmark(benchmark::State& state, const std::string& func_name) {
+  const int64_t array_length = 1 << 20;
+  const int64_t value_min_size = 0;
+  const int64_t value_max_size = 32;
+  const double null_probability = 0.01;
+  random::RandomArrayGenerator rng(kSeed);
+
+  auto values =
+      rng.String(array_length, value_min_size, value_max_size, null_probability);
+  for (auto _ : state) {
+    ABORT_NOT_OK(CallFunction(func_name, {values}));
+  }
+  state.SetItemsProcessed(state.iterations() * array_length);

Review comment:
       Yes, will add




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#issuecomment-643461172


   Sounds good. I'll probably implement some more example kernels soon that have to process each value individually and be more efficient than the prior to examples (by not copying anything if it isn't necessary, etc). Thinking strip/rstrip/lstrip


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7418: ARROW-9115: [C++] Implementation of ascii_lower/ascii_upper by processing input data buffers in batch

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7418:
URL: https://github.com/apache/arrow/pull/7418#discussion_r439416979



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "benchmark/benchmark.h"
+
+#include "arrow/compute/api_scalar.h"
+#include "arrow/compute/benchmark_util.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+
+namespace arrow {
+namespace compute {
+
+constexpr auto kSeed = 0x94378165;
+
+static void UnaryStringBenchmark(benchmark::State& state, const std::string& func_name) {
+  const int64_t array_length = 1 << 20;
+  const int64_t value_min_size = 0;
+  const int64_t value_max_size = 32;
+  const double null_probability = 0.01;
+  random::RandomArrayGenerator rng(kSeed);
+
+  auto values =
+      rng.String(array_length, value_min_size, value_max_size, null_probability);
+  for (auto _ : state) {
+    ABORT_NOT_OK(CallFunction(func_name, {values}));
+  }
+  state.SetItemsProcessed(state.iterations() * array_length);

Review comment:
       It would be nice to also add a `SetBytesProcessed` based on the actual data length, btw. It's easy to mistake the "items per second" result for a bytes-per-second number.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org