You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/14 13:54:38 UTC

[GitHub] [arrow] pitrou opened a new pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

pitrou opened a new pull request #7755:
URL: https://github.com/apache/arrow/pull/7755


   Modified function names:
   * minmax -> min_max
   * binary_isascii -> string_isascii (only works on string types)
   * ascii_length -> binary_length (also make it work on binary types)
   * binary_contains_exact -> match_substring
     (other possibility: has_substring ?)
   * match -> index_in
   * isin -> is_in
   * list_value_lengths -> list_value_length
   * partition_indices -> partition_nth_indices
     (other kinds of partitioning would be possible, e.g. using a predicate)
   
   Document string predicate functions (ARROW-9444).
   
   Also fix the allocation of IsValid output buffer in certain cases.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658218464


   Does this conflict with https://github.com/apache/arrow/commit/1d7d91909aa472fc54e4a8dd0bf6c3b2d5ed7e70 which I just merged?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] xhochy commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
xhochy commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454462566



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -385,35 +371,34 @@ void TransformBinaryContainsExact(const uint8_t* pattern, int64_t pattern_length
   bitmap_writer.Finish();
 }
 
-using BinaryContainsExactState = OptionsWrapper<BinaryContainsExactOptions>;
+using MatchSubstringState = OptionsWrapper<MatchSubstringOptions>;
 
 template <typename Type>
-struct BinaryContainsExact {
+struct MatchSubstring {
   using offset_type = typename Type::offset_type;
   static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
-    BinaryContainsExactOptions arg = BinaryContainsExactState::Get(ctx);
+    MatchSubstringOptions arg = MatchSubstringState::Get(ctx);
     const uint8_t* pat = reinterpret_cast<const uint8_t*>(arg.pattern.c_str());
     const int64_t pat_size = arg.pattern.length();
     StringBoolTransform<Type>(
         ctx, batch,
         [pat, pat_size](const void* offsets, const uint8_t* data, int64_t length,
                         int64_t output_offset, uint8_t* output) {
-          TransformBinaryContainsExact<offset_type>(
+          TransformMatchSubstring<offset_type>(
               pat, pat_size, reinterpret_cast<const offset_type*>(offsets), data, length,
               output_offset, output);
         },
         out);
   }
 };
 
-void AddBinaryContainsExact(FunctionRegistry* registry) {
-  auto func = std::make_shared<ScalarFunction>("binary_contains_exact", Arity::Unary());
-  auto exec_32 = BinaryContainsExact<StringType>::Exec;
-  auto exec_64 = BinaryContainsExact<LargeStringType>::Exec;
+void AddMatchSubstring(FunctionRegistry* registry) {

Review comment:
       Yes, this is fine with me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658207408


   Rebased and fixed CI (hopefully).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658195011


   Also cc @xhochy 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7755:
URL: https://github.com/apache/arrow/pull/7755


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658211847


   Well, "substring" clearly implies "exact". As for case-sensitivity, I hope that can be a flag in the options, because I don't think we want a combinatorial explosion of string matching functions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658217270


   https://issues.apache.org/jira/browse/ARROW-9390


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454549799



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -944,7 +944,7 @@ void RegisterScalarStringAscii(FunctionRegistry* registry) {
   MakeUnaryStringBatchKernel<AsciiUpper>("ascii_upper", registry);
   MakeUnaryStringBatchKernel<AsciiLower>("ascii_lower", registry);
 
-  AddUnaryStringPredicate<IsAscii>("binary_isascii", registry);
+  AddUnaryStringPredicate<IsAscii>("string_isascii", registry);

Review comment:
       That's a good point. Should we make a followup PR?
   (if it's me, it's only tomorrow)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658265063


   I think this can be merged. Any concerns?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454553258



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -944,7 +944,7 @@ void RegisterScalarStringAscii(FunctionRegistry* registry) {
   MakeUnaryStringBatchKernel<AsciiUpper>("ascii_upper", registry);
   MakeUnaryStringBatchKernel<AsciiLower>("ascii_lower", registry);
 
-  AddUnaryStringPredicate<IsAscii>("binary_isascii", registry);
+  AddUnaryStringPredicate<IsAscii>("string_isascii", registry);

Review comment:
       I can make this change today unless someone else wants to




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658210968


   > binary_contains_exact -> match_substring (other possibility: has_substring ?)
   
   On the PR (https://github.com/apache/arrow/pull/7593#issuecomment-652195053), @xhochy mentioned that the "exact" part was important to him


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454550954



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##########
@@ -61,9 +61,9 @@ static void IsAlphaNumericAscii(benchmark::State& state) {
   UnaryStringBenchmark(state, "ascii_isalnum");
 }
 
-static void BinaryContainsExact(benchmark::State& state) {
-  BinaryContainsExactOptions options("abac");
-  UnaryStringBenchmark(state, "binary_contains_exact", &options);
+static void MatchSubstring(benchmark::State& state) {
+  MatchSubstringOptions options("abac");
+  UnaryStringBenchmark(state, "match_substring", &options);

Review comment:
       I don't think so. Personally, I find prefixes a bit pointless or even distracting.
   (or, here, plain misleading, since "binary_contains_exact" didn't work on binary arrays, only string arrays...)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658197145


   Valgrind works fine on this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658195011


   Also cc @xhochy @nealrichardson 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658264910


   AppVeyor build: https://ci.appveyor.com/project/pitrou/arrow/builds/34089592
   Github Actions: https://github.com/pitrou/arrow/runs/869791845


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454394631



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -385,35 +371,34 @@ void TransformBinaryContainsExact(const uint8_t* pattern, int64_t pattern_length
   bitmap_writer.Finish();
 }
 
-using BinaryContainsExactState = OptionsWrapper<BinaryContainsExactOptions>;
+using MatchSubstringState = OptionsWrapper<MatchSubstringOptions>;
 
 template <typename Type>
-struct BinaryContainsExact {
+struct MatchSubstring {
   using offset_type = typename Type::offset_type;
   static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
-    BinaryContainsExactOptions arg = BinaryContainsExactState::Get(ctx);
+    MatchSubstringOptions arg = MatchSubstringState::Get(ctx);
     const uint8_t* pat = reinterpret_cast<const uint8_t*>(arg.pattern.c_str());
     const int64_t pat_size = arg.pattern.length();
     StringBoolTransform<Type>(
         ctx, batch,
         [pat, pat_size](const void* offsets, const uint8_t* data, int64_t length,
                         int64_t output_offset, uint8_t* output) {
-          TransformBinaryContainsExact<offset_type>(
+          TransformMatchSubstring<offset_type>(
               pat, pat_size, reinterpret_cast<const offset_type*>(offsets), data, length,
               output_offset, output);
         },
         out);
   }
 };
 
-void AddBinaryContainsExact(FunctionRegistry* registry) {
-  auto func = std::make_shared<ScalarFunction>("binary_contains_exact", Arity::Unary());
-  auto exec_32 = BinaryContainsExact<StringType>::Exec;
-  auto exec_64 = BinaryContainsExact<LargeStringType>::Exec;
+void AddMatchSubstring(FunctionRegistry* registry) {

Review comment:
       I think this is fine, we can use `match_substring_case_insensitive` for the case insensitive version
   
   cc @xhochy 

##########
File path: cpp/src/arrow/compute/kernels/test_util.cc
##########
@@ -44,6 +46,7 @@ void CheckScalarUnary(std::string func_name, std::shared_ptr<Array> input,
   }
 
   if (auto length = input->length() / 3) {
+    // XXX Is the recursive call intended?

Review comment:
       Nope, should fix this

##########
File path: cpp/src/arrow/compute/kernels/scalar_validity.cc
##########
@@ -37,23 +37,22 @@ struct IsValidOperator {
   static void Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) {
     DCHECK_EQ(out->offset, 0);
     DCHECK_LE(out->length, arr.length);
-    if (arr.buffers[0] != nullptr) {
-      out->buffers[1] = arr.offset == 0
-                            ? arr.buffers[0]
-                            : SliceBuffer(arr.buffers[0], arr.offset / 8, arr.length / 8);
+    if (arr.null_count != 0 && arr.buffers[0] != nullptr) {

Review comment:
       Note: I added a helper function `ArrayData::MayHaveNulls` to do this since it's such a common pattern

##########
File path: cpp/src/arrow/compute/kernels/vector_nested_test.cc
##########
@@ -30,7 +30,9 @@ TEST(TestVectorNested, ListFlatten) {
     auto input = ArrayFromJSON(ty, "[[0, null, 1], null, [2, 3], []]");
     auto expected = ArrayFromJSON(int32(), "[0, null, 1, 2, 3]");
     ASSERT_OK_AND_ASSIGN(Datum out, CallFunction("list_flatten", {input}));
-    AssertArraysEqual(*expected, *out.make_array());
+    std::shared_ptr<Array> actual = std::move(out).make_array();
+    ASSERT_OK(actual->ValidateFull());
+    AssertArraysEqual(*expected, *actual);

Review comment:
       Seems like this should be extracted into a helper in test_util.h




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454553900



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -944,7 +944,7 @@ void RegisterScalarStringAscii(FunctionRegistry* registry) {
   MakeUnaryStringBatchKernel<AsciiUpper>("ascii_upper", registry);
   MakeUnaryStringBatchKernel<AsciiLower>("ascii_lower", registry);
 
-  AddUnaryStringPredicate<IsAscii>("binary_isascii", registry);
+  AddUnaryStringPredicate<IsAscii>("string_isascii", registry);

Review comment:
       Also `ascii_is_alnum`, etc. ;-)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454551146



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##########
@@ -61,9 +61,9 @@ static void IsAlphaNumericAscii(benchmark::State& state) {
   UnaryStringBenchmark(state, "ascii_isalnum");
 }
 
-static void BinaryContainsExact(benchmark::State& state) {
-  BinaryContainsExactOptions options("abac");
-  UnaryStringBenchmark(state, "binary_contains_exact", &options);
+static void MatchSubstring(benchmark::State& state) {
+  MatchSubstringOptions options("abac");
+  UnaryStringBenchmark(state, "match_substring", &options);

Review comment:
       Perhaps someone else wants to think about a convention?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658212624


   > match -> index_in
   
   +1, that's a much clearer name IMO (although the R people might disagree ;-))


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658265739


   Nope. +1 here


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#issuecomment-658230513


   No, it's rebased already.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #7755: ARROW-9390: [C++][Doc] Review compute function names

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7755:
URL: https://github.com/apache/arrow/pull/7755#discussion_r454506911



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##########
@@ -61,9 +61,9 @@ static void IsAlphaNumericAscii(benchmark::State& state) {
   UnaryStringBenchmark(state, "ascii_isalnum");
 }
 
-static void BinaryContainsExact(benchmark::State& state) {
-  BinaryContainsExactOptions options("abac");
-  UnaryStringBenchmark(state, "binary_contains_exact", &options);
+static void MatchSubstring(benchmark::State& state) {
+  MatchSubstringOptions options("abac");
+  UnaryStringBenchmark(state, "match_substring", &options);

Review comment:
       Are there clear criteria for when the function gets a `binary_` or `string_` prefix and when it doesn't?
   
   Related, if we haven't documented the naming conventions (haven't seen it yet but I'll keep reading), we should so that future developers/selves know what to call new functions.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -944,7 +944,7 @@ void RegisterScalarStringAscii(FunctionRegistry* registry) {
   MakeUnaryStringBatchKernel<AsciiUpper>("ascii_upper", registry);
   MakeUnaryStringBatchKernel<AsciiLower>("ascii_lower", registry);
 
-  AddUnaryStringPredicate<IsAscii>("binary_isascii", registry);
+  AddUnaryStringPredicate<IsAscii>("string_isascii", registry);

Review comment:
       Why isn't this `string_is_ascii`? It seems we prefer `is_x` to `isx` (cf. `isin` -> `is_in` in this PR).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org