You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/02 21:04:15 UTC

[GitHub] [arrow] Christian8491 opened a new pull request #10855: ARROW-12946: [C++] String swap case kernel

Christian8491 opened a new pull request #10855:
URL: https://github.com/apache/arrow/pull/10855


   This PR adds `swapcase` compute kernel for string.  It is similar to  `Python str.swapcase()`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891343458


   Also, add Python binding to `swapcase`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-893412426


   Some tests invoke the incorrect kernel (ASCII test uses `utf8_capitalize` and/or viceversa). These "typos" are fixed in https://github.com/apache/arrow/pull/10869 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681629459



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -395,6 +395,12 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
                    "[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
 }
 
+TYPED_TEST(TestStringKernels, AsciiSwapCase) {
+  this->CheckUnary("ascii_swapcase", "[]", this->type(), "[]");
+  this->CheckUnary("ascii_swapcase", "[\"aAazZæÆ&\", null, \"\", \"BbB\"]", this->type(),
+                   "[\"AaAZzæÆ&\", null, \"\", \"bBb\"]");

Review comment:
       Can you add characters that don't have a casing here?
   

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -493,6 +499,23 @@ TYPED_TEST(TestStringKernels, Utf8Lower) {
                                   CallFunction("utf8_lower", {invalid_input}));
 }
 
+TYPED_TEST(TestStringKernels, Utf8SwapCase) {
+  this->CheckUnary("utf8_swapcase", "[\"aAazZæÆ&\", null, \"\", \"b\"]", this->type(),
+                   "[\"AaAZzÆæ&\", null, \"\", \"B\"]");

Review comment:
       Same here: add characters that don't have a casing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Christian8491 commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
Christian8491 commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681881936



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -443,6 +488,17 @@ struct AsciiLower {
   }
 };
 
+void TransformAsciiSwapCase(const uint8_t* input, int64_t length, uint8_t* output) {
+  std::transform(input, input + length, output, ascii_swapcase);
+}
+
+template <typename Type>
+struct AsciiSwapCase {
+  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return StringDataTransform<Type>(ctx, batch, TransformAsciiSwapCase, out);
+  }
+};
+

Review comment:
       Addressed in [ba66c91](https://github.com/apache/arrow/pull/10855/commits/ba66c919e0f4e593fa0b70b193af79ea7ec76fa4)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce removed a comment on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
edponce removed a comment on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891343458


   Also, add Python binding to `swapcase`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10855:
URL: https://github.com/apache/arrow/pull/10855


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm edited a comment on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
lidavidm edited a comment on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892094979


   For CI, it appears that Ubuntu 20.04 provides utf8proc 2.5, but the isupper/islower functions are not provided until 2.6: https://juliastrings.github.io/utf8proc/releases/
   
   Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source due to some other issue, but I believe it's using the system headers anyways looking at the compiler command since the system headers come before the self-built utf8proc ones. 
   
   I think bumping the minimum utf8proc version and fiddling with the build flags so that the utf8proc headers take precedence will be needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891331609


   https://issues.apache.org/jira/browse/ARROW-12946


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892094979


   For CI, it appears that Ubuntu 20.04 provides utf8proc 2.5, but the isupper/islower functions are not provided until 2.6: https://juliastrings.github.io/utf8proc/releases/
   
   Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source, but I believe it's using the system headers looking at the compiler command since the system headers come before the self-built utf8proc ones. 
   
   I think bumping the minimum utf8proc version and fiddling with the build flags so that the utf8proc headers take precedence will be needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10855:
URL: https://github.com/apache/arrow/pull/10855


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Christian8491 commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
Christian8491 commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891981089


   @lidavidm please enable CI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Christian8491 commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
Christian8491 commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681883166



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -395,6 +395,12 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
                    "[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
 }
 
+TYPED_TEST(TestStringKernels, AsciiSwapCase) {
+  this->CheckUnary("ascii_swapcase", "[]", this->type(), "[]");
+  this->CheckUnary("ascii_swapcase", "[\"aAazZæÆ&\", null, \"\", \"BbB\"]", this->type(),
+                   "[\"AaAZzæÆ&\", null, \"\", \"bBb\"]");

Review comment:
       Sure. Addressed in [8f65c70](https://github.com/apache/arrow/pull/10855/commits/8f65c700ae86d03d594dcf8957f7f17a583b2a97)

##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -493,6 +499,23 @@ TYPED_TEST(TestStringKernels, Utf8Lower) {
                                   CallFunction("utf8_lower", {invalid_input}));
 }
 
+TYPED_TEST(TestStringKernels, Utf8SwapCase) {
+  this->CheckUnary("utf8_swapcase", "[\"aAazZæÆ&\", null, \"\", \"b\"]", this->type(),
+                   "[\"AaAZzÆæ&\", null, \"\", \"B\"]");

Review comment:
       Also addressed in [8f65c70](https://github.com/apache/arrow/pull/10855/commits/8f65c700ae86d03d594dcf8957f7f17a583b2a97)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892109522


   > Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source due to some other issue, but I believe it's using the system headers anyways looking at the compiler command since the system headers come before the self-built utf8proc ones.
   
   Yeah the logs say 
   
   ```
   -- Could NOT find utf8proc (missing: utf8proc_LIB) (found suitable version "2.4.0", minimum required is "2.2.0")
   ```
   
   which doesn't make sense. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Christian8491 commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
Christian8491 commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892298100


   As @edponce suggested, I replaced some `utf8proc` functions with helper functions. After all the CI passes. This PR is ready for review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891341486


   LGTM. Need to add implementation for UTF8.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892102908


   An alternative solution is to use helper functions already available, refer to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1391-L1408
   
   Related notes in Arrow w.r.t. to lower/upper casing of UTF8 can be found in:
   * https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py#L668-L693
   * https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L578-L601


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681279659



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -443,6 +488,17 @@ struct AsciiLower {
   }
 };
 
+void TransformAsciiSwapCase(const uint8_t* input, int64_t length, uint8_t* output) {
+  std::transform(input, input + length, output, ascii_swapcase);
+}
+
+template <typename Type>
+struct AsciiSwapCase {
+  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    return StringDataTransform<Type>(ctx, batch, TransformAsciiSwapCase, out);
+  }
+};
+

Review comment:
       Add `Utf8SwapCase`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] edponce commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel

Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r682002907



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -318,6 +373,26 @@ struct UTF8LowerTransform : public CaseMappingTransform {
 template <typename Type>
 using UTF8Lower = StringTransformExec<Type, StringTransformCodepoint<UTF8LowerTransform>>;
 
+struct UTF8SwapCaseTransform : public CaseMappingTransform {
+  static uint32_t TransformCodepoint(uint32_t codepoint) {
+    if (codepoint <= kMaxCodepointLookup) {
+      return lut_swapcase_codepoint[codepoint];
+    } else {
+      if (utf8proc_islower(codepoint)) {
+        return utf8proc_toupper(codepoint);
+      } else if (utf8proc_isupper(codepoint)) {
+        return utf8proc_tolower(codepoint);
+      } else {
+        return codepoint;
+      }

Review comment:
       The function does not ends with a `return` statement. Move the last else-case `return codepoint` as the last statement of the function.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org