You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/02 21:04:15 UTC
[GitHub] [arrow] Christian8491 opened a new pull request #10855: ARROW-12946: [C++] String swap case kernel
Christian8491 opened a new pull request #10855:
URL: https://github.com/apache/arrow/pull/10855
This PR adds `swapcase` compute kernel for string. It is similar to `Python str.swapcase()`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891343458
Also, add Python binding to `swapcase`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-893412426
Some tests invoke the incorrect kernel (ASCII test uses `utf8_capitalize` and/or viceversa). These "typos" are fixed in https://github.com/apache/arrow/pull/10869
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681629459
##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -395,6 +395,12 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
"[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
}
+TYPED_TEST(TestStringKernels, AsciiSwapCase) {
+ this->CheckUnary("ascii_swapcase", "[]", this->type(), "[]");
+ this->CheckUnary("ascii_swapcase", "[\"aAazZæÆ&\", null, \"\", \"BbB\"]", this->type(),
+ "[\"AaAZzæÆ&\", null, \"\", \"bBb\"]");
Review comment:
Can you add characters that don't have a casing here?
##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -493,6 +499,23 @@ TYPED_TEST(TestStringKernels, Utf8Lower) {
CallFunction("utf8_lower", {invalid_input}));
}
+TYPED_TEST(TestStringKernels, Utf8SwapCase) {
+ this->CheckUnary("utf8_swapcase", "[\"aAazZæÆ&\", null, \"\", \"b\"]", this->type(),
+ "[\"AaAZzÆæ&\", null, \"\", \"B\"]");
Review comment:
Same here: add characters that don't have a casing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Christian8491 commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
Christian8491 commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681881936
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -443,6 +488,17 @@ struct AsciiLower {
}
};
+void TransformAsciiSwapCase(const uint8_t* input, int64_t length, uint8_t* output) {
+ std::transform(input, input + length, output, ascii_swapcase);
+}
+
+template <typename Type>
+struct AsciiSwapCase {
+ static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+ return StringDataTransform<Type>(ctx, batch, TransformAsciiSwapCase, out);
+ }
+};
+
Review comment:
Addressed in [ba66c91](https://github.com/apache/arrow/pull/10855/commits/ba66c919e0f4e593fa0b70b193af79ea7ec76fa4)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] edponce removed a comment on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
edponce removed a comment on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891343458
Also, add Python binding to `swapcase`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10855:
URL: https://github.com/apache/arrow/pull/10855
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm edited a comment on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
lidavidm edited a comment on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892094979
For CI, it appears that Ubuntu 20.04 provides utf8proc 2.5, but the isupper/islower functions are not provided until 2.6: https://juliastrings.github.io/utf8proc/releases/
Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source due to some other issue, but I believe it's using the system headers anyways looking at the compiler command since the system headers come before the self-built utf8proc ones.
I think bumping the minimum utf8proc version and fiddling with the build flags so that the utf8proc headers take precedence will be needed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891331609
https://issues.apache.org/jira/browse/ARROW-12946
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892094979
For CI, it appears that Ubuntu 20.04 provides utf8proc 2.5, but the isupper/islower functions are not provided until 2.6: https://juliastrings.github.io/utf8proc/releases/
Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source, but I believe it's using the system headers looking at the compiler command since the system headers come before the self-built utf8proc ones.
I think bumping the minimum utf8proc version and fiddling with the build flags so that the utf8proc headers take precedence will be needed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10855:
URL: https://github.com/apache/arrow/pull/10855
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Christian8491 commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
Christian8491 commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891981089
@lidavidm please enable CI
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Christian8491 commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
Christian8491 commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681883166
##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -395,6 +395,12 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
"[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
}
+TYPED_TEST(TestStringKernels, AsciiSwapCase) {
+ this->CheckUnary("ascii_swapcase", "[]", this->type(), "[]");
+ this->CheckUnary("ascii_swapcase", "[\"aAazZæÆ&\", null, \"\", \"BbB\"]", this->type(),
+ "[\"AaAZzæÆ&\", null, \"\", \"bBb\"]");
Review comment:
Sure. Addressed in [8f65c70](https://github.com/apache/arrow/pull/10855/commits/8f65c700ae86d03d594dcf8957f7f17a583b2a97)
##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -493,6 +499,23 @@ TYPED_TEST(TestStringKernels, Utf8Lower) {
CallFunction("utf8_lower", {invalid_input}));
}
+TYPED_TEST(TestStringKernels, Utf8SwapCase) {
+ this->CheckUnary("utf8_swapcase", "[\"aAazZæÆ&\", null, \"\", \"b\"]", this->type(),
+ "[\"AaAZzÆæ&\", null, \"\", \"B\"]");
Review comment:
Also addressed in [8f65c70](https://github.com/apache/arrow/pull/10855/commits/8f65c700ae86d03d594dcf8957f7f17a583b2a97)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892109522
> Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source due to some other issue, but I believe it's using the system headers anyways looking at the compiler command since the system headers come before the self-built utf8proc ones.
Yeah the logs say
```
-- Could NOT find utf8proc (missing: utf8proc_LIB) (found suitable version "2.4.0", minimum required is "2.2.0")
```
which doesn't make sense.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Christian8491 commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
Christian8491 commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892298100
As @edponce suggested, I replaced some `utf8proc` functions with helper functions. After all the CI passes. This PR is ready for review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-891341486
LGTM. Need to add implementation for UTF8.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] edponce commented on pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
edponce commented on pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#issuecomment-892102908
An alternative solution is to use helper functions already available, refer to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1391-L1408
Related notes in Arrow w.r.t. to lower/upper casing of UTF8 can be found in:
* https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py#L668-L693
* https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L578-L601
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] edponce commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r681279659
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -443,6 +488,17 @@ struct AsciiLower {
}
};
+void TransformAsciiSwapCase(const uint8_t* input, int64_t length, uint8_t* output) {
+ std::transform(input, input + length, output, ascii_swapcase);
+}
+
+template <typename Type>
+struct AsciiSwapCase {
+ static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+ return StringDataTransform<Type>(ctx, batch, TransformAsciiSwapCase, out);
+ }
+};
+
Review comment:
Add `Utf8SwapCase`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] edponce commented on a change in pull request #10855: ARROW-12946: [C++] String swap case kernel
Posted by GitBox <gi...@apache.org>.
edponce commented on a change in pull request #10855:
URL: https://github.com/apache/arrow/pull/10855#discussion_r682002907
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -318,6 +373,26 @@ struct UTF8LowerTransform : public CaseMappingTransform {
template <typename Type>
using UTF8Lower = StringTransformExec<Type, StringTransformCodepoint<UTF8LowerTransform>>;
+struct UTF8SwapCaseTransform : public CaseMappingTransform {
+ static uint32_t TransformCodepoint(uint32_t codepoint) {
+ if (codepoint <= kMaxCodepointLookup) {
+ return lut_swapcase_codepoint[codepoint];
+ } else {
+ if (utf8proc_islower(codepoint)) {
+ return utf8proc_toupper(codepoint);
+ } else if (utf8proc_isupper(codepoint)) {
+ return utf8proc_tolower(codepoint);
+ } else {
+ return codepoint;
+ }
Review comment:
The function does not ends with a `return` statement. Move the last else-case `return codepoint` as the last statement of the function.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org