You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/05 14:59:58 UTC

[GitHub] [arrow] maartenbreddels opened a new pull request #7357: ARROW-555: [C++] String algorithms

maartenbreddels opened a new pull request #7357:
URL: https://github.com/apache/arrow/pull/7357


   First PR to test the workflow and CI.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7357: ARROW-555: [C++] String algorithms

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#issuecomment-639553742


   https://issues.apache.org/jira/browse/ARROW-555


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] edited a comment on pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] edited a comment on pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#issuecomment-639553742


   https://issues.apache.org/jira/browse/ARROW-9100


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438827531



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();
+    std::transform(result.begin(), result.end(), result.begin(),
+                   [](unsigned char c) { return std::tolower(c); });

Review comment:
       Note that I used `std::toupper` _just_ for illustration purposes, not because I thought it was the correct way to implement the kernel. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438718661



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       For `ascii_upper`, the strategy should be simple:
   * reuse the same null bitmap buffer
   * recompute a similar offsets buffer, but without the base offset (or reuse the same offsets buffer, if the base offset is 0)
   * allocate a same-size values buffer, and convert it in one pass (no need to iterate over individual array elements)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438718060



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       I think we should start finding a sane strategy right now for the existing `ascii_upper` kernel. Indeed, a different strategy will be needed for utf8 `upper`.
   
   In both cases, this probably means implementing those kernels as "vector" kernels, not "scalar" kernels (because the latter implies you only process one item at a time).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7357:
URL: https://github.com/apache/arrow/pull/7357


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #7357: ARROW-555: [C++] String algorithms

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#issuecomment-642451423


   Few questions about the preferred workflows (didn't see this in the contribs guidelines):
    * should Jira issues and PR's be 1 tot 1, or can I open multiple PR's referencing the same Jira issue
    * 1 commit per PR, or multiple commits per PR if related, but each commit one idea/topic/functionlity?
   
   I have to dive a bit into how the kernels work, but with the current implementation of ascii_lower and upper I see performance issues (we should be able to write into a pre-allocated buffer, since we know the byte length). I'm happy first getting some basic functionality in, and optimize later, it will surface the issues around the framework a bit more I think.
   
   So would you prefer to get in some basic string functions first, and optimize later? In that case, we can rename this PR and merge it, I'll open a few more PRs, and once a few functions are in, we have a baseline for benchmarking and can optimize.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] xhochy commented on pull request #7357: ARROW-555: [C++] String algorithms

Posted by GitBox <gi...@apache.org>.
xhochy commented on pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#issuecomment-642456798


   > * should Jira issues and PR's be 1 tot 1, or can I open multiple PR's referencing the same Jira issue
   
   At least one JIRA pr PR, everything should have an issue number. If there isn't a need for more explanation, a title with correct tagging is enough. It would be helpful to have such JIRAs also to coordinate who is working which task.
   
   > * 1 commit per PR, or multiple commits per PR if related, but each commit one idea/topic/functionlity?
   
   As many commits as it makes sense to review, we will squash on merge. The important detail in the past that was annoying (also can be helpful) is that reviewers will not be notified for changes if you force-push.
   
   
   > I have to dive a bit into how the kernels work, but with the current implementation of ascii_lower and upper I see performance issues (we should be able to write into a pre-allocated buffer, since we know the byte length). I'm happy first getting some basic functionality in, and optimize later, it will surface the issues around the framework a bit more I think.
   
   I'm also quite concerned about this performance penalty. But for e.g. `utf8_upper` (let's just call it `upper`), I expect that we cannot pre-allocate and thus these are the things where we could continue to as-is.
   
   I'm happy to merge this PR (with a new JIRA id), we should though work on writing into pre-allocated memory before writing more kernels that would greatly benefit from pre-allocation. For optimization in the pre-allocated case, this should be a sufficient baseline. I'm more concerned about performance when we start to implement UTF-8 based functionality as we quickly get to the point where we need to look at multiple bytes at a time but don't know ahead whether it's 1, 2, 3, or 4. There we probably should implement kernels with decent performance and one we have a set of 5-10, we can look on how to improve performance there.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438713248



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       Yes, @xhochy and I discussed this above. The question is, do we want a few functions in, and then start iterating on a strategy (for ascii it's trivial, for utf8 less so), or not, I'm happy to explore different strategies first as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438733024



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       Avoiding boilerplate is the main motivation for scalar kernels. It really saves a lot of repetition in kernel implementations.
   
   You can find a simple vector kernel example in `compute/kernels/vector_sort.cc`. Take a look especially at `RegisterVectorSort`, `AddSortingKernels` and `SortIndices`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438723590



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       Agree. Happy to take a look into turning this into a 'vector' kernel, if you have some pointer that would be great. This also answers a question I had if it would be possible to operate on a full array/chunk in the existing framework, I guess that's a yes.
   
   Any reason not to reuse the offset buffer, even if the offset is not 0?
   
   What is the advantage of scalar kernels in Arrow? Apart from avoiding boilerplate? Also wondering if this is related to the Gandiva connection.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438718661



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       For `ascii_upper`, the strategy should be simple:
   * keep the same null bitmap buffer
   * recompute a similar offsets buffer, but without the base offset
   * allocate a same-size values buffer, and convert it in one pass (no need to iterate over individual array elements)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#issuecomment-642474562


   https://issues.apache.org/jira/browse/ARROW-9100


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7357: ARROW-555: [C++] String algorithms

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#issuecomment-642339843


   @maartenbreddels are you planning to do more work in this PR (it wasn't clear since it's marked as Draft)? We are going to need to create many child JIRAs of ARROW-555 so the work can proceed in parallel. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438829269



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       I'm afraid you are mistaking what is meant by "ScalarFunction" and "VectorFunction". "Scalar" and "Vector" refer to the _semantics_ of the function, not the implementation. 
   
   The only requirement for these string functions is that you provide `std::function<void(KernelContext*, const ExecBatch&, Datum*)>` that implements the kernel. What is inside can be anything. The function is **still** a ScalarFunction though




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7357: ARROW-9100: [C++] Add ascii_lower kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#discussion_r438707793



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();
+    std::transform(result.begin(), result.end(), result.begin(),
+                   [](unsigned char c) { return std::tolower(c); });

Review comment:
       Also, please fix `AsciiUpper` similarly.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();
+    std::transform(result.begin(), result.end(), result.begin(),
+                   [](unsigned char c) { return std::tolower(c); });

Review comment:
       Please don't use `std::tolower` as it has locale-dependent behaviour. Instead, just create a hardcoded lookup table.

##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };
 
+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();

Review comment:
       This is completely inefficient. We should write directly into the allocated array. The way this is architected needs rethinking.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #7357: ARROW-555: [C++] String algorithms

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #7357:
URL: https://github.com/apache/arrow/pull/7357#issuecomment-642472363


   > I expect that we cannot pre-allocate and thus these are the things where we could continue to as-is.
   
   Yeah, it's a bit tricky, but there are lots of options that would require some benchmarking I think. In most cases starting with a buffer of the same length should work, a buffer of 4*bufferlength would always work, but might waste memory, or require a realloc. The buffer can also grow dynamically (that's what I do in vaex, it grows by 2x when too small), but different algorithms may need different strategies. For instance, in upper/lower algos, growing by 20% each time the buffer is too small sounds.
   
   Also, making the algorithms SIMD optimizable (by the compiler) is another thing to think about. So I agree, having a few (diverse) functions first would make sense to have in.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org