Posted to jira@arrow.apache.org by "Nic Crane (Jira)" <ji...@apache.org> on 2021/07/05 15:05:00 UTC

[jira] [Created] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

Nic Crane created ARROW-13259:
---------------------------------

             Summary: [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths 
                 Key: ARROW-13259
                 URL: https://issues.apache.org/jira/browse/ARROW-13259
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Nic Crane


We're currently trying to write R bindings for the C++ function "utf8_slice_codeunits", specifically trying to replicate the behaviour of R's stringr::str_sub.

In both the R and C++ implementations, I can use negative indices to count back from the end of a string (shown below in R; the second example directly invokes the C++ implementation):

 
{code:java}
# stringr version
> stringr::str_sub("Apache Arrow", -5, -2)
[1] "Arro"

# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=-1L))
Scalar
Arro{code}
Note that in the C++ implementation, I have to add 1 to the stop value as the final value is non-inclusive.

The problem is when I'm trying to use negative indices to refer to the final values in a string:

 
{code:java}
# stringr version
> stringr::str_sub("Apache Arrow", -5, -1)
[1] "Arrow"

# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=0L))
Scalar
{code}
The result is blank because the 'stop' value 0 refers to the start of the string, so the slice would effectively have to walk backwards, which isn't possible (except via the step argument, which I can't get working and which I don't think is what I want anyway).
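The index arithmetic behind this can be shown in isolation. The following is a minimal sketch in plain Python, not a call into Arrow: it assumes the kernel treats negative indices and the exclusive stop the same way Python's built-in slicing does, which is consistent with the results above. The helper name `to_half_open` is hypothetical.

```python
from typing import Optional, Tuple


def to_half_open(start: int, end: int) -> Tuple[int, Optional[int]]:
    """Translate negative, inclusive (start, end) indices, as used by
    stringr::str_sub, into a half-open (start, stop) pair.

    Hypothetical helper for illustration only; it handles negative indices.
    """
    if start >= 0 or end >= 0:
        raise ValueError("this sketch only handles negative indices")
    stop = end + 1
    # end == -1 gives stop == 0, which means "start of string", not "end".
    # Only an open-ended stop (None here) can express "through the last character".
    return start, (None if stop == 0 else stop)


s = "Apache Arrow"
print(s[-5:-1])       # Arro
print(repr(s[-5:0]))  # '' (stop=0 points at the start of the string)

start, stop = to_half_open(-5, -1)
print(s[start:stop])  # Arrow
```

The only case that cannot be written as an integer stop is end = -1, which is why some way of saying "no stop" (or "stop at the string's own length") would be needed on the C++ side.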

I've tried to get around this by attempting to write some code that calculates the length of each string and supplies that to the stop argument, but it didn't work.

I do have a possible workaround: reverse the string, extract the substring using the start/stop values swapped and negated, and then reverse the result. Before I go down that path, though, I was wondering if there is anything that can (and should! the answer may be a simple "nope!") be changed in the C++ code to make it possible to do this a different way?
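For reference, the reverse-slice-reverse workaround can be sketched as follows. This is plain Python rather than Arrow code, and the function name `slice_via_reverse` is hypothetical; it only illustrates that the swapped and negated indices never need the string's length.

```python
def slice_via_reverse(s: str, start: int, end: int) -> str:
    """Extract s[start..end] (negative, inclusive indices, as in
    stringr::str_sub) by slicing the reversed string instead.

    Hypothetical helper for illustration only.
    """
    if start >= 0 or end >= 0:
        raise ValueError("this sketch only handles negative indices")
    reversed_s = s[::-1]
    # The k-th character from the end of s sits at index k - 1 in reversed_s,
    # so the inclusive range [start, end] becomes the half-open range below.
    new_start = -end - 1
    new_stop = -start
    return reversed_s[new_start:new_stop][::-1]


print(slice_via_reverse("Apache Arrow", -5, -1))  # Arrow
print(slice_via_reverse("Apache Arrow", -5, -2))  # Arro
```

Because both translated indices are non-negative, the same pair works for every string in an array regardless of its length, at the cost of two extra reversal passes.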

--
This message was sent by Atlassian Jira
(v8.3.4#803005)