Posted to jira@arrow.apache.org by "Nic Crane (Jira)" <ji...@apache.org> on 2021/07/08 09:23:00 UTC

[jira] [Comment Edited] (ARROW-13259) [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths

    [ https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377248#comment-17377248 ] 

Nic Crane edited comment on ARROW-13259 at 7/8/21, 9:22 AM:
------------------------------------------------------------

[~edponce] Thanks for highlighting that, I'd totally missed it!

[~pachamaltese] - I missed this in my initial review of the code, but what actually needs changing is the bindings in `compute.cpp`. There, start and stop have been set to 1 and -1 respectively, but they should instead reflect the default values from here: [https://github.com/apache/arrow/blob/7eea2f53a1002552bbb87db5611e75c15b88b504/cpp/src/arrow/compute/api_scalar.h#L203-L210]

I think that the `step` argument needs implementing too.

We really should write this up (I can add it to my to-do list!) as it's neither obvious nor trivial to work out the various steps required here.
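
As a rough illustration of the index translation such a binding might need, here is a sketch in Python, chosen only because its slicing is a convenient stand-in for a start-inclusive, stop-exclusive slice. The helper names `str_sub_to_slice` and `str_sub` are hypothetical, and using the largest int64 as the "no stop" default is an assumption about the defaults in the linked header that should be checked there:

```python
# Assumed "slice to the end" default for `stop` (largest signed 64-bit integer).
INT64_MAX = 2**63 - 1


def str_sub_to_slice(start, end):
    """Map stringr-style (start, end), 1-based and inclusive,
    to (start, stop), 0-based with an exclusive stop."""
    # Positive starts shift down by one; zero and negative starts are kept.
    slice_start = start - 1 if start > 0 else start
    if end >= 0:
        # A 1-based inclusive end equals a 0-based exclusive stop.
        slice_stop = end
    elif end == -1:
        # "Up to the last character": -1 + 1 would be 0, so use the sentinel.
        slice_stop = INT64_MAX
    else:
        # Other negative ends: add 1 to make the stop exclusive.
        slice_stop = end + 1
    return slice_start, slice_stop


def str_sub(s, start, end):
    # Python slicing stands in for the kernel here (step fixed at 1).
    slice_start, slice_stop = str_sub_to_slice(start, end)
    return s[slice_start:slice_stop]


print(str_sub("Apache Arrow", -5, -2))  # Arro
print(str_sub("Apache Arrow", -5, -1))  # Arrow
```

The point of the sentinel branch is that an `end` of -1 never turns into a stop of 0, which is the case that produces an empty result.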

 



> [C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths 
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13259
>                 URL: https://issues.apache.org/jira/browse/ARROW-13259
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nic Crane
>            Priority: Major
>
> We're currently trying to write bindings from the C++ function "utf8_slice_codeunits" to R, specifically trying to replicate the behaviour of stringr::str_sub.
> In both the R and C++ implementations, I can use negative indices to count back from the end of a string (shown below in R, but the second call directly invokes the C++ implementation):
>  
> {code:java}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -2)
> [1] "Arro"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=-1L))
> Scalar
> Arro{code}
> Note that in the C++ implementation, I have to add 1 to the stop value because the stop index is exclusive.
> The problem arises when I try to use negative indices to include the final character of a string:
>  
> {code:java}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -1)
> [1] "Arrow"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=0L))
> Scalar
> {code}
> The result is blank because the 'stop' value 0 refers to the start of the string, so the slice would effectively have to walk backwards, which isn't possible (except via the step argument, which I can't get working and which I don't think is what I want anyway).
> I've tried to get around this by writing some code that calculates the length of the string and supplies it to the stop argument, but it didn't work.
> I do have a possible workaround that involves reversing the string, extracting the substring with the start and stop values swapped and negated, and then reversing the result. Before I go down that path, I was wondering whether anything can (and should! the answer may be a simple "nope!") be changed in the C++ code to make this possible in a different way?
>  
>  
>  
>  
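
The empty result and the reversal workaround described above can be modelled in plain Python, on the assumption that "utf8_slice_codeunits" follows Python-style slicing (start inclusive, stop exclusive, negative values counted from the end). The function names are hypothetical and nothing here calls Arrow itself:

```python
def slice_like_kernel(s, start, stop):
    # Stand-in for the kernel: plain Python slicing with step fixed at 1.
    return s[start:stop]


def slice_via_reverse(s, start, stop):
    # Workaround from the description, intended for start < 0 and stop <= 0:
    # reverse the string, slice with start/stop swapped and negated,
    # then reverse the result back.
    reversed_s = s[::-1]
    return reversed_s[-stop:-start][::-1]


text = "Apache Arrow"
print(repr(slice_like_kernel(text, -5, -1)))  # 'Arro'
print(repr(slice_like_kernel(text, -5, 0)))   # ''
print(repr(slice_via_reverse(text, -5, 0)))   # 'Arrow'
```

A stop of 0 resolves to the position before the first character, which lies to the left of the start, so a forward slice has nothing to return. The reversed version sidesteps this because the swapped and negated bounds (0 and 5) are both non-negative and in increasing order.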


