Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/14 14:56:00 UTC

[jira] [Commented] (ARROW-9991) [C++] split kernels for strings/binary

    [ https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195508#comment-17195508 ] 

Joris Van den Bossche commented on ARROW-9991:
----------------------------------------------

I suppose "whitespace" here means more than splitting on a single " "? (That is, also runs of multiple spaces, different kinds of newlines, tabs, etc.) In that case, a separate specialized kernel does indeed seem best.
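
To make the distinction concrete, here is a minimal sketch using only Python's built-in str.split (not the proposed Arrow kernels); the sample string is made up for illustration:

```python
# Sample input mixing a double space, a tab, and a Windows-style newline.
text = "a  b\tc\r\nd"

# Explicit single-space separator: consecutive spaces yield empty strings,
# and tabs/newlines are not treated as separators.
by_space = text.split(" ")

# No separator: any run of whitespace acts as a single split point.
by_whitespace = text.split()

print(by_space)       # ['a', '', 'b\tc\r\nd']
print(by_whitespace)  # ['a', 'b', 'c', 'd']
```

The second behaviour needs to recognise a whole class of characters and collapse runs of them, which is why it looks like a different algorithm from a fixed-separator split.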

> [C++] split kernels for strings/binary
> --------------------------------------
>
>                 Key: ARROW-9991
>                 URL: https://issues.apache.org/jira/browse/ARROW-9991
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Maarten Breddels
>            Assignee: Maarten Breddels
>            Priority: Major
>
> Similar to Python str.split and bytes.split, we'd like to have a way to convert str into list[str] (and similarly for bytes).
> When the separator is given, the algorithms for both types are the same. Python, however, overloads split: when no separator is given, it splits on all whitespace (Unicode for str, ASCII for bytes).
> I'd rather not have heavily overloaded kernels, and would instead suggest e.g.:
> * binary_split (takes a string/binary separator and a maxsplit argument; no special utf8 version needed)
> * utf8_split_whitespace (similar to Python's str.split when no separator is given)
> * ascii_split_whitespace (similar to Python's bytes.split when no separator is given, considering only ASCII whitespace, although this could work on any binary data)
> There can also be rsplit versions of these, or rsplit could be an argument.
>  
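
For reference, the Python behaviours that the description refers to can be sketched as follows. This uses only Python built-ins; how the kernel names in the issue would map onto these behaviours is an assumption for illustration, not the Arrow API.

```python
# U+2003 (EM SPACE) is whitespace for str, but its UTF-8 encoding
# (b"\xe2\x80\x83") contains no ASCII whitespace bytes.
s = "a\u2003b c"
b = s.encode("utf-8")

# Unicode-aware whitespace split (what utf8_split_whitespace would mirror).
str_parts = s.split()

# ASCII-only whitespace split (what ascii_split_whitespace would mirror).
bytes_parts = b.split()

# With an explicit separator, str and bytes use the same algorithm;
# maxsplit limits the number of splits, and rsplit starts from the right.
left = "a,b,c".split(",", 1)
right = "a,b,c".rsplit(",", 1)
left_bytes = b"a,b,c".split(b",", 1)

print(str_parts)    # ['a', 'b', 'c']
print(bytes_parts)  # [b'a\xe2\x80\x83b', b'c']
print(left, right)  # ['a', 'b,c'] ['a,b', 'c']
print(left_bytes)   # [b'a', b'b,c']
```

Only the no-separator case differs between str and bytes, which matches the suggestion of one separator-based kernel plus two whitespace-specific ones.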


