Posted to jira@arrow.apache.org by "Kshiteej K (Jira)" <ji...@apache.org> on 2022/10/29 08:58:00 UTC
[jira] [Commented] (ARROW-17301) [C++] Implement compute function "binary_slice"
[ https://issues.apache.org/jira/browse/ARROW-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626023#comment-17626023 ]
Kshiteej K commented on ARROW-17301:
------------------------------------
IIUC, the new function will be the binary equivalent of the existing `utf8_slice_codeunits`, which takes `start`, `stop` and `step` and returns slices.
Is my understanding correct?
Ref: [https://arrow.apache.org/docs/cpp/compute.html#string-slicing]
> [C++] Implement compute function "binary_slice"
> -----------------------------------------------
>
> Key: ARROW-17301
> URL: https://issues.apache.org/jira/browse/ARROW-17301
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 8.0.1
> Reporter: ChenTsing
> Assignee: Kshiteej K
> Priority: Major
> Fix For: 11.0.0
>
>
> In some situations a user needs a way to extract one or more contiguous bytes from each element of a binary or string-like array, for example start 1, end 3, step 1, which yields two bytes. This is similar to the {{binary_replace_slice}} function. Slicing would then be available in two units of measurement: bytes and code units.
>
>
> h1. *application case:*
>
> Here is one example that describes why a function to extract binary data in byte units is needed:
>
> In a distributed database, data has a distribution policy and an associated hash algorithm for each data type. Here we only discuss string-like and binary types. The hash algorithm needs to split a string-like or binary value into bytes for its calculation: for example, take bytes 1-4, cast them to an integer and shift left by 16 bits, then take bytes 5-6, cast them to an integer and combine it with the result of the previous step, and so on. The 'utf8_slice_codeunits' function partly meets this requirement if all characters are ASCII. But if the string contains Chinese, one Chinese character may occupy three bytes, so start 1 to end 3 covers three UTF-8 characters,
> which may take nine bytes. That does not match the hash algorithm, which needs only 3 bytes. So a function that slices without a cast, with the same arguments as 'utf8_slice_codeunits', would help; it might be called 'binary_slice_byteunit'.
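The hash step described in the application case can be sketched roughly as follows. This is plain Python, not Arrow code, and several details are assumptions that the description does not specify: the helper name `toy_hash_prefix`, the big-endian byte order, and the use of bitwise OR to combine the two parts.

```python
def toy_hash_prefix(data: bytes) -> int:
    """Hypothetical helper: combine bytes 1-4 and bytes 5-6 of a value.

    Byte order (big-endian) and the combining operation (bitwise OR)
    are assumptions made for this sketch.
    """
    head = int.from_bytes(data[0:4], "big")  # bytes 1-4 as an integer
    tail = int.from_bytes(data[4:6], "big")  # bytes 5-6 as an integer
    return (head << 16) | tail               # shift left 16 bits, then combine


# Two Chinese characters occupy six bytes in UTF-8, so the byte slices above
# cut through the second character, which a code-point slice cannot express.
value = "中文".encode("utf-8")
print(hex(toy_hash_prefix(value)))
```

The point of the sketch is only that both slices are expressed in bytes, so the result depends on the raw encoding and not on character boundaries.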
--
This message was sent by Atlassian Jira
(v8.20.10#820010)