Posted to jira@arrow.apache.org by "Todd Farmer (Jira)" <ji...@apache.org> on 2022/07/12 14:05:03 UTC

[jira] [Assigned] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation

     [ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Farmer reassigned ARROW-6131:
----------------------------------

    Assignee:     (was: Yuqi Gu)

This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

> [C++]  Optimize the Arrow UTF-8-string-validation
> -------------------------------------------------
>
>                 Key: ARROW-6131
>                 URL: https://issues.apache.org/jira/browse/ARROW-6131
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yuqi Gu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).
> Range-based algorithm:
>   1. Map each byte of the input string to a range table.
>   2. Use the Neon 'tbl' instruction to perform the table lookup.
>   3. Find the pattern and set the correct table index for each input byte.
>   4. Validate the input string.
> The algorithm speeds up UTF-8 validation by ~1.6x for the LargeNonAscii and SmallNonAscii benchmarks, but it slows down the all-ASCII cases (where the input data is entirely ASCII).
> The benchmark API is:
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, data that is entirely ASCII is unusual on the internet.
> Could you please tell me what the typical use cases for Apache Arrow are?
> Is the data that Arrow needs to validate all-ASCII strings?
> If not, I'd like to submit the patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose a separate SIMD optimization in another Jira issue.
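
For readers unfamiliar with the range idea in the description, here is a minimal scalar sketch in C++. It is not the Neon implementation from the linked repository and not Arrow's actual code. The function name `ValidateUtf8Range` is hypothetical, and the byte ranges are taken from the Unicode Standard's table of well-formed UTF-8 byte sequences. The lead byte selects how many continuation bytes follow and which range the first continuation byte must fall in; a SIMD version would presumably perform the same range selection with table lookups over many bytes at once.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper (not part of Arrow): scalar range-based UTF-8 validation.
// Returns true if `input` is a well-formed UTF-8 byte sequence.
bool ValidateUtf8Range(std::string_view input) {
  const auto* p = reinterpret_cast<const std::uint8_t*>(input.data());
  const std::size_t n = input.size();
  std::size_t i = 0;
  while (i < n) {
    const std::uint8_t b = p[i];
    if (b <= 0x7F) {  // ASCII: single byte, always valid
      ++i;
      continue;
    }
    std::size_t follow = 0;              // number of continuation bytes
    std::uint8_t lo = 0x80, hi = 0xBF;   // valid range of the first one
    if (b >= 0xC2 && b <= 0xDF) {
      follow = 1;
    } else if (b == 0xE0) {
      follow = 2; lo = 0xA0;             // rejects overlong 3-byte forms
    } else if (b >= 0xE1 && b <= 0xEC) {
      follow = 2;
    } else if (b == 0xED) {
      follow = 2; hi = 0x9F;             // rejects UTF-16 surrogates
    } else if (b == 0xEE || b == 0xEF) {
      follow = 2;
    } else if (b == 0xF0) {
      follow = 3; lo = 0x90;             // rejects overlong 4-byte forms
    } else if (b >= 0xF1 && b <= 0xF3) {
      follow = 3;
    } else if (b == 0xF4) {
      follow = 3; hi = 0x8F;             // rejects code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
    }
    if (n - i <= follow) return false;   // sequence truncated at end of input
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k <= follow; ++k) {
      // Remaining continuation bytes must always be in 0x80..0xBF.
      if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
    }
    i += follow + 1;
  }
  return true;
}
```

With this sketch, `ValidateUtf8Range("caf\xC3\xA9")` accepts the input, while the overlong sequence `"\xC0\x80"` and the encoded surrogate `"\xED\xA0\x80"` are rejected. The per-byte branching shown here is the part a table-driven SIMD approach is meant to avoid.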



--
This message was sent by Atlassian Jira
(v8.20.10#820010)