Posted to issues@arrow.apache.org by "Yuqi Gu (JIRA)" <ji...@apache.org> on 2019/08/05 03:23:00 UTC
[jira] [Commented] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation

    [ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899742#comment-16899742 ] 

Yuqi Gu commented on ARROW-6131:
--------------------------------

The original UTF-8 benchmark:

{code:java}
----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
ValidateTinyAscii                 7 ns          7 ns  107435339   1.42978GB/s
ValidateTinyNonAscii             16 ns         16 ns   42655054   639.503MB/s
ValidateSmallAscii               29 ns         29 ns   24516945    4.4671GB/s
ValidateSmallAlmostAscii         91 ns         91 ns    7677848   1.51182GB/s
ValidateSmallNonAscii           175 ns        175 ns    4009837    731.98MB/s
ValidateLargeAscii            18821 ns      18814 ns      37194   4.95077GB/s
ValidateLargeAlmostAscii      64056 ns      64025 ns      10929   1.45533GB/s
ValidateLargeNonAscii        130321 ns     130249 ns       5375   732.909MB/s
{code}


The new algorithm:

{code:java}
----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
ValidateTinyAscii                 6 ns          6 ns  116427650   1.59527GB/s
ValidateTinyNonAscii             17 ns         17 ns   41897276   628.046MB/s
ValidateSmallAscii              117 ns        117 ns    5964896   1113.14MB/s
ValidateSmallAlmostAscii        145 ns        145 ns    4819232    971.76MB/s
ValidateSmallNonAscii           118 ns        118 ns    5947924   1085.68MB/s
ValidateLargeAscii            82297 ns      82247 ns       8511   1.13246GB/s
ValidateLargeAlmostAscii      81145 ns      81138 ns       8627   1.14838GB/s
ValidateLargeNonAscii         81221 ns      81202 ns       8621   1.14805GB/s
{code}
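
To make the two tables easier to compare, the per-case change can be expressed as the ratio of old CPU time to new CPU time. The snippet below is only an illustrative sketch: the helper name `Speedup` and the constant names are hypothetical, and the numbers are copied from the CPU columns above.

```cpp
// Hypothetical helper: ratio of old to new CPU time.
// A value above 1.0 means the new algorithm is faster.
double Speedup(double old_ns, double new_ns) {
  return old_ns / new_ns;
}

// CPU times (ns) copied from the two benchmark tables above.
constexpr double kLargeNonAsciiOld = 130249.0;
constexpr double kLargeNonAsciiNew = 81202.0;
constexpr double kLargeAsciiOld = 18814.0;
constexpr double kLargeAsciiNew = 82247.0;
```

Calling `Speedup(kLargeNonAsciiOld, kLargeNonAsciiNew)` gives a value above 1 (the new code is faster), while `Speedup(kLargeAsciiOld, kLargeAsciiNew)` gives a value below 1 (the new code is slower).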





> [C++]  Optimize the Arrow UTF-8-string-validation
> -------------------------------------------------
>
>                 Key: ARROW-6131
>                 URL: https://issues.apache.org/jira/browse/ARROW-6131
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yuqi Gu
>            Assignee: Yuqi Gu
>            Priority: Major
>
> The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).
> Range-based algorithm:
>   1. Map each byte of the input string to a range table.
>   2. Use the Neon 'tbl' instruction to perform the table lookup.
>   3. Find the pattern and set the correct table index for each input byte.
>   4. Validate the input string.
> The algorithm speeds up UTF-8 validation by ~1.6x for LargeNonAscii and SmallNonAscii, but it slows down the all-ASCII cases (where the input data is entirely ASCII).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, all-ASCII data is unusual on the internet.
> Could you please tell me what the typical use case for Apache Arrow is?
> Are the strings that Arrow needs to validate all-ASCII?
> If not, I'd like to submit the patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose another SIMD-based optimization in a separate JIRA.
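
The range idea in the quoted description can be illustrated with a scalar sketch. This is not the Neon implementation from the linked repository and not Arrow's code: the function name `ValidateUtf8Scalar` and the `ByteRange` struct are hypothetical, and the table lookup is replaced by plain branches. It only shows the principle that each lead byte determines the allowed range of the bytes that follow it.

```cpp
#include <cstddef>
#include <cstdint>

// Inclusive [lo, hi] range that a byte must fall in.
struct ByteRange {
  std::uint8_t lo, hi;
};

// Hypothetical scalar illustration of range-based UTF-8 validation.
// The lead byte selects how many continuation bytes follow and which
// range the first of them must lie in; the rest must be 0x80..0xBF.
bool ValidateUtf8Scalar(const std::uint8_t* data, std::size_t len) {
  std::size_t i = 0;
  while (i < len) {
    const std::uint8_t b = data[i++];
    if (b < 0x80) continue;  // ASCII byte, always valid

    std::size_t follow = 0;
    ByteRange first{0x80, 0xBF};
    if (b >= 0xC2 && b <= 0xDF) {
      follow = 1;
    } else if (b == 0xE0) {
      follow = 2; first = {0xA0, 0xBF};  // rejects overlong 3-byte forms
    } else if (b == 0xED) {
      follow = 2; first = {0x80, 0x9F};  // rejects UTF-16 surrogates
    } else if (b >= 0xE1 && b <= 0xEF) {
      follow = 2;
    } else if (b == 0xF0) {
      follow = 3; first = {0x90, 0xBF};  // rejects overlong 4-byte forms
    } else if (b >= 0xF1 && b <= 0xF3) {
      follow = 3;
    } else if (b == 0xF4) {
      follow = 3; first = {0x80, 0x8F};  // rejects code points > U+10FFFF
    } else {
      return false;  // stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF
    }

    if (len - i < follow) return false;  // truncated sequence
    for (std::size_t k = 0; k < follow; ++k, ++i) {
      const ByteRange r = (k == 0) ? first : ByteRange{0x80, 0xBF};
      if (data[i] < r.lo || data[i] > r.hi) return false;
    }
  }
  return true;
}
```

A SIMD version would replace the branch chain with table lookups over many bytes at once, which is presumably where the 'tbl' instruction comes in; the ranges themselves follow from the UTF-8 encoding rules.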


