Posted to issues@arrow.apache.org by "Yuqi Gu (JIRA)" <ji...@apache.org> on 2019/08/05 03:23:00 UTC
[jira] [Commented] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation
[ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899742#comment-16899742 ]
Yuqi Gu commented on ARROW-6131:
--------------------------------
The original UTF-8 benchmark:
{code:java}
----------------------------------------------------------------------------
Benchmark                         Time         CPU  Iterations
----------------------------------------------------------------------------
ValidateTinyAscii                 7 ns        7 ns   107435339   1.42978GB/s
ValidateTinyNonAscii             16 ns       16 ns    42655054   639.503MB/s
ValidateSmallAscii               29 ns       29 ns    24516945   4.4671GB/s
ValidateSmallAlmostAscii         91 ns       91 ns     7677848   1.51182GB/s
ValidateSmallNonAscii           175 ns      175 ns     4009837   731.98MB/s
ValidateLargeAscii            18821 ns    18814 ns       37194   4.95077GB/s
ValidateLargeAlmostAscii      64056 ns    64025 ns       10929   1.45533GB/s
ValidateLargeNonAscii        130321 ns   130249 ns        5375   732.909MB/s
{code}
The new algorithm:
{code:java}
----------------------------------------------------------------------------
Benchmark                         Time         CPU  Iterations
----------------------------------------------------------------------------
ValidateTinyAscii                 6 ns        6 ns   116427650   1.59527GB/s
ValidateTinyNonAscii             17 ns       17 ns    41897276   628.046MB/s
ValidateSmallAscii              117 ns      117 ns     5964896   1113.14MB/s
ValidateSmallAlmostAscii        145 ns      145 ns     4819232   971.76MB/s
ValidateSmallNonAscii           118 ns      118 ns     5947924   1085.68MB/s
ValidateLargeAscii            82297 ns    82247 ns        8511   1.13246GB/s
ValidateLargeAlmostAscii      81145 ns    81138 ns        8627   1.14838GB/s
ValidateLargeNonAscii         81221 ns    81202 ns        8621   1.14805GB/s
{code}
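
To make the comparison easier to read, the wall-clock times from the two tables above can be turned into speedup ratios (old time divided by new time). The following is only a small arithmetic sketch; the names `BenchRow`, `kRows`, `Speedup` and `PrintSpeedups` are hypothetical and not part of Arrow or of the benchmark suite.

```cpp
#include <cassert>
#include <cstdio>

// One row per benchmark: wall time in ns before and after the change,
// copied from the "Time" column of the two tables above.
struct BenchRow {
  const char* name;
  double old_ns;
  double new_ns;
};

constexpr BenchRow kRows[] = {
    {"ValidateTinyAscii", 7.0, 6.0},
    {"ValidateTinyNonAscii", 16.0, 17.0},
    {"ValidateSmallAscii", 29.0, 117.0},
    {"ValidateSmallAlmostAscii", 91.0, 145.0},
    {"ValidateSmallNonAscii", 175.0, 118.0},
    {"ValidateLargeAscii", 18821.0, 82297.0},
    {"ValidateLargeAlmostAscii", 64056.0, 81145.0},
    {"ValidateLargeNonAscii", 130321.0, 81221.0},
};

// A value above 1 means the new algorithm is faster; below 1 means slower.
inline double Speedup(const BenchRow& row) {
  assert(row.new_ns > 0.0);
  return row.old_ns / row.new_ns;
}

// Prints one "name ratio" line per benchmark.
inline void PrintSpeedups() {
  for (const BenchRow& row : kRows) {
    std::printf("%-26s %.2fx\n", row.name, Speedup(row));
  }
}
```

By this calculation the NonAscii cases improve by roughly 1.5x (small) and 1.6x (large), while the small and large all-ASCII cases run at roughly a quarter of their original speed. The Tiny rows are reported in whole nanoseconds, so their ratios are too coarse to read much into.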
> [C++] Optimize the Arrow UTF-8-string-validation
> -------------------------------------------------
>
> Key: ARROW-6131
> URL: https://issues.apache.org/jira/browse/ARROW-6131
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Yuqi Gu
> Assignee: Yuqi Gu
> Priority: Major
>
> The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).
> Range-based algorithm:
> 1. Map each byte of the input string to a range table.
> 2. Use the NEON 'tbl' instruction to perform the table lookup.
> 3. Find the pattern and set the correct table index for each input byte.
> 4. Validate the input string.
> The algorithm speeds up UTF-8 validation by about 1.6x for LargeNonAscii and SmallNonAscii, but it slows down the all-ASCII cases (where the input data consists entirely of ASCII characters).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, all-ASCII data is unusual on the internet.
> Could you please tell me what the typical use case for Apache Arrow is?
> Does the string data that Arrow needs to validate consist entirely of ASCII?
> If not, I'd like to submit the patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose a separate SIMD optimization in another JIRA.
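
For readers unfamiliar with the idea, the sketch below illustrates in plain scalar C++ the two notions discussed in the issue description: checking each byte against an allowed range that depends on the preceding lead byte, and a cheap all-ASCII pre-check. It is not the NEON implementation from the linked repository and not Arrow's `ValidateUTF8`; the function names `IsAllAscii` and `ValidateUtf8Scalar` are hypothetical, and the ASCII check tests one 64-bit word at a time rather than using real SIMD instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if every byte in [data, data + size) is < 0x80.
// Tests 8 bytes per step by looking at the high bit of each byte in a word.
inline bool IsAllAscii(const std::uint8_t* data, std::size_t size) {
  assert(data != nullptr || size == 0);
  std::size_t i = 0;
  for (; i + 8 <= size; i += 8) {
    std::uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));  // safe unaligned load
    if (word & 0x8080808080808080ULL) return false;
  }
  for (; i < size; ++i) {  // remaining 0..7 bytes
    if (data[i] & 0x80) return false;
  }
  return true;
}

// Hypothetical scalar validator: the lead byte fixes the sequence length
// and the allowed range [lo, hi] of the first continuation byte.
inline bool ValidateUtf8Scalar(const std::uint8_t* data, std::size_t size) {
  if (IsAllAscii(data, size)) return true;  // fast path for pure ASCII
  std::size_t i = 0;
  while (i < size) {
    const std::uint8_t lead = data[i];
    if (lead < 0x80) {  // single-byte (ASCII) character
      ++i;
      continue;
    }
    std::size_t trailing = 0;
    std::uint8_t lo = 0x80, hi = 0xBF;  // default continuation range
    if (lead >= 0xC2 && lead <= 0xDF) {
      trailing = 1;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      trailing = 2;
      if (lead == 0xE0) lo = 0xA0;  // rejects overlong 3-byte forms
      if (lead == 0xED) hi = 0x9F;  // rejects UTF-16 surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      trailing = 3;
      if (lead == 0xF0) lo = 0x90;  // rejects overlong 4-byte forms
      if (lead == 0xF4) hi = 0x8F;  // rejects code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }
    if (size - i <= trailing) return false;  // sequence is truncated
    if (data[i + 1] < lo || data[i + 1] > hi) return false;
    for (std::size_t k = 2; k <= trailing; ++k) {
      if (data[i + k] < 0x80 || data[i + k] > 0xBF) return false;
    }
    i += trailing + 1;
  }
  return true;
}
```

A vectorized version would apply the same kind of range checks to many bytes at once, which is presumably why its cost is similar for ASCII and non-ASCII input, whereas a pre-check like `IsAllAscii` only pays off when the data really is all ASCII.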