Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2022/03/09 07:55:00 UTC

[jira] [Commented] (ARROW-15878) [C++] Optimize csv writer for string with quotes

    [ https://issues.apache.org/jira/browse/ARROW-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503359#comment-17503359 ] 

Yibo Cai commented on ARROW-15878:
----------------------------------

Compared two possible optimization approaches. The results are not very satisfying.
Benchmarked two scenarios:
- WriteCsvStringWithQuote is the best case: there is only one quote, at the end of the string
- WriteCsvStringAllQuotes is the worst case: the whole string is filled with quotes

*approach0 (baseline): naive char by char copying*
{code:bash}
WriteCsvStringWithQuote/0      938246 ns       938230 ns          745 bytes_per_second=490.029M/s null_percent=0
WriteCsvStringWithQuote/1     1014895 ns      1014890 ns          688 bytes_per_second=448.751M/s null_percent=1
WriteCsvStringWithQuote/10    1060796 ns      1060780 ns          659 bytes_per_second=393.06M/s null_percent=10
WriteCsvStringWithQuote/50     891765 ns       891760 ns          786 bytes_per_second=269.686M/s null_percent=50
WriteCsvStringAllQuotes/0     1001146 ns      1001109 ns          699 bytes_per_second=785.086M/s null_percent=0
WriteCsvStringAllQuotes/1     1053971 ns      1053956 ns          664 bytes_per_second=738.526M/s null_percent=1
WriteCsvStringAllQuotes/10    1102326 ns      1102258 ns          655 bytes_per_second=645.272M/s null_percent=10
WriteCsvStringAllQuotes/50     894888 ns       894843 ns          781 bytes_per_second=451.882M/s null_percent=50
{code}
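
For reference, the baseline can be sketched roughly as below. This is only an illustration of char by char escaping, not the actual code in writer.cc; the names EscapeNaive and EscapeNaiveToString are hypothetical, and the output buffer is assumed to be pre-sized to twice the input length.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the Arrow implementation): copies `len` bytes from
// `in` to `out` one character at a time, writing every '"' twice.
// `out` must have room for 2 * len bytes. Returns one past the last byte written.
inline char* EscapeNaive(const char* in, std::size_t len, char* out) {
  for (std::size_t i = 0; i < len; ++i) {
    const char c = in[i];
    *out++ = c;
    if (c == '"') *out++ = '"';  // double the quote
  }
  return out;
}

// Convenience wrapper for illustration: allocates the worst-case size,
// escapes, then shrinks to the bytes actually written.
inline std::string EscapeNaiveToString(const std::string& in) {
  std::string buf(2 * in.size(), '\0');
  char* end = EscapeNaive(in.data(), in.size(), &buf[0]);
  buf.resize(static_cast<std::size_t>(end - &buf[0]));
  return buf;
}
```

The loop does the same work per byte whether or not the byte is a quote, which fits the baseline timings being similar in both scenarios.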

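*approach1: repeatedly find the next quote, memcpy the block before it*
- best case: improves about 20% (490.029M/s to 585.272M/s at null_percent=0)
- worst case: drops about 70% (785.086M/s to 217.905M/s at null_percent=0)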

{code:bash}
WriteCsvStringWithQuote/0      785568 ns       785549 ns          889 bytes_per_second=585.272M/s null_percent=0
WriteCsvStringWithQuote/1      849845 ns       849834 ns          821 bytes_per_second=535.908M/s null_percent=1
WriteCsvStringWithQuote/10     885708 ns       885696 ns          790 bytes_per_second=470.76M/s null_percent=10
WriteCsvStringWithQuote/50     840687 ns       840662 ns          832 bytes_per_second=286.079M/s null_percent=50
WriteCsvStringAllQuotes/0     3606928 ns      3606876 ns          192 bytes_per_second=217.905M/s null_percent=0
WriteCsvStringAllQuotes/1     3765233 ns      3765083 ns          186 bytes_per_second=206.735M/s null_percent=1
WriteCsvStringAllQuotes/10    3686031 ns      3685964 ns          190 bytes_per_second=192.963M/s null_percent=10
WriteCsvStringAllQuotes/50    2362894 ns      2362807 ns          295 bytes_per_second=171.137M/s null_percent=50
{code}
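
A minimal sketch of what approach1 could look like, assuming the search is done with memchr as suggested in the issue description. The names EscapeMemchr and EscapeMemchrToString are hypothetical, and this is not the code that produced the numbers above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: finds each quote with memchr and copies the run up to
// and including it with memcpy, then appends the extra quote.
// `out` must have room for 2 * len bytes. Returns one past the last byte written.
inline char* EscapeMemchr(const char* in, std::size_t len, char* out) {
  const char* end = in + len;
  while (in < end) {
    const std::size_t remaining = static_cast<std::size_t>(end - in);
    const void* hit = std::memchr(in, '"', remaining);
    if (hit == nullptr) {
      // No more quotes: copy the rest in one go.
      std::memcpy(out, in, remaining);
      out += remaining;
      break;
    }
    const char* quote = static_cast<const char*>(hit);
    const std::size_t n = static_cast<std::size_t>(quote - in) + 1;  // run + quote
    std::memcpy(out, in, n);
    out += n;
    *out++ = '"';  // the doubled quote
    in = quote + 1;
  }
  return out;
}

// Convenience wrapper for illustration only.
inline std::string EscapeMemchrToString(const std::string& in) {
  std::string buf(2 * in.size(), '\0');
  char* end = EscapeMemchr(in.data(), in.size(), &buf[0]);
  buf.resize(static_cast<std::size_t>(end - &buf[0]));
  return buf;
}
```

In a sketch like this, a string made only of quotes costs one memchr call and one memcpy call per input byte, which is one plausible explanation for the worst-case slowdown.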

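*approach2: check 8 chars at a time, memcpy if no quote, otherwise copy char by char*
- best case: improves about 10% (490.029M/s to 532.751M/s at null_percent=0)
- worst case: no meaningful difference (785.086M/s to 791.097M/s at null_percent=0)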

{code:bash}
WriteCsvStringWithQuote/0      862995 ns       862991 ns          809 bytes_per_second=532.751M/s null_percent=0
WriteCsvStringWithQuote/1      900671 ns       900650 ns          774 bytes_per_second=505.671M/s null_percent=1
WriteCsvStringWithQuote/10     896087 ns       896066 ns          779 bytes_per_second=465.312M/s null_percent=10
WriteCsvStringWithQuote/50     805413 ns       805363 ns          870 bytes_per_second=298.618M/s null_percent=50
WriteCsvStringAllQuotes/0      993539 ns       993503 ns          702 bytes_per_second=791.097M/s null_percent=0
WriteCsvStringAllQuotes/1     1043675 ns      1043650 ns          671 bytes_per_second=745.819M/s null_percent=1
WriteCsvStringAllQuotes/10    1041745 ns      1041702 ns          646 bytes_per_second=682.782M/s null_percent=10
WriteCsvStringAllQuotes/50     889888 ns       889870 ns          786 bytes_per_second=454.407M/s null_percent=50
{code}
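
A rough sketch of the approach2 idea follows. How the 8 chars are checked is not stated above, so the bit trick used here (XOR with the quote byte, then a has-zero-byte test on a 64-bit word) is an assumption, and the names HasQuoteIn8, EscapeBlockwise and EscapeBlockwiseToString are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Assumed check (not taken from the comment above): returns true if any of
// the 8 bytes at `p` is '"' (0x22).
inline bool HasQuoteIn8(const char* p) {
  std::uint64_t v;
  std::memcpy(&v, p, 8);           // unaligned-safe load
  v ^= 0x2222222222222222ULL;      // bytes equal to '"' become zero
  // Classic "does this word contain a zero byte" test.
  return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Hypothetical helper: 8-byte blocks without quotes are copied with memcpy;
// blocks with a quote, and the tail shorter than 8 bytes, go char by char.
// `out` must have room for 2 * len bytes. Returns one past the last byte written.
inline char* EscapeBlockwise(const char* in, std::size_t len, char* out) {
  const char* end = in + len;
  while (end - in >= 8) {
    if (!HasQuoteIn8(in)) {
      std::memcpy(out, in, 8);  // fast path
      out += 8;
    } else {
      for (int i = 0; i < 8; ++i) {  // slow path
        const char c = in[i];
        *out++ = c;
        if (c == '"') *out++ = '"';
      }
    }
    in += 8;
  }
  for (; in < end; ++in) {  // tail
    *out++ = *in;
    if (*in == '"') *out++ = '"';
  }
  return out;
}

// Convenience wrapper for illustration only.
inline std::string EscapeBlockwiseToString(const std::string& in) {
  std::string buf(2 * in.size(), '\0');
  char* end = EscapeBlockwise(in.data(), in.size(), &buf[0]);
  buf.resize(static_cast<std::size_t>(end - &buf[0]));
  return buf;
}
```

With this shape, a string full of quotes always takes the slow path, so it should behave much like the baseline plus one cheap check per block.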

> [C++] Optimize csv writer for string with quotes
> ------------------------------------------------
>
>                 Key: ARROW-15878
>                 URL: https://issues.apache.org/jira/browse/ARROW-15878
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>
> Escaping a string with quotes (putting an extra quote before each quote) is the hotspot of the csv writer [1]. This can probably be improved. Possible approaches:
>  - Find the next quote with memchr, then memcpy blocks without quotes.
>  - Check for quotes with simd in 8 bytes or 16 bytes; do memcpy if there are none, otherwise take the slow path.
> We should make sure the method doesn't decrease performance too much for strings with many quotes. It should also give similar or better performance for short strings, which is the common case.
> [1] [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)