Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2022/03/09 07:55:00 UTC
[jira] [Commented] (ARROW-15878) [C++] Optimize csv writer for string with quotes
[ https://issues.apache.org/jira/browse/ARROW-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503359#comment-17503359 ]
Yibo Cai commented on ARROW-15878:
----------------------------------
Compared two possible optimization approaches. The results are not very satisfying.
Benchmarked two scenarios:
- WriteCsvStringWithQuote is the best case: there is only one quote, at the end of the string
- WriteCsvStringAllQuotes is the worst case: the whole string consists of quotes
*approach0 (baseline): naive char-by-char copying*
{code:bash}
WriteCsvStringWithQuote/0 938246 ns 938230 ns 745 bytes_per_second=490.029M/s null_percent=0
WriteCsvStringWithQuote/1 1014895 ns 1014890 ns 688 bytes_per_second=448.751M/s null_percent=1
WriteCsvStringWithQuote/10 1060796 ns 1060780 ns 659 bytes_per_second=393.06M/s null_percent=10
WriteCsvStringWithQuote/50 891765 ns 891760 ns 786 bytes_per_second=269.686M/s null_percent=50
WriteCsvStringAllQuotes/0 1001146 ns 1001109 ns 699 bytes_per_second=785.086M/s null_percent=0
WriteCsvStringAllQuotes/1 1053971 ns 1053956 ns 664 bytes_per_second=738.526M/s null_percent=1
WriteCsvStringAllQuotes/10 1102326 ns 1102258 ns 655 bytes_per_second=645.272M/s null_percent=10
WriteCsvStringAllQuotes/50 894888 ns 894843 ns 781 bytes_per_second=451.882M/s null_percent=50
{code}
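For reference, the baseline can be sketched roughly as follows. This is a minimal illustration, not the actual code in writer.cc: the names EscapeNaive and EscapeNaiveToString are hypothetical, and the sketch assumes the output buffer is pre-sized for the worst case of twice the input length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not Arrow's actual writer code): copy `in` to `out`
// one byte at a time, writing every quote twice.
// `out` must have room for 2 * in.size() bytes; returns the end of the output.
char* EscapeNaive(std::string_view in, char* out) {
  for (char c : in) {
    if (c == '"') *out++ = '"';  // extra quote before the quote
    *out++ = c;
  }
  return out;
}

// Convenience wrapper that sizes the buffer for the worst case (all quotes).
std::string EscapeNaiveToString(std::string_view in) {
  std::string buf(2 * in.size(), '\0');
  char* end = EscapeNaive(in, buf.data());
  assert(static_cast<std::size_t>(end - buf.data()) <= buf.size());
  buf.resize(static_cast<std::size_t>(end - buf.data()));
  return buf;
}
```

Every input byte goes through the same compare-and-store loop, so the cost per byte is roughly the same whether the string has one quote or consists entirely of quotes.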
*approach1: repeatedly find the next quote, memcpy the block before it*
- best case: improves 20%
- worst case: drops 70%
{code:bash}
WriteCsvStringWithQuote/0 785568 ns 785549 ns 889 bytes_per_second=585.272M/s null_percent=0
WriteCsvStringWithQuote/1 849845 ns 849834 ns 821 bytes_per_second=535.908M/s null_percent=1
WriteCsvStringWithQuote/10 885708 ns 885696 ns 790 bytes_per_second=470.76M/s null_percent=10
WriteCsvStringWithQuote/50 840687 ns 840662 ns 832 bytes_per_second=286.079M/s null_percent=50
WriteCsvStringAllQuotes/0 3606928 ns 3606876 ns 192 bytes_per_second=217.905M/s null_percent=0
WriteCsvStringAllQuotes/1 3765233 ns 3765083 ns 186 bytes_per_second=206.735M/s null_percent=1
WriteCsvStringAllQuotes/10 3686031 ns 3685964 ns 190 bytes_per_second=192.963M/s null_percent=10
WriteCsvStringAllQuotes/50 2362894 ns 2362807 ns 295 bytes_per_second=171.137M/s null_percent=50
{code}
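A rough sketch of approach1, assuming memchr is used to locate the next quote as the issue description suggests. The names EscapeMemchr and EscapeMemchrToString are hypothetical, and this is not necessarily the code that was benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper: find the next quote with memchr, memcpy the quote-free
// run before it, then emit the doubled quote.
// `out` must have room for 2 * in.size() bytes; returns the end of the output.
char* EscapeMemchr(std::string_view in, char* out) {
  const char* p = in.data();
  const char* end = p + in.size();
  while (p < end) {
    const void* hit = std::memchr(p, '"', static_cast<std::size_t>(end - p));
    const char* q = hit ? static_cast<const char*>(hit) : end;
    const std::size_t n = static_cast<std::size_t>(q - p);
    std::memcpy(out, p, n);  // quote-free run; n is 0 for back-to-back quotes
    out += n;
    if (q == end) break;     // no more quotes in the input
    *out++ = '"';            // extra quote
    *out++ = '"';            // the quote itself
    p = q + 1;
  }
  return out;
}

// Convenience wrapper that sizes the buffer for the worst case (all quotes).
std::string EscapeMemchrToString(std::string_view in) {
  std::string buf(2 * in.size(), '\0');
  char* end = EscapeMemchr(in, buf.data());
  assert(static_cast<std::size_t>(end - buf.data()) <= buf.size());
  buf.resize(static_cast<std::size_t>(end - buf.data()));
  return buf;
}
```

When the string is all quotes, each loop iteration makes one memchr call and one zero-length memcpy call just to emit two bytes. That per-call overhead is a plausible explanation for the worst-case drop, though it was not profiled here.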
*approach2: check 8 chars, memcpy if no quote, otherwise copy char by char*
- best case: improves 10%
- worst case: no difference
{code:bash}
WriteCsvStringWithQuote/0 862995 ns 862991 ns 809 bytes_per_second=532.751M/s null_percent=0
WriteCsvStringWithQuote/1 900671 ns 900650 ns 774 bytes_per_second=505.671M/s null_percent=1
WriteCsvStringWithQuote/10 896087 ns 896066 ns 779 bytes_per_second=465.312M/s null_percent=10
WriteCsvStringWithQuote/50 805413 ns 805363 ns 870 bytes_per_second=298.618M/s null_percent=50
WriteCsvStringAllQuotes/0 993539 ns 993503 ns 702 bytes_per_second=791.097M/s null_percent=0
WriteCsvStringAllQuotes/1 1043675 ns 1043650 ns 671 bytes_per_second=745.819M/s null_percent=1
WriteCsvStringAllQuotes/10 1041745 ns 1041702 ns 646 bytes_per_second=682.782M/s null_percent=10
WriteCsvStringAllQuotes/50 889888 ns 889870 ns 786 bytes_per_second=454.407M/s null_percent=50
{code}
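A rough sketch of approach2. The 8-byte check here uses a portable bit trick on a 64-bit word in place of SIMD intrinsics, which is an assumption; the benchmarked code may detect quotes differently. The names HasQuote8, EscapeBlock8 and EscapeBlock8ToString are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// True if any of the 8 bytes at `p` is a quote (0x22).
inline bool HasQuote8(const char* p) {
  std::uint64_t v;
  std::memcpy(&v, p, 8);          // unaligned-safe load
  v ^= 0x2222222222222222ULL;     // bytes equal to '"' become zero
  // Classic "has zero byte" test: nonzero iff some byte of v is zero.
  return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Hypothetical helper: memcpy 8-byte blocks that contain no quote, and fall
// back to char-by-char copying for blocks with a quote and for the tail.
// `out` must have room for 2 * in.size() bytes; returns the end of the output.
char* EscapeBlock8(std::string_view in, char* out) {
  const char* p = in.data();
  const char* end = p + in.size();
  while (end - p >= 8) {
    if (!HasQuote8(p)) {
      std::memcpy(out, p, 8);     // fast path: no quote in this block
      out += 8;
    } else {
      for (int i = 0; i < 8; ++i) {  // slow path: same as the baseline
        if (p[i] == '"') *out++ = '"';
        *out++ = p[i];
      }
    }
    p += 8;
  }
  for (; p < end; ++p) {          // tail shorter than 8 bytes
    if (*p == '"') *out++ = '"';
    *out++ = *p;
  }
  return out;
}

// Convenience wrapper that sizes the buffer for the worst case (all quotes).
std::string EscapeBlock8ToString(std::string_view in) {
  std::string buf(2 * in.size(), '\0');
  char* end = EscapeBlock8(in, buf.data());
  assert(static_cast<std::size_t>(end - buf.data()) <= buf.size());
  buf.resize(static_cast<std::size_t>(end - buf.data()));
  return buf;
}
```

In the all-quotes case every block takes the slow path, so the only added work is one cheap check per 8 bytes, which fits the "no difference" result above.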
> [C++] Optimize csv writer for string with quotes
> ------------------------------------------------
>
> Key: ARROW-15878
> URL: https://issues.apache.org/jira/browse/ARROW-15878
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Yibo Cai
> Assignee: Yibo Cai
> Priority: Major
>
> Escaping a string with quotes (putting an extra quote before each quote) is the hotspot of the CSV writer [1]. This can probably be improved. Possible approaches:
> - Find the next quote with memchr, then memcpy the blocks without quotes.
> - Check for quotes with SIMD, 8 or 16 bytes at a time; memcpy if there are none, otherwise take the slow path.
> We should make sure the method doesn't decrease performance too much for strings with many quotes, and that performance is similar or better for short strings, which are the common case.
> [1] [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.cc#L139]