Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2021/03/29 05:32:00 UTC

[jira] [Comment Edited] (ARROW-9842) [C++] Explore alternative strategy for Compare kernel implementation for better performance

    [ https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310428#comment-17310428 ] 

Yibo Cai edited comment on ARROW-9842 at 3/29/21, 5:31 AM:
-----------------------------------------------------------

Tested on Skylake, clang-9.

Updated the POC patch to evaluate the best chunk size for accumulating bits before packing.
 [^movemask-in-chunks.diff]

To my surprise, big chunks actually hurt performance (tested with arrow-compute-scalar-compare-benchmark). Chunk size 16 gives ~3G items/sec, while chunk size 256 gives ~2G.
My theory is that for big chunk sizes, the CPU has to stall waiting for memory loads. For small chunk sizes, the CPU can interleave memory loads with the later computation (packing bytes into bits). I see IPC (instructions per cycle) drop from 2.4 to 2.2 when the chunk size increases from 16 to 64.
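To make the chunked strategy concrete, here is a minimal self-contained sketch (not the actual patch code; names and the scalar inner loops are illustrative): compare values into a small temporary bytemap, then pack each chunk of bytes into bits of the output bitmap. The byte-producing loop is branch-light so the compiler can auto-vectorize it, and chunk size 16 matches the best-performing configuration from the benchmark above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Chunk size (in bytes) accumulated before bitpacking; 16 performed best
// in the benchmark discussed above.
constexpr std::size_t kChunkSize = 16;

// Pack n bytes (each 0x00 or 0xFF) into bits, LSB-first like Arrow bitmaps.
inline void PackBytesToBits(const uint8_t* bytes, std::size_t n, uint8_t* bits) {
  for (std::size_t i = 0; i < n; ++i) {
    if (bytes[i]) bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
  }
}

// Compare left[i] < right[i] over n items, producing a packed bitmap.
// Hypothetical kernel shape, not Arrow's actual API.
std::vector<uint8_t> CompareLessChunked(const int32_t* left, const int32_t* right,
                                        std::size_t n) {
  std::vector<uint8_t> out((n + 7) / 8, 0);
  uint8_t bytemap[kChunkSize];
  for (std::size_t base = 0; base < n; base += kChunkSize) {
    const std::size_t len = std::min(kChunkSize, n - base);
    for (std::size_t i = 0; i < len; ++i) {
      // Vectorizable comparison into a temporary bytemap.
      bytemap[i] = left[base + i] < right[base + i] ? 0xFF : 0x00;
    }
    // base is a multiple of 16, so base / 8 is a whole byte offset.
    PackBytesToBits(bytemap, len, out.data() + base / 8);
  }
  return out;
}
```

On x86 the inner packing step would typically be a _mm_movemask_epi8 over the SIMD comparison result rather than this scalar loop, which is what the attached diff's name suggests.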



> [C++] Explore alternative strategy for Compare kernel implementation for better performance
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9842
>                 URL: https://issues.apache.org/jira/browse/ARROW-9842
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 5.0.0
>
>         Attachments: movemask-in-chunks.diff, movemask.patch
>
>
> The compiler may be able to vectorize comparison operations if the bitpacking of results is deferred until the end (or done in chunks). Instead, a temporary bytemap can be populated on a chunk-by-chunk basis and then the bytemaps can be bitpacked into the output buffer. This may also reduce the code size of the compare kernels (which are actually quite large at the moment).
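The fully deferred variant described in the issue can be sketched as follows (an illustrative example, not Arrow's kernel implementation): fill a whole temporary bytemap first with a trivially vectorizable loop, then bitpack it in a single separate pass.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of deferring bitpacking until the end.
std::vector<uint8_t> CompareGreaterDeferred(const int32_t* left,
                                            const int32_t* right,
                                            std::size_t n) {
  // Step 1: populate a temporary bytemap; the compiler can vectorize this
  // loop easily because each output is an independent byte.
  std::vector<uint8_t> bytemap(n);
  for (std::size_t i = 0; i < n; ++i) {
    bytemap[i] = left[i] > right[i] ? 1 : 0;
  }
  // Step 2: bitpack the bytemap into the output buffer in one pass.
  std::vector<uint8_t> bits((n + 7) / 8, 0);
  for (std::size_t i = 0; i < n; ++i) {
    bits[i / 8] |= static_cast<uint8_t>(bytemap[i] << (i % 8));
  }
  return bits;
}
```

The trade-off the chunked experiment above explores is that this version touches the full bytemap twice, so for large inputs the second pass reloads data that has already left the cache.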



--
This message was sent by Atlassian Jira
(v8.3.4#803005)