
Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2020/09/28 06:18:00 UTC

[jira] [Comment Edited] (ARROW-10058) [C++] Investigate performance of LevelsToBitmap without BMI2

    [ https://issues.apache.org/jira/browse/ARROW-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203017#comment-17203017 ] 

Yibo Cai edited comment on ARROW-10058 at 9/28/20, 6:17 AM:
------------------------------------------------------------

Kind of lost in the code (HAVE_BMI2, HAVE_RUNTIME_BMI2, runtime dispatch). Looks like I should benchmark against the scalar implementation [1], not the SIMD code with simulated pext [2].
"cmake -DARROW_RUNTIME_SIMD_LEVEL=SSE4_2" enables the scalar code on my Skylake machine, and the benchmark result is 790M/s. So it's still promising that a lookup table can improve performance significantly. Maybe we can drop the scalar implementation and simplify the code.

[1] https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L39
[2] https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion_inc.h#L75


was (Author: yibo):
Kind of lost in the code (HAVE_BMI2, HAVE_RUNTIME_BMI2, runtime dispatch). Looks like I should benchmark against the scalar implementation [1], not the SIMD code with simulated pext [2].
"cmake -DARROW_RUNTIME_SIMD_LEVEL=SSE4_2" enables the scalar code on my Skylake machine, and the benchmark result is 790M/s. So it's still promising that a lookup table can improve performance significantly.

[1] https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L39
[2] https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion_inc.h#L75

> [C++] Investigate performance of LevelsToBitmap without BMI2
> ------------------------------------------------------------
>
>                 Key: ARROW-10058
>                 URL: https://issues.apache.org/jira/browse/ARROW-10058
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Antoine Pitrou
>            Priority: Major
>         Attachments: opt-level-conv.diff
>
>
> Currently, when Parquet nested data involves repetition levels, converting the levels to a bitmap goes through a slow scalar path unless the BMI2 instruction set is available and efficient (the BMI2 path uses the PEXT instruction to process 16 levels at once).
> It may be possible to emulate PEXT for 5- or 6-bit masks by using a lookup table, allowing us to process 5-6 levels at once.
> (also, it would be good to add nested reading benchmarks for non-trivial nesting; currently we only benchmark one-level struct and one-level list)
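
The lookup-table idea in the issue description can be illustrated with a minimal sketch. This is not Arrow's implementation: the names PextScalar, PextTable and PextLookup, the 5-bit chunk width, and the table layout are all assumptions made for illustration. The sketch precomputes the PEXT result for every 5-bit (mask, value) pair, then walks a 64-bit word 5 bits at a time and concatenates the per-chunk results.

```cpp
#include <cstdint>

// Hypothetical illustration only; names and layout are not taken from Arrow.
constexpr int kChunkBits = 5;
constexpr int kChunkSize = 1 << kChunkBits;  // 32 possible 5-bit patterns

// Reference bit-by-bit software PEXT: gathers the bits of `value` selected
// by `mask` into the low bits of the result.
inline uint64_t PextScalar(uint64_t value, uint64_t mask) {
  uint64_t out = 0;
  int out_pos = 0;
  for (int i = 0; i < 64; ++i) {
    if ((mask >> i) & 1) {
      out |= ((value >> i) & 1) << out_pos;
      ++out_pos;
    }
  }
  return out;
}

// Precomputed PEXT results for every 5-bit (mask, value) pair (1 KiB),
// plus the number of bits each mask selects.
struct PextTable {
  uint8_t extracted[kChunkSize][kChunkSize];
  uint8_t popcount[kChunkSize];

  PextTable() {
    for (int mask = 0; mask < kChunkSize; ++mask) {
      int count = 0;
      for (int bit = 0; bit < kChunkBits; ++bit) {
        count += (mask >> bit) & 1;
      }
      popcount[mask] = static_cast<uint8_t>(count);
      for (int value = 0; value < kChunkSize; ++value) {
        extracted[mask][value] = static_cast<uint8_t>(PextScalar(value, mask));
      }
    }
  }
};

// Table-driven PEXT: handles 5 input bits per lookup instead of 1 per
// loop iteration.
inline uint64_t PextLookup(uint64_t value, uint64_t mask) {
  static const PextTable table;
  uint64_t out = 0;
  int out_pos = 0;
  for (int shift = 0; shift < 64; shift += kChunkBits) {
    const uint32_t m = static_cast<uint32_t>(mask >> shift) & (kChunkSize - 1);
    const uint32_t v = static_cast<uint32_t>(value >> shift) & (kChunkSize - 1);
    out |= static_cast<uint64_t>(table.extracted[m][v]) << out_pos;
    out_pos += table.popcount[m];
  }
  return out;
}
```

A 6-bit chunk would need a 64 x 64 table (4 KiB) in exchange for fewer lookups per word. Whether either width beats the existing scalar path would have to be measured with the benchmark.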



--
This message was sent by Atlassian Jira
(v8.3.4#803005)