Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/26 18:13:40 UTC

[GitHub] [arrow-rs] jhorstmann opened a new issue, #1949: AVX512 optimized filter kernels for primitive types

jhorstmann opened a new issue, #1949:
URL: https://github.com/apache/arrow-rs/issues/1949

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   In #1829 we removed the AVX512 optimizations for the AND/OR kernels because the autovectorized code was just as good, but there are some AVX512 instructions that could bring a big benefit and that the compiler would not be able to use automatically. One of them is the [`compressstore` instruction](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=compressstore&ig_expand=1426), which essentially implements most of the filter kernel in a single instruction.
   
   [I recently experimented with those](https://github.com/apache/arrow-rs/compare/master...jhorstmann:experiment-avx512-filter-kernel#diff-b0c344c0a7b4a8b292fe211fc32d2f88a9626d8a8b574e131df82250decf0d67R89) and found that, while our current filters are extremely good at very low and very high selectivities thanks to all the optimizations that @tustvold did, for selectivities between 5% and 99% the AVX512 version would be faster. For a random selectivity of 50% it is nearly 10x faster.
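   
   For illustration, here is a minimal sketch of what a `compressstore`-based filter for `i32` could look like. This is not the code from the linked branch: the function name is made up, tail handling is omitted, the exact intrinsic pointer types are glossed over with casts, and depending on the Rust version the AVX512 intrinsics and `#[target_feature(enable = "avx512f")]` may still require a nightly toolchain.
   
   ```rust
   use std::arch::x86_64::*;
   
   /// Writes the selected values to `out` and returns how many were kept.
   /// `filter` holds one bit per input value (1 = keep), 16 values per mask word.
   ///
   /// Safety: the caller must ensure the CPU supports AVX512F and that `out`
   /// has room for at least `values.len()` elements.
   #[target_feature(enable = "avx512f")]
   unsafe fn filter_i32_avx512(values: &[i32], filter: &[u16], out: &mut [i32]) -> usize {
       let mut out_ptr = out.as_mut_ptr();
       for (chunk, &mask) in values.chunks_exact(16).zip(filter) {
           // Load 16 lanes, then store only the lanes whose mask bit is set,
           // packed contiguously at `out_ptr`; this is essentially the whole
           // filter kernel for 16 values.
           let v = _mm512_loadu_si512(chunk.as_ptr() as *const _);
           _mm512_mask_compressstoreu_epi32(out_ptr as *mut _, mask, v);
           out_ptr = out_ptr.add(mask.count_ones() as usize);
       }
       // A tail of `values.len() % 16` elements would need a scalar fallback.
       out_ptr.offset_from(out.as_mut_ptr()) as usize
   }
   ```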
   
   **Describe the solution you'd like**
   
   There are a few open questions about how best to integrate these functions into the filter kernels. They don't fit that well into the existing strategies, since they would be specific to primitive arrays, and there might be different selectivity cutoffs for falling back to one of the existing strategies.
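   
   As a rough illustration of the kind of cutoff logic this might need (the helper name is hypothetical and the thresholds are placeholders loosely based on the 5% to 99% range observed above; the real values would come from benchmarks):
   
   ```rust
   /// Hypothetical decision helper for picking the AVX512 compress path.
   fn use_avx512_compress(selectivity: f64, is_primitive: bool, avx512_available: bool) -> bool {
       // Outside this band the existing slice/index based strategies are
       // already close to optimal, so keep falling back to them.
       const LOW: f64 = 0.05;
       const HIGH: f64 = 0.99;
       avx512_available && is_primitive && (LOW..=HIGH).contains(&selectivity)
   }
   ```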
   
   We would also need to decide whether to dispatch to these kernels statically, based on `target-cpu` or `target-feature` flags, or to use runtime feature detection.
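   
   The two options could look roughly like this (x86_64-only sketch; `filter_i32_avx512` is the sketch above, and `filter_i32_scalar` is a simplified stand-in for the existing non-SIMD path, not the actual arrow-rs code):
   
   ```rust
   fn filter_i32(values: &[i32], filter: &[u16], out: &mut [i32]) -> usize {
       // Static dispatch: with `-C target-feature=+avx512f` (or a matching
       // `target-cpu`) this is a compile-time constant and the branch folds
       // away; without it the AVX512 path is never taken.
       if cfg!(target_feature = "avx512f") {
           return unsafe { filter_i32_avx512(values, filter, out) };
       }
       // Runtime detection: works with a generic x86_64 build at the cost of
       // a cached CPUID check, which should be hoisted out of any hot loop.
       if is_x86_feature_detected!("avx512f") {
           return unsafe { filter_i32_avx512(values, filter, out) };
       }
       filter_i32_scalar(values, filter, out)
   }
   
   /// Simplified stand-in for the existing scalar implementation.
   fn filter_i32_scalar(values: &[i32], filter: &[u16], out: &mut [i32]) -> usize {
       let mut n = 0;
       for (i, &v) in values.iter().enumerate() {
           if (filter[i / 16] >> (i % 16)) & 1 == 1 {
               out[n] = v;
               n += 1;
           }
       }
       n
   }
   ```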
   
   The 8- and 16-bit versions of these instructions are also only available since the `icelake` generation, which makes testing a bit more difficult.
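   
   Concretely, the 32/64-bit compress stores only need AVX512F, while the byte/word variants (`vpcompressb`/`vpcompressw`) are part of AVX512-VBMI2, so with runtime detection they would need a separate check (hypothetical helper name):
   
   ```rust
   /// Returns (32/64-bit support, 8/16-bit support) for the compress stores.
   fn compress_store_support() -> (bool, bool) {
       let wide = is_x86_feature_detected!("avx512f");       // vpcompressd / vpcompressq
       let narrow = is_x86_feature_detected!("avx512vbmi2"); // vpcompressb / vpcompressw
       (wide, narrow)
   }
   ```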
   
   **Describe alternatives you've considered**
   
   There is a [discussion in the portable-simd project about portable alternatives to these instructions](https://github.com/rust-lang/portable-simd/issues/240), but that would require quite some work in LLVM, since there are no portable LLVM intrinsics for them yet, only the x86/AVX512 implementations.
   
   **Additional context**
   
   Benchmark results for filtering `i32` on a `tigerlake` machine running at 3 GHz:
   
   ```
   filter i32 (kept 50%)                           time: [55.624 us 55.657 us 55.699 us]
   filter i32 high selectivity (kept 95%)          time: [18.635 us 18.650 us 18.671 us]
   filter i32 low selectivity (kept 5%)            time: [5.6434 us 5.6778 us 5.7203 us]
   
   filter i32 avx512 (kept 50%)                    time: [6.0487 us 6.0529 us 6.0579 us]
   filter i32 avx512 high selectivity (kept 95%)   time: [6.2818 us 6.2847 us 6.2879 us]
   filter i32 avx512 low selectivity (kept 5%)     time: [5.4591 us 5.4618 us 5.4651 us]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org