Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/15 11:52:50 UTC
[GitHub] [arrow-rs] jhorstmann opened a new issue #1182: Evaluate performance of simd on simple arithmetic
jhorstmann opened a new issue #1182:
URL: https://github.com/apache/arrow-rs/issues/1182
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
For simple arithmetic kernels (+, -, *), the compiler should be able to automatically vectorize the scalar code and may even generate better code than our custom simd implementations. Our simd kernels currently process a fixed number of lanes at a time, depending on the element type. An autovectorized implementation can additionally be unrolled multiple times, so it only has to check the loop condition once every n lanes.
The checked division kernels probably still benefit from the custom simd implementation and should be kept.
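As a rough illustration of the two styles being compared, here is a minimal sketch using plain slices rather than Arrow arrays. The function names and the `LANES` constant are hypothetical and are not taken from the arrow-rs code base. The first function is the kind of scalar loop the compiler can usually vectorize and unroll on its own; the second imitates a kernel that always steps by a fixed lane count.

```rust
/// Plain scalar loop. Zipping the slices means there is no per-element
/// indexing, so the optimizer is free to vectorize and unroll the loop.
fn multiply_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = *x * *y;
    }
}

/// Hypothetical lane count, standing in for an 8-lane f32 vector.
const LANES: usize = 8;

/// Processes exactly LANES elements per step, then a scalar remainder.
/// This only imitates the shape of a fixed-width simd kernel.
fn multiply_chunked(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    let mut out_chunks = out.chunks_exact_mut(LANES);
    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);
    for ((o, x), y) in out_chunks.by_ref().zip(a_chunks.by_ref()).zip(b_chunks.by_ref()) {
        for i in 0..LANES {
            o[i] = x[i] * y[i];
        }
    }
    // Elements that do not fill a whole chunk are handled one by one.
    let out_rem = out_chunks.into_remainder();
    for ((o, x), y) in out_rem.iter_mut().zip(a_chunks.remainder()).zip(b_chunks.remainder()) {
        *o = *x * *y;
    }
}
```

Both functions produce the same result for equal-length inputs. Whether the scalar version actually ends up faster depends on the compiler version and target flags, which is what the proposed benchmarks are meant to measure.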
**Describe the solution you'd like**
- add some more benchmarks in `arithmetic_kernels`
- make those benchmarks process a larger amount of data; currently the arrays have length 512, so the overhead of allocation or validity bitmap calculation might dominate the actual arithmetic computation
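To see why the array length matters, the per-element cost can be measured at different sizes. The following is a minimal std-only sketch, not the project's actual benchmark setup; `add_alloc` and `nanos_per_element` are hypothetical helpers, and a proper benchmark harness gives far more reliable numbers than this simple timer.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Adds two slices into a freshly allocated Vec, so every call pays for
/// an allocation in addition to the arithmetic.
fn add_alloc(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Average wall-clock nanoseconds per element over `iterations` calls.
fn nanos_per_element(len: usize, iterations: u32) -> f64 {
    let a = vec![1.0f32; len];
    let b = vec![2.0f32; len];
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from removing the work.
        black_box(add_alloc(black_box(a.as_slice()), black_box(b.as_slice())));
    }
    let total_ns = start.elapsed().as_nanos() as f64;
    total_ns / (f64::from(iterations) * len as f64)
}
```

Comparing `nanos_per_element(512, 10_000)` with `nanos_per_element(64 * 1024, 1_000)` gives a rough idea of how much of the small-array time is fixed overhead rather than arithmetic.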
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb closed issue #1182: Evaluate performance of simd on simple arithmetic
Posted by GitBox <gi...@apache.org>.
alamb closed issue #1182:
URL: https://github.com/apache/arrow-rs/issues/1182
[GitHub] [arrow-rs] jhorstmann commented on issue #1182: Evaluate performance of simd on simple arithmetic
Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1182:
URL: https://github.com/apache/arrow-rs/issues/1182#issuecomment-1013669825
Benchmarks with array size of 64k, run on an AMD Ryzen 3700U laptop.
Compiled with `$ RUSTFLAGS="-C target-cpu=skylake"`
(The Skylake code generator in LLVM seems to have received more tuning than the one for Zen, but the architectures are otherwise quite close.)
With simd feature:
```
add time: [27.039 us 27.895 us 28.542 us]
subtract time: [26.408 us 27.194 us 28.040 us]
multiply time: [27.303 us 27.872 us 28.429 us]
divide time: [45.050 us 46.385 us 47.590 us]
divide_unchecked time: [27.806 us 29.673 us 31.300 us]
divide_scalar time: [21.931 us 23.228 us 24.801 us]
modulo time: [434.12 us 435.14 us 436.78 us]
modulo_scalar time: [1.5086 ms 1.6025 ms 1.7023 ms]
add_nulls time: [26.502 us 27.072 us 27.736 us]
divide_nulls time: [40.180 us 40.329 us 40.515 us]
divide_nulls_unchecked time: [32.199 us 32.461 us 32.743 us]
divide_scalar_nulls time: [33.899 us 33.911 us 33.922 us]
modulo_nulls time: [663.39 us 708.55 us 756.58 us]
modulo_scalar_nulls time: [1.1747 ms 1.2106 ms 1.2556 ms]
```
Without simd feature:
```
add time: [12.703 us 13.485 us 14.165 us]
change: [-45.506% -41.284% -36.365%] (p = 0.00 < 0.05)
Performance has improved.
subtract time: [17.586 us 17.829 us 18.018 us]
change: [-42.087% -40.678% -39.176%] (p = 0.00 < 0.05)
Performance has improved.
multiply time: [16.261 us 16.638 us 17.035 us]
change: [-41.123% -39.753% -38.321%] (p = 0.00 < 0.05)
Performance has improved.
divide time: [97.142 us 104.55 us 111.13 us]
change: [+113.55% +125.75% +136.87%] (p = 0.00 < 0.05)
Performance has regressed.
divide_unchecked time: [24.008 us 24.153 us 24.328 us]
change: [-23.200% -19.850% -16.066%] (p = 0.00 < 0.05)
Performance has improved.
divide_scalar time: [14.484 us 15.522 us 16.626 us]
change: [-40.708% -36.430% -32.487%] (p = 0.00 < 0.05)
Performance has improved.
modulo time: [368.17 us 390.38 us 411.11 us]
change: [-9.5426% -6.4033% -3.4307%] (p = 0.00 < 0.05)
Performance has improved.
modulo_scalar time: [1.2196 ms 1.2890 ms 1.3754 ms]
change: [-14.441% -9.7393% -5.2267%] (p = 0.00 < 0.05)
Performance has improved.
add_nulls time: [17.075 us 17.339 us 17.598 us]
change: [-40.726% -39.565% -38.339%] (p = 0.00 < 0.05)
Performance has improved.
divide_nulls time: [409.69 us 437.35 us 460.53 us]
change: [+762.95% +804.17% +855.01%] (p = 0.00 < 0.05)
Performance has regressed.
divide_nulls_unchecked time: [24.312 us 24.435 us 24.566 us]
change: [-25.160% -24.640% -24.154%] (p = 0.00 < 0.05)
Performance has improved.
divide_scalar_nulls time: [18.353 us 19.136 us 19.961 us]
change: [-46.439% -44.289% -41.942%] (p = 0.00 < 0.05)
Performance has improved.
modulo_nulls time: [479.73 us 509.93 us 546.35 us]
change: [-26.438% -22.742% -18.697%] (p = 0.00 < 0.05)
Performance has improved.
modulo_scalar_nulls time: [1.0541 ms 1.1235 ms 1.1982 ms]
change: [-21.713% -17.154% -12.422%] (p = 0.00 < 0.05)
Performance has improved.
```
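The median times above can be put on a per-element scale with a little arithmetic. This is a minimal sketch that assumes "64k" means 65,536 elements; the helper names are hypothetical.

```rust
/// Assumption: an array size of "64k" is 64 * 1024 = 65,536 elements.
const LEN: usize = 64 * 1024;

/// Elements processed per nanosecond for a run that took `micros` microseconds.
fn elements_per_ns(len: usize, micros: f64) -> f64 {
    len as f64 / (micros * 1_000.0)
}

/// Ratio of two run times; values above 1.0 mean the new time is shorter.
fn speedup(old_micros: f64, new_micros: f64) -> f64 {
    old_micros / new_micros
}
```

For example, `elements_per_ns(LEN, 27.895)` and `elements_per_ns(LEN, 13.485)` convert the two `add` medians, and `speedup(27.895, 13.485)` compares them directly.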
**Summary**: Autovectorized code is about 40% faster for simple arithmetic.
Division with nulls is 10x slower without simd, so we should keep that optimized implementation.
Modulo is about the same speed with and without simd. This CPU does not have a simd fmod instruction,
and the implementation actually calls the libc `fmodf` function for each element in both versions.
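One plausible reason checked division is harder to autovectorize is that each element needs a data-dependent branch. The sketch below assumes that "checked" means a zero divisor in a non-null slot is reported as an error; it uses a `&[bool]` validity slice instead of a bit-packed bitmap, and all names are hypothetical rather than the arrow-rs API.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct DivideByZero;

/// Checked division: `valid[i] == false` marks a null slot whose divisor is
/// ignored. The early return and the per-element branches make this loop
/// much harder for the compiler to vectorize.
fn divide_checked(a: &[f32], b: &[f32], valid: &[bool]) -> Result<Vec<f32>, DivideByZero> {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), valid.len());
    let mut out = Vec::with_capacity(a.len());
    for i in 0..a.len() {
        if !valid[i] {
            out.push(0.0); // placeholder value for a null slot
        } else if b[i] == 0.0 {
            return Err(DivideByZero);
        } else {
            out.push(a[i] / b[i]);
        }
    }
    Ok(out)
}

/// Unchecked division: a branch-free loop, similar in shape to add or multiply.
fn divide_unchecked(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x / y).collect()
}
```

The unchecked variant simply follows IEEE 754 semantics, so dividing a non-zero value by zero yields an infinity instead of an error.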
**Assembly**: Inner loop of the autovectorized `multiply` kernel. It computes 64 floats before checking the loop condition:
```
0,57 │320:┌─→vmovups ymm0,YMMWORD PTR [rdi+rsi*4-0xe0]
2,75 │ │ vmovups ymm1,YMMWORD PTR [rdi+rsi*4-0xc0]
3,16 │ │ vmulps ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0xe0]
8,51 │ │ vmulps ymm1,ymm1,YMMWORD PTR [rax+rsi*4-0xc0]
3,91 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0x0],ymm0
1,44 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0x20],ymm1
3,97 │ │ vmovups ymm0,YMMWORD PTR [rdi+rsi*4-0xa0]
2,57 │ │ vmovups ymm1,YMMWORD PTR [rdi+rsi*4-0x80]
1,59 │ │ vmulps ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0xa0]
11,66 │ │ vmulps ymm1,ymm1,YMMWORD PTR [rax+rsi*4-0x80]
3,86 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0x40],ymm0
1,39 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0x60],ymm1
4,00 │ │ vmovups ymm0,YMMWORD PTR [rdi+rsi*4-0x60]
2,58 │ │ vmovups ymm1,YMMWORD PTR [rdi+rsi*4-0x40]
1,59 │ │ vmulps ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0x60]
9,60 │ │ vmulps ymm1,ymm1,YMMWORD PTR [rax+rsi*4-0x40]
4,19 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0x80],ymm0
1,36 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0xa0],ymm1
4,21 │ │ vmovups ymm0,YMMWORD PTR [rdi+rsi*4-0x20]
3,98 │ │ vmovups ymm1,YMMWORD PTR [rdi+rsi*4]
1,77 │ │ vmulps ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0x20]
10,02 │ │ vmulps ymm1,ymm1,YMMWORD PTR [rax+rsi*4]
3,82 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0xc0],ymm0
2,31 │ │ vmovups YMMWORD PTR [rbp+rsi*4+0xe0],ymm1
4,07 │ │ add rsi,0x40
0,26 │ │ add rdx,0x4
│ └──jne 320
```
Custom simd code works on 8-lane vectors (two of them, so 16 floats, per iteration of the loop below) and also contains additional bounds checks:
```
0,79 │1e0:┌─→cmp r15,rsi
3,88 │ │↓ je 214
0,00 │ │ cmp r13,rsi
0,54 │ │↓ je 214
│ │ vmovups ymm0,YMMWORD PTR [rdi+rsi*4]
12,98 │ │ vmovups ymm1,YMMWORD PTR [rdi+rsi*4+0x20]
11,00 │ │ vmulps ymm0,ymm0,YMMWORD PTR [rbx+rsi*4]
36,47 │ │ vmulps ymm1,ymm1,YMMWORD PTR [rbx+rsi*4+0x20]
15,89 │ │ vmovups YMMWORD PTR [rax+rsi*4+0x20],ymm1
10,02 │ │ vmovups YMMWORD PTR [rax+rsi*4],ymm0
5,77 │ │ add rsi,0x10
1,86 │ │ cmp r8,rsi
0,00 │ └──jne 1e0
```