Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2018/10/16 12:05:05 UTC
Algorithmic explorations of bitmaps vs. sentinel values
hi folks,
I explored a bit the performance implications of using validity
bitmaps (like the Arrow columnar format) vs. sentinel values (like
NaN, INT32_MIN) for nulls:
http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
The vectorization results may be of interest to those implementing
analytic functions targeting the Arrow memory format. There's probably
some other optimizations that can be employed, too.
Caveat: it's entirely possible I made some mistakes in my code. I
checked the various implementations for correctness only, and did not
dig too deeply beyond that.
- Wes
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Ted Dunning <te...@gmail.com>.
Memory binding can be viewed as an opportunity for melding multiple
aggregators.
For instance, any additional aggregation comes nearly for free: sum and
count (non-zero) will cost about the same as either alone, as will sum and
sum of squares.
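A hedged sketch of that fusion (hypothetical names, assuming a sentinel-encoded int64 column): since the loop is bound by the loads, the extra accumulator should cost close to nothing.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SumCount {
  int64_t sum = 0;
  int64_t count = 0;  // number of non-null entries
};

// One pass, one stream of loads, two accumulators: in a memory-bound loop
// the second accumulator hides behind the memory traffic of the first.
SumCount SumAndCount(const std::vector<int64_t>& values, int64_t null_sentinel) {
  SumCount out;
  for (int64_t v : values) {
    const int64_t valid = (v != null_sentinel);  // 0 or 1, branch-free
    out.sum += valid * v;
    out.count += valid;
  }
  return out;
}
```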
On Wed, Oct 17, 2018, 06:21 Francois Saint-Jacques <
fsaintjacques@networkdump.com> wrote:
> Don't trust that the compiler will auto-vectorize: tiny changes can have
> drastic impacts.
>
> SumScalar (benchmark-native):
> │ state->total += *values++;
> 42 │ add (%rcx),%esi
> 68 │ mov %esi,(%rsp)
> 3 │ add 0x4(%rcx),%esi
> 36 │ mov %esi,(%rsp)
> │ add 0x8(%rcx),%esi
> 86 │ mov %esi,(%rsp)
> │ add 0xc(%rcx),%esi
> 51 │ mov %esi,(%rsp)
> 27 │ add 0x10(%rcx),%esi
> 05 │ mov %esi,(%rsp)
> 45 │ add 0x14(%rcx),%esi
> 22 │ mov %esi,(%rsp)
> 2 │ add 0x18(%rcx),%esi
> 68 │ mov %esi,(%rsp)
> 1 │ add 0x1c(%rcx),%esi
> 1 │ add $0x20,%rcx
> 54 │ mov %esi,(%rsp)
>
> SumScalarRegister (benchmark-native):
> │ sum += *values++;
> 228 │ vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
> 164 │ vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
> 163 │ vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
> 438 │ vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
> 207 │ vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
> 146 │ vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
> 176 │ vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
> 447 │ vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
> 210 │ vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
> 161 │ vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
> 173 │ vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
> 392 │ vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
> 217 │ vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
> 153 │ vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
> 156 │ vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
> 654 │ vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
>
> perf strongly indicates that this is highly memory bound. I would
> expect/guesstimate the bitmap version to be slower when memory
> bandwidth is saturated (multiple threads computing vectors?), which
> might not show in a single-threaded benchmark.
>
> On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I'm surprised that using a stack variable has an impact, but I should
> > probably update the benchmarks to do that (and merge with the
> > SumState<T> at the end of the function) for thoroughness. Thanks!
> > On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> > <fs...@networkdump.com> wrote:
> > >
> > > It seems the code for the naive Scalar example is not friendly with the
> > > compiler's auto-vectorizer. If you accumulate in a local state
> > > (instead of the SumState pointer), you'll get different results, at
> > > least with clang++6.0.
> > >
> > > benchmark-noavx (only SSE):
> > > BM_SumInt32Scalar 42390 us 42389 us 16
> > > bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > > BM_SumInt32ScalarRegister 28036 us 28035 us 25
> > > bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > > BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
> > > bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > > BM_SumInt32ScalarSentinel 49541 us 49540 us 14
> > > bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > > BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
> > > bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > > BM_SumInt64Scalar 65030 us 65030 us 11
> > > bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > > BM_SumInt64ScalarRegister 55370 us 55369 us 13
> > > bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > > BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
> > > bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > > BM_SumInt64ScalarSentinel 69792 us 69792 us 10
> > > bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > > BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
> > > bytes_per_second=10.89G/s items_per_second=1.46164G/s
> > >
> > > benchmark-native (AVX512):
> > > BM_SumInt32Scalar 43981 us 43980 us 16
> > > bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > > BM_SumInt32ScalarRegister 28381 us 28380 us 24
> > > bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > > BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
> > > bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > > BM_SumInt32ScalarSentinel 53430 us 53429 us 13
> > > bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > > BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
> > > bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > > BM_SumInt64Scalar 68112 us 68111 us 10
> > > bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > > BM_SumInt64ScalarRegister 56188 us 56187 us 12
> > > bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > > BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
> > > bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > > BM_SumInt64ScalarSentinel 69674 us 69673 us 10
> > > bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > > BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
> > > bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> > >
> > > The interesting part here is that the ScalarSentinelRegister version is
> > > very close to the fully vectorized ScalarRegister version on AVX512.
> > > Antoine, the compiler is doing exactly what you're suggesting:
> > >
> > > 374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> > > 48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > > 230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> > > │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> > > │ vpcmpneqq %ymm13,%ymm8,%k2
> > > 25 │ vpcmpneqq %ymm13,%ymm9,%k3
> > > 20 │ vpcmpneqq %ymm13,%ymm10,%k4
> > > 72 │ vpcmpneqq %ymm13,%ymm11,%k1
> > > │ if (Traits<T>::NotNull(*values)) {
> > > │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> > > 15 │ vpaddq %ymm12,%ymm3,%ymm3
> > > 9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> > > 57 │ vpaddq %ymm12,%ymm5,%ymm5
> > > │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> > > 39 │ vpaddq %ymm12,%ymm6,%ymm6
> > > 22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> > > 88 │ vpaddq %ymm12,%ymm7,%ymm7
> > >
> > > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > > didn't have time to look at the other Sum (bitmap) implementation or to
> > > clean the code. I'll try to improve this over the weekend.
> > >
> > > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > hi folks,
> > > >
> > > > I explored a bit the performance implications of using validity
> > > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > > NaN, INT32_MIN) for nulls:
> > > >
> > > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > > >
> > > > The vectorization results may be of interest to those implementing
> > > > analytic functions targeting the Arrow memory format. There's
> probably
> > > > some other optimizations that can be employed, too.
> > > >
> > > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > > checked the various implementations for correctness only, and did not
> > > > dig too deeply beyond that.
> > > >
> > > > - Wes
> > > >
> > >
> > >
> > > --
> > > Sent from my jetpack.
> >
>
>
> --
> Sent from my jetpack.
>
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Francois Saint-Jacques <fs...@networkdump.com>.
Don't trust that the compiler will auto-vectorize: tiny changes can have
drastic impacts.
SumScalar (benchmark-native):
│ state->total += *values++;
42 │ add (%rcx),%esi
68 │ mov %esi,(%rsp)
3 │ add 0x4(%rcx),%esi
36 │ mov %esi,(%rsp)
│ add 0x8(%rcx),%esi
86 │ mov %esi,(%rsp)
│ add 0xc(%rcx),%esi
51 │ mov %esi,(%rsp)
27 │ add 0x10(%rcx),%esi
05 │ mov %esi,(%rsp)
45 │ add 0x14(%rcx),%esi
22 │ mov %esi,(%rsp)
2 │ add 0x18(%rcx),%esi
68 │ mov %esi,(%rsp)
1 │ add 0x1c(%rcx),%esi
1 │ add $0x20,%rcx
54 │ mov %esi,(%rsp)
SumScalarRegister (benchmark-native):
│ sum += *values++;
228 │ vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
164 │ vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
163 │ vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
438 │ vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
207 │ vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
146 │ vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
176 │ vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
447 │ vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
210 │ vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
161 │ vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
173 │ vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
392 │ vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
217 │ vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
153 │ vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
156 │ vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
654 │ vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
perf strongly indicates that this is highly memory bound. I would
expect/guesstimate the bitmap version to be slower when memory
bandwidth is saturated (multiple threads computing vectors?), which
might not show in a single-threaded benchmark.
On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <we...@gmail.com> wrote:
> I'm surprised that using a stack variable has an impact, but I should
> probably update the benchmarks to do that (and merge with the
> SumState<T> at the end of the function) for thoroughness. Thanks!
> On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> <fs...@networkdump.com> wrote:
> >
> > It seems the code for the naive Scalar example is not friendly with the
> > compiler's auto-vectorizer. If you accumulate in a local state
> > (instead of the SumState pointer), you'll get different results, at least
> > with clang++6.0.
> >
> > benchmark-noavx (only SSE):
> > BM_SumInt32Scalar 42390 us 42389 us 16
> > bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > BM_SumInt32ScalarRegister 28036 us 28035 us 25
> > bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
> > bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > BM_SumInt32ScalarSentinel 49541 us 49540 us 14
> > bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
> > bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > BM_SumInt64Scalar 65030 us 65030 us 11
> > bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > BM_SumInt64ScalarRegister 55370 us 55369 us 13
> > bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
> > bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > BM_SumInt64ScalarSentinel 69792 us 69792 us 10
> > bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
> > bytes_per_second=10.89G/s items_per_second=1.46164G/s
> >
> > benchmark-native (AVX512):
> > BM_SumInt32Scalar 43981 us 43980 us 16
> > bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > BM_SumInt32ScalarRegister 28381 us 28380 us 24
> > bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
> > bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > BM_SumInt32ScalarSentinel 53430 us 53429 us 13
> > bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
> > bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > BM_SumInt64Scalar 68112 us 68111 us 10
> > bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > BM_SumInt64ScalarRegister 56188 us 56187 us 12
> > bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
> > bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > BM_SumInt64ScalarSentinel 69674 us 69673 us 10
> > bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
> > bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> >
> > The interesting part here is that the ScalarSentinelRegister version is
> > very close to the fully vectorized ScalarRegister version on AVX512.
> > Antoine, the compiler is doing exactly what you're suggesting:
> >
> > 374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> > 48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > 230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> > │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> > │ vpcmpneqq %ymm13,%ymm8,%k2
> > 25 │ vpcmpneqq %ymm13,%ymm9,%k3
> > 20 │ vpcmpneqq %ymm13,%ymm10,%k4
> > 72 │ vpcmpneqq %ymm13,%ymm11,%k1
> > │ if (Traits<T>::NotNull(*values)) {
> > │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> > 15 │ vpaddq %ymm12,%ymm3,%ymm3
> > 9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> > 57 │ vpaddq %ymm12,%ymm5,%ymm5
> > │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> > 39 │ vpaddq %ymm12,%ymm6,%ymm6
> > 22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> > 88 │ vpaddq %ymm12,%ymm7,%ymm7
> >
> > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > didn't have time to look at the other Sum (bitmap) implementation or to
> > clean the code. I'll try to improve this over the weekend.
> >
> > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi folks,
> > >
> > > I explored a bit the performance implications of using validity
> > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > NaN, INT32_MIN) for nulls:
> > >
> > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > >
> > > The vectorization results may be of interest to those implementing
> > > analytic functions targeting the Arrow memory format. There's probably
> > > some other optimizations that can be employed, too.
> > >
> > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > checked the various implementations for correctness only, and did not
> > > dig too deeply beyond that.
> > >
> > > - Wes
> > >
> >
> >
> > --
> > Sent from my jetpack.
>
--
Sent from my jetpack.
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Wes McKinney <we...@gmail.com>.
I'm surprised that using a stack variable has an impact, but I should
probably update the benchmarks to do that (and merge with the
SumState<T> at the end of the function) for thoroughness. Thanks!
On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
<fs...@networkdump.com> wrote:
>
> It seems the code for the naive Scalar example is not friendly with the
> compiler's auto-vectorizer. If you accumulate in a local state
> (instead of the SumState pointer), you'll get different results, at least
> with clang++6.0.
>
> benchmark-noavx (only SSE):
> BM_SumInt32Scalar 42390 us 42389 us 16
> bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> BM_SumInt32ScalarRegister 28036 us 28035 us 25
> bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
> bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> BM_SumInt32ScalarSentinel 49541 us 49540 us 14
> bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
> bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> BM_SumInt64Scalar 65030 us 65030 us 11
> bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> BM_SumInt64ScalarRegister 55370 us 55369 us 13
> bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
> bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> BM_SumInt64ScalarSentinel 69792 us 69792 us 10
> bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
> bytes_per_second=10.89G/s items_per_second=1.46164G/s
>
> benchmark-native (AVX512):
> BM_SumInt32Scalar 43981 us 43980 us 16
> bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> BM_SumInt32ScalarRegister 28381 us 28380 us 24
> bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
> bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> BM_SumInt32ScalarSentinel 53430 us 53429 us 13
> bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
> bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> BM_SumInt64Scalar 68112 us 68111 us 10
> bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> BM_SumInt64ScalarRegister 56188 us 56187 us 12
> bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
> bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> BM_SumInt64ScalarSentinel 69674 us 69673 us 10
> bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
> bytes_per_second=12.5857G/s items_per_second=1.68922G/s
>
> The interesting part here is that the ScalarSentinelRegister version is
> very close to the fully vectorized ScalarRegister version on AVX512.
> Antoine, the compiler is doing exactly what you're suggesting:
>
> 374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> 48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> 230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> │ vpcmpneqq %ymm13,%ymm8,%k2
> 25 │ vpcmpneqq %ymm13,%ymm9,%k3
> 20 │ vpcmpneqq %ymm13,%ymm10,%k4
> 72 │ vpcmpneqq %ymm13,%ymm11,%k1
> │ if (Traits<T>::NotNull(*values)) {
> │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> 15 │ vpaddq %ymm12,%ymm3,%ymm3
> 9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> 57 │ vpaddq %ymm12,%ymm5,%ymm5
> │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> 39 │ vpaddq %ymm12,%ymm6,%ymm6
> 22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> 88 │ vpaddq %ymm12,%ymm7,%ymm7
>
> See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> didn't have time to look at the other Sum (bitmap) implementation or to
> clean the code. I'll try to improve this over the weekend.
>
> On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi folks,
> >
> > I explored a bit the performance implications of using validity
> > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > NaN, INT32_MIN) for nulls:
> >
> > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >
> > The vectorization results may be of interest to those implementing
> > analytic functions targeting the Arrow memory format. There's probably
> > some other optimizations that can be employed, too.
> >
> > Caveat: it's entirely possible I made some mistakes in my code. I
> > checked the various implementations for correctness only, and did not
> > dig too deeply beyond that.
> >
> > - Wes
> >
>
>
> --
> Sent from my jetpack.
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Francois Saint-Jacques <fs...@networkdump.com>.
It seems the code for the naive Scalar example is not friendly with the
compiler's auto-vectorizer. If you accumulate in a local state
(instead of the SumState pointer), you'll get different results, at least
with clang++6.0.
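The difference can be sketched roughly like this (a reconstruction following the benchmark labels, not the exact code in the fork):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct SumState {
  int32_t total = 0;
};

// Accumulating through the pointer: the compiler must assume `state` may
// alias `values`, so the running total is stored back every iteration and
// the reduction is not vectorized.
void SumScalar(const int32_t* values, size_t n, SumState* state) {
  for (size_t i = 0; i < n; ++i) {
    state->total += values[i];
  }
}

// Accumulating in a local and merging once at the end: the total can live
// in registers, and the loop becomes a plain reduction the auto-vectorizer
// handles well.
void SumScalarRegister(const int32_t* values, size_t n, SumState* state) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    sum += values[i];
  }
  state->total += sum;
}
```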
benchmark-noavx (only SSE):
BM_SumInt32Scalar 42390 us 42389 us 16
bytes_per_second=8.78824G/s items_per_second=2.35908G/s
BM_SumInt32ScalarRegister 28036 us 28035 us 25
bytes_per_second=13.2878G/s items_per_second=3.56691G/s
BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
bytes_per_second=11.7668G/s items_per_second=3.15863G/s
BM_SumInt32ScalarSentinel 49541 us 49540 us 14
bytes_per_second=7.51971G/s items_per_second=2.01856G/s
BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
bytes_per_second=6.69369G/s items_per_second=1.79683G/s
BM_SumInt64Scalar 65030 us 65030 us 11
bytes_per_second=11.4572G/s items_per_second=1.53776G/s
BM_SumInt64ScalarRegister 55370 us 55369 us 13
bytes_per_second=13.4563G/s items_per_second=1.80607G/s
BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
bytes_per_second=12.1571G/s items_per_second=1.6317G/s
BM_SumInt64ScalarSentinel 69792 us 69792 us 10
bytes_per_second=10.6755G/s items_per_second=1.43284G/s
BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
bytes_per_second=10.89G/s items_per_second=1.46164G/s
benchmark-native (AVX512):
BM_SumInt32Scalar 43981 us 43980 us 16
bytes_per_second=8.47046G/s items_per_second=2.27377G/s
BM_SumInt32ScalarRegister 28381 us 28380 us 24
bytes_per_second=13.1263G/s items_per_second=3.52355G/s
BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
bytes_per_second=4.71361G/s items_per_second=1.2653G/s
BM_SumInt32ScalarSentinel 53430 us 53429 us 13
bytes_per_second=6.97235G/s items_per_second=1.87163G/s
BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
bytes_per_second=11.4802G/s items_per_second=3.0817G/s
BM_SumInt64Scalar 68112 us 68111 us 10
bytes_per_second=10.9389G/s items_per_second=1.46819G/s
BM_SumInt64ScalarRegister 56188 us 56187 us 12
bytes_per_second=13.2603G/s items_per_second=1.77977G/s
BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
bytes_per_second=11.3492G/s items_per_second=1.52327G/s
BM_SumInt64ScalarSentinel 69674 us 69673 us 10
bytes_per_second=10.6937G/s items_per_second=1.43528G/s
BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
bytes_per_second=12.5857G/s items_per_second=1.68922G/s
The interesting part here is that the ScalarSentinelRegister version is
very close to the fully vectorized ScalarRegister version on AVX512.
Antoine, the compiler is doing exactly what you're suggesting:
374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
│ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
│ vpcmpneqq %ymm13,%ymm8,%k2
25 │ vpcmpneqq %ymm13,%ymm9,%k3
20 │ vpcmpneqq %ymm13,%ymm10,%k4
72 │ vpcmpneqq %ymm13,%ymm11,%k1
│ if (Traits<T>::NotNull(*values)) {
│ vpbroadcastq %rbx,%ymm12{%k2}{z}
15 │ vpaddq %ymm12,%ymm3,%ymm3
9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
57 │ vpaddq %ymm12,%ymm5,%ymm5
│ vpbroadcastq %rbx,%ymm12{%k4}{z}
39 │ vpaddq %ymm12,%ymm6,%ymm6
22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
88 │ vpaddq %ymm12,%ymm7,%ymm7
See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
didn't have time to look at the other Sum (bitmap) implementation or to
clean the code. I'll try to improve this over the weekend.
On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com> wrote:
> hi folks,
>
> I explored a bit the performance implications of using validity
> bitmaps (like the Arrow columnar format) vs. sentinel values (like
> NaN, INT32_MIN) for nulls:
>
> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> The vectorization results may be of interest to those implementing
> analytic functions targeting the Arrow memory format. There's probably
> some other optimizations that can be employed, too.
>
> Caveat: it's entirely possible I made some mistakes in my code. I
> checked the various implementations for correctness only, and did not
> dig too deeply beyond that.
>
> - Wes
>
--
Sent from my jetpack.
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Antoine Pitrou <an...@python.org>.
On 16/10/2018 18:55, Ted Dunning wrote:
> It should be possible to unroll the sentinel version in many cases. For
> instance,
>
> sum += (data[i] != SENTINEL) * data[i]
>
> This doesn't work with NaN as a sentinel, since arithmetic involving NaN
> yields NaN, but it can work with other values.
One should also explore whether the multiplication pattern is really
optimal. Perhaps a condition would work better. The compiler may be
able to turn it into a cheap conditional move (to suppress branch
misprediction costs).
Regards
Antoine.
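One conditional form (a sketch, not measured here) uses a select that compilers typically lower to a cmov or a masked vector operation; unlike the multiplication trick, it also handles a NaN sentinel, since the null value is never fed into the arithmetic:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Conditional sentinel sum for doubles with NaN as the null marker.
// `x == x` is false only for NaN, and the ternary compiles to a select
// (conditional move / vector blend) rather than an unpredictable branch.
double SumConditional(const std::vector<double>& data) {
  double sum = 0.0;
  for (double x : data) {
    sum += (x == x) ? x : 0.0;
  }
  return sum;
}
```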
>
>
> On Tue, Oct 16, 2018 at 9:38 AM Antoine Pitrou <an...@python.org> wrote:
>
>>
>> Hi Wes,
>>
>> On 16/10/2018 14:05, Wes McKinney wrote:
>>> hi folks,
>>>
>>> I explored a bit the performance implications of using validity
>>> bitmaps (like the Arrow columnar format) vs. sentinel values (like
>>> NaN, INT32_MIN) for nulls:
>>>
>>> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>>>
>>> The vectorization results may be of interest to those implementing
>>> analytic functions targeting the Arrow memory format. There's probably
>>> some other optimizations that can be employed, too.
>>
>> This is a nice write-up. It may also be possible to further speed up
>> things using explicit SIMD operations.
>>
>> For the non-null case, it should be relatively doable, see e.g.
>> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/int/reduce_add.html
>> or
>> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/fp/reduce_add.html
>> .
>>
>> For the with-nulls case, it might be possible to do something with SIMD
>> masks, but I'm not competent to propose anything concrete :-)
>>
>> Regards
>>
>> Antoine.
>>
>>
>>>
>>> Caveat: it's entirely possible I made some mistakes in my code. I
>>> checked the various implementations for correctness only, and did not
>>> dig too deeply beyond that.
>>>
>>> - Wes
>>>
>>
>
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Ted Dunning <te...@gmail.com>.
It should be possible to unroll the sentinel version in many cases. For
instance,
sum += (data[i] != SENTINEL) * data[i]
This doesn't work with NaN as a sentinel, since arithmetic involving NaN
yields NaN, but it can work with other values.
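Concretely, with an integer sentinel, the branch-free loop might look like this (a sketch with INT32_MIN as the sentinel; the 0/1 comparison result replaces the branch):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Branch-free sentinel sum: (x != SENTINEL) evaluates to 0 or 1, so null
// entries contribute 0 without any branch. As noted above, this breaks
// when the sentinel is NaN, since arithmetic involving NaN stays NaN.
int64_t SumBranchless(const std::vector<int32_t>& data) {
  const int32_t SENTINEL = std::numeric_limits<int32_t>::min();
  int64_t sum = 0;
  for (int32_t x : data) {
    sum += static_cast<int64_t>(x != SENTINEL) * x;
  }
  return sum;
}
```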
On Tue, Oct 16, 2018 at 9:38 AM Antoine Pitrou <an...@python.org> wrote:
>
> Hi Wes,
>
> On 16/10/2018 14:05, Wes McKinney wrote:
> > hi folks,
> >
> > I explored a bit the performance implications of using validity
> > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > NaN, INT32_MIN) for nulls:
> >
> > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >
> > The vectorization results may be of interest to those implementing
> > analytic functions targeting the Arrow memory format. There's probably
> > some other optimizations that can be employed, too.
>
> This is a nice write-up. It may also be possible to further speed up
> things using explicit SIMD operations.
>
> For the non-null case, it should be relatively doable, see e.g.
> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/int/reduce_add.html
> or
> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/fp/reduce_add.html
> .
>
> For the with-nulls case, it might be possible to do something with SIMD
> masks, but I'm not competent to propose anything concrete :-)
>
> Regards
>
> Antoine.
>
>
> >
> > Caveat: it's entirely possible I made some mistakes in my code. I
> > checked the various implementations for correctness only, and did not
> > dig too deeply beyond that.
> >
> > - Wes
> >
>
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Antoine Pitrou <an...@python.org>.
Hi Wes,
On 16/10/2018 14:05, Wes McKinney wrote:
> hi folks,
>
> I explored a bit the performance implications of using validity
> bitmaps (like the Arrow columnar format) vs. sentinel values (like
> NaN, INT32_MIN) for nulls:
>
> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> The vectorization results may be of interest to those implementing
> analytic functions targeting the Arrow memory format. There's probably
> some other optimizations that can be employed, too.
This is a nice write-up. It may also be possible to further speed up
things using explicit SIMD operations.
For the non-null case, it should be relatively doable, see e.g.
https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/int/reduce_add.html
or
https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/fp/reduce_add.html .
For the with-nulls case, it might be possible to do something with SIMD
masks, but I'm not competent to propose anything concrete :-)
Regards
Antoine.
>
> Caveat: it's entirely possible I made some mistakes in my code. I
> checked the various implementations for correctness only, and did not
> dig too deeply beyond that.
>
> - Wes
>