Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2018/10/16 12:05:05 UTC
Algorithmic explorations of bitmaps vs. sentinel values
hi folks,
I explored a bit the performance implications of using validity
bitmaps (like the Arrow columnar format) vs. sentinel values (like
NaN, INT32_MIN) for nulls:
http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
The vectorization results may be of interest to those implementing
analytic functions targeting the Arrow memory format. There's probably
some other optimizations that can be employed, too.
Caveat: it's entirely possible I made some mistakes in my code. I
checked the various implementations for correctness only, and did not
dig too deeply beyond that.
- Wes
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Ted Dunning <te...@gmail.com>.
Memory binding can be viewed as an opportunity for melding multiple
aggregators.
For instance, any additional aggregation comes nearly for free: sum and
count (non-zero) will cost about the same as either alone, as will sum and
sum of squares.
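A hedged sketch of that fusion (hypothetical names, assuming a sentinel-encoded int64 column): since the loop is bound by the loads, the extra accumulator should cost close to nothing.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SumCount {
  int64_t sum = 0;
  int64_t count = 0;  // number of non-null entries
};

// One pass, one stream of loads, two accumulators: in a memory-bound loop
// the second accumulator hides behind the memory traffic of the first.
SumCount SumAndCount(const std::vector<int64_t>& values, int64_t null_sentinel) {
  SumCount out;
  for (int64_t v : values) {
    const int64_t valid = (v != null_sentinel);  // 0 or 1, branch-free
    out.sum += valid * v;
    out.count += valid;
  }
  return out;
}
```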
On Wed, Oct 17, 2018, 06:21 Francois Saint-Jacques <
fsaintjacques@networkdump.com> wrote:
> Don't trust that the compiler will auto-vectorize: tiny changes can have
> drastic impacts.
>
> SumScalar (benchmark-native):
> │ state->total += *values++;
> 42 │ add (%rcx),%esi
> 68 │ mov %esi,(%rsp)
> 3 │ add 0x4(%rcx),%esi
> 36 │ mov %esi,(%rsp)
> │ add 0x8(%rcx),%esi
> 86 │ mov %esi,(%rsp)
> │ add 0xc(%rcx),%esi
> 51 │ mov %esi,(%rsp)
> 27 │ add 0x10(%rcx),%esi
> 05 │ mov %esi,(%rsp)
> 45 │ add 0x14(%rcx),%esi
> 22 │ mov %esi,(%rsp)
> 2 │ add 0x18(%rcx),%esi
> 68 │ mov %esi,(%rsp)
> 1 │ add 0x1c(%rcx),%esi
> 1 │ add $0x20,%rcx
> 54 │ mov %esi,(%rsp)
>
> SumScalarRegister (benchmark-native):
> │ sum += *values++;
> 228 │ vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
> 164 │ vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
> 163 │ vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
> 438 │ vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
> 207 │ vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
> 146 │ vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
> 176 │ vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
> 447 │ vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
> 210 │ vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
> 161 │ vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
> 173 │ vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
> 392 │ vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
> 217 │ vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
> 153 │ vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
> 156 │ vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
> 654 │ vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
>
> perf strongly indicates that this is highly memory bound. I would
> expect/guesstimate the bitmap version to be slower when memory
> bandwidth is saturated (multiple threads computing vectors?), which
> might not show in a single-threaded benchmark.
>
> On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I'm surprised that using a stack variable has an impact, but I should
> > probably update the benchmarks to do that (and merge with the
> > SumState<T> at the end of the function) for thoroughness. Thanks!
> > On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> > <fs...@networkdump.com> wrote:
> > >
> > > It seems the code for the naive Scalar example is not friendly with the
> > > compiler's auto-vectorizer. If you accumulate in a local state
> > > (instead of the SumState pointer), you'll get different results, at
> > > least with clang++6.0.
> > >
> > > benchmark-noavx (only SSE):
> > > BM_SumInt32Scalar 42390 us 42389 us 16
> > > bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > > BM_SumInt32ScalarRegister 28036 us 28035 us 25
> > > bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > > BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
> > > bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > > BM_SumInt32ScalarSentinel 49541 us 49540 us 14
> > > bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > > BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
> > > bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > > BM_SumInt64Scalar 65030 us 65030 us 11
> > > bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > > BM_SumInt64ScalarRegister 55370 us 55369 us 13
> > > bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > > BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
> > > bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > > BM_SumInt64ScalarSentinel 69792 us 69792 us 10
> > > bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > > BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
> > > bytes_per_second=10.89G/s items_per_second=1.46164G/s
> > >
> > > benchmark-native (AVX512):
> > > BM_SumInt32Scalar 43981 us 43980 us 16
> > > bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > > BM_SumInt32ScalarRegister 28381 us 28380 us 24
> > > bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > > BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
> > > bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > > BM_SumInt32ScalarSentinel 53430 us 53429 us 13
> > > bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > > BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
> > > bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > > BM_SumInt64Scalar 68112 us 68111 us 10
> > > bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > > BM_SumInt64ScalarRegister 56188 us 56187 us 12
> > > bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > > BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
> > > bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > > BM_SumInt64ScalarSentinel 69674 us 69673 us 10
> > > bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > > BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
> > > bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> > >
> > > The interesting part here is that the ScalarSentinelRegister version is
> > > very close to the fully vectorized ScalarRegister version on AVX512.
> > > Antoine, the compiler is doing exactly what you're suggesting:
> > >
> > > 374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> > > 48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > > 230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> > > │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> > > │ vpcmpneqq %ymm13,%ymm8,%k2
> > > 25 │ vpcmpneqq %ymm13,%ymm9,%k3
> > > 20 │ vpcmpneqq %ymm13,%ymm10,%k4
> > > 72 │ vpcmpneqq %ymm13,%ymm11,%k1
> > > │ if (Traits<T>::NotNull(*values)) {
> > > │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> > > 15 │ vpaddq %ymm12,%ymm3,%ymm3
> > > 9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> > > 57 │ vpaddq %ymm12,%ymm5,%ymm5
> > > │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> > > 39 │ vpaddq %ymm12,%ymm6,%ymm6
> > > 22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> > > 88 │ vpaddq %ymm12,%ymm7,%ymm7
> > >
> > > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > > didn't have time to look at the other Sum (bitmap) implementation or to
> > > clean the code. I'll try to improve this over the weekend.
> > >
> > > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > hi folks,
> > > >
> > > > I explored a bit the performance implications of using validity
> > > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > > NaN, INT32_MIN) for nulls:
> > > >
> > > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > > >
> > > > The vectorization results may be of interest to those implementing
> > > > analytic functions targeting the Arrow memory format. There's
> probably
> > > > some other optimizations that can be employed, too.
> > > >
> > > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > > checked the various implementations for correctness only, and did not
> > > > dig too deeply beyond that.
> > > >
> > > > - Wes
> > > >
> > >
> > >
> > > --
> > > Sent from my jetpack.
> >
>
>
> --
> Sent from my jetpack.
>
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Francois Saint-Jacques <fs...@networkdump.com>.
Don't trust that the compiler will auto-vectorize: tiny changes can have
drastic impacts.
SumScalar (benchmark-native):
│ state->total += *values++;
42 │ add (%rcx),%esi
68 │ mov %esi,(%rsp)
3 │ add 0x4(%rcx),%esi
36 │ mov %esi,(%rsp)
│ add 0x8(%rcx),%esi
86 │ mov %esi,(%rsp)
│ add 0xc(%rcx),%esi
51 │ mov %esi,(%rsp)
27 │ add 0x10(%rcx),%esi
05 │ mov %esi,(%rsp)
45 │ add 0x14(%rcx),%esi
22 │ mov %esi,(%rsp)
2 │ add 0x18(%rcx),%esi
68 │ mov %esi,(%rsp)
1 │ add 0x1c(%rcx),%esi
1 │ add $0x20,%rcx
54 │ mov %esi,(%rsp)
SumScalarRegister (benchmark-native):
│ sum += *values++;
228 │ vpaddd (%rcx,%rdi,4),%zmm0,%zmm0
164 │ vpaddd 0x40(%rcx,%rdi,4),%zmm1,%zmm1
163 │ vpaddd 0x80(%rcx,%rdi,4),%zmm2,%zmm2
438 │ vpaddd 0xc0(%rcx,%rdi,4),%zmm3,%zmm3
207 │ vpaddd 0x100(%rcx,%rdi,4),%zmm0,%zmm0
146 │ vpaddd 0x140(%rcx,%rdi,4),%zmm1,%zmm1
176 │ vpaddd 0x180(%rcx,%rdi,4),%zmm2,%zmm2
447 │ vpaddd 0x1c0(%rcx,%rdi,4),%zmm3,%zmm3
210 │ vpaddd 0x200(%rcx,%rdi,4),%zmm0,%zmm0
161 │ vpaddd 0x240(%rcx,%rdi,4),%zmm1,%zmm1
173 │ vpaddd 0x280(%rcx,%rdi,4),%zmm2,%zmm2
392 │ vpaddd 0x2c0(%rcx,%rdi,4),%zmm3,%zmm3
217 │ vpaddd 0x300(%rcx,%rdi,4),%zmm0,%zmm0
153 │ vpaddd 0x340(%rcx,%rdi,4),%zmm1,%zmm1
156 │ vpaddd 0x380(%rcx,%rdi,4),%zmm2,%zmm2
654 │ vpaddd 0x3c0(%rcx,%rdi,4),%zmm3,%zmm3
perf strongly indicates that this is highly memory bound. I would
expect/guesstimate the bitmap version to be slower when memory
bandwidth is saturated (multiple threads computing vectors?), which
might not show in a single-threaded benchmark.
On Wed, Oct 17, 2018 at 9:11 AM Wes McKinney <we...@gmail.com> wrote:
> I'm surprised that using a stack variable has an impact, but I should
> probably update the benchmarks to do that (and merge with the
> SumState<T> at the end of the function) for thoroughness. Thanks!
> On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
> <fs...@networkdump.com> wrote:
> >
> > It seems the code for the naive Scalar example is not friendly with the
> > compiler's auto-vectorizer. If you accumulate in a local state
> > (instead of the SumState pointer), you'll get different results, at least
> > with clang++6.0.
> >
> > benchmark-noavx (only SSE):
> > BM_SumInt32Scalar 42390 us 42389 us 16
> > bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> > BM_SumInt32ScalarRegister 28036 us 28035 us 25
> > bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> > BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
> > bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> > BM_SumInt32ScalarSentinel 49541 us 49540 us 14
> > bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> > BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
> > bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> > BM_SumInt64Scalar 65030 us 65030 us 11
> > bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> > BM_SumInt64ScalarRegister 55370 us 55369 us 13
> > bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> > BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
> > bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> > BM_SumInt64ScalarSentinel 69792 us 69792 us 10
> > bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> > BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
> > bytes_per_second=10.89G/s items_per_second=1.46164G/s
> >
> > benchmark-native (AVX512):
> > BM_SumInt32Scalar 43981 us 43980 us 16
> > bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> > BM_SumInt32ScalarRegister 28381 us 28380 us 24
> > bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> > BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
> > bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> > BM_SumInt32ScalarSentinel 53430 us 53429 us 13
> > bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> > BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
> > bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> > BM_SumInt64Scalar 68112 us 68111 us 10
> > bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> > BM_SumInt64ScalarRegister 56188 us 56187 us 12
> > bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> > BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
> > bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> > BM_SumInt64ScalarSentinel 69674 us 69673 us 10
> > bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> > BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
> > bytes_per_second=12.5857G/s items_per_second=1.68922G/s
> >
> > The interesting part here is that the ScalarSentinelRegister version is
> > very close to the fully vectorized ScalarRegister version on AVX512.
> > Antoine, the compiler is doing exactly what you're suggesting:
> >
> > 374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> > 48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> > 230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> > │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> > │ vpcmpneqq %ymm13,%ymm8,%k2
> > 25 │ vpcmpneqq %ymm13,%ymm9,%k3
> > 20 │ vpcmpneqq %ymm13,%ymm10,%k4
> > 72 │ vpcmpneqq %ymm13,%ymm11,%k1
> > │ if (Traits<T>::NotNull(*values)) {
> > │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> > 15 │ vpaddq %ymm12,%ymm3,%ymm3
> > 9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> > 57 │ vpaddq %ymm12,%ymm5,%ymm5
> > │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> > 39 │ vpaddq %ymm12,%ymm6,%ymm6
> > 22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> > 88 │ vpaddq %ymm12,%ymm7,%ymm7
> >
> > See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> > didn't have time to look at the other Sum (bitmap) implementation or to
> > clean the code. I'll try to improve this over the weekend.
> >
> > On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi folks,
> > >
> > > I explored a bit the performance implications of using validity
> > > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > > NaN, INT32_MIN) for nulls:
> > >
> > > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > >
> > > The vectorization results may be of interest to those implementing
> > > analytic functions targeting the Arrow memory format. There's probably
> > > some other optimizations that can be employed, too.
> > >
> > > Caveat: it's entirely possible I made some mistakes in my code. I
> > > checked the various implementations for correctness only, and did not
> > > dig too deeply beyond that.
> > >
> > > - Wes
> > >
> >
> >
> > --
> > Sent from my jetpack.
>
--
Sent from my jetpack.
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Wes McKinney <we...@gmail.com>.
I'm surprised that using a stack variable has an impact, but I should
probably update the benchmarks to do that (and merge with the
SumState<T> at the end of the function) for thoroughness. Thanks!
On Wed, Oct 17, 2018 at 9:07 AM Francois Saint-Jacques
<fs...@networkdump.com> wrote:
>
> It seems the code for the naive Scalar example is not friendly with the
> compiler's auto-vectorizer. If you accumulate in a local state
> (instead of the SumState pointer), you'll get different results, at least
> with clang++6.0.
>
> benchmark-noavx (only SSE):
> BM_SumInt32Scalar 42390 us 42389 us 16
> bytes_per_second=8.78824G/s items_per_second=2.35908G/s
> BM_SumInt32ScalarRegister 28036 us 28035 us 25
> bytes_per_second=13.2878G/s items_per_second=3.56691G/s
> BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
> bytes_per_second=11.7668G/s items_per_second=3.15863G/s
> BM_SumInt32ScalarSentinel 49541 us 49540 us 14
> bytes_per_second=7.51971G/s items_per_second=2.01856G/s
> BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
> bytes_per_second=6.69369G/s items_per_second=1.79683G/s
> BM_SumInt64Scalar 65030 us 65030 us 11
> bytes_per_second=11.4572G/s items_per_second=1.53776G/s
> BM_SumInt64ScalarRegister 55370 us 55369 us 13
> bytes_per_second=13.4563G/s items_per_second=1.80607G/s
> BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
> bytes_per_second=12.1571G/s items_per_second=1.6317G/s
> BM_SumInt64ScalarSentinel 69792 us 69792 us 10
> bytes_per_second=10.6755G/s items_per_second=1.43284G/s
> BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
> bytes_per_second=10.89G/s items_per_second=1.46164G/s
>
> benchmark-native (AVX512):
> BM_SumInt32Scalar 43981 us 43980 us 16
> bytes_per_second=8.47046G/s items_per_second=2.27377G/s
> BM_SumInt32ScalarRegister 28381 us 28380 us 24
> bytes_per_second=13.1263G/s items_per_second=3.52355G/s
> BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
> bytes_per_second=4.71361G/s items_per_second=1.2653G/s
> BM_SumInt32ScalarSentinel 53430 us 53429 us 13
> bytes_per_second=6.97235G/s items_per_second=1.87163G/s
> BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
> bytes_per_second=11.4802G/s items_per_second=3.0817G/s
> BM_SumInt64Scalar 68112 us 68111 us 10
> bytes_per_second=10.9389G/s items_per_second=1.46819G/s
> BM_SumInt64ScalarRegister 56188 us 56187 us 12
> bytes_per_second=13.2603G/s items_per_second=1.77977G/s
> BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
> bytes_per_second=11.3492G/s items_per_second=1.52327G/s
> BM_SumInt64ScalarSentinel 69674 us 69673 us 10
> bytes_per_second=10.6937G/s items_per_second=1.43528G/s
> BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
> bytes_per_second=12.5857G/s items_per_second=1.68922G/s
>
> The interesting part here is that the ScalarSentinelRegister version is
> very close to the fully vectorized ScalarRegister version on AVX512.
> Antoine, the compiler is doing exactly what you're suggesting:
>
> 374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
> 48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
> 230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
> │ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
> │ vpcmpneqq %ymm13,%ymm8,%k2
> 25 │ vpcmpneqq %ymm13,%ymm9,%k3
> 20 │ vpcmpneqq %ymm13,%ymm10,%k4
> 72 │ vpcmpneqq %ymm13,%ymm11,%k1
> │ if (Traits<T>::NotNull(*values)) {
> │ vpbroadcastq %rbx,%ymm12{%k2}{z}
> 15 │ vpaddq %ymm12,%ymm3,%ymm3
> 9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
> 57 │ vpaddq %ymm12,%ymm5,%ymm5
> │ vpbroadcastq %rbx,%ymm12{%k4}{z}
> 39 │ vpaddq %ymm12,%ymm6,%ymm6
> 22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
> 88 │ vpaddq %ymm12,%ymm7,%ymm7
>
> See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
> didn't have time to look at the other Sum (bitmap) implementation or to
> clean the code. I'll try to improve this over the weekend.
>
> On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi folks,
> >
> > I explored a bit the performance implications of using validity
> > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > NaN, INT32_MIN) for nulls:
> >
> > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >
> > The vectorization results may be of interest to those implementing
> > analytic functions targeting the Arrow memory format. There's probably
> > some other optimizations that can be employed, too.
> >
> > Caveat: it's entirely possible I made some mistakes in my code. I
> > checked the various implementations for correctness only, and did not
> > dig too deeply beyond that.
> >
> > - Wes
> >
>
>
> --
> Sent from my jetpack.
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Francois Saint-Jacques <fs...@networkdump.com>.
It seems the code for the naive Scalar example is not friendly with the
compiler's auto-vectorizer. If you accumulate in a local state
(instead of the SumState pointer), you'll get different results, at least
with clang++6.0.
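The difference can be sketched roughly like this (a reconstruction following the benchmark labels, not the exact code in the fork):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct SumState {
  int32_t total = 0;
};

// Accumulating through the pointer: the compiler must assume `state` may
// alias `values`, so the running total is stored back every iteration and
// the reduction is not vectorized.
void SumScalar(const int32_t* values, size_t n, SumState* state) {
  for (size_t i = 0; i < n; ++i) {
    state->total += values[i];
  }
}

// Accumulating in a local and merging once at the end: the total can live
// in registers, and the loop becomes a plain reduction the auto-vectorizer
// handles well.
void SumScalarRegister(const int32_t* values, size_t n, SumState* state) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    sum += values[i];
  }
  state->total += sum;
}
```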
benchmark-noavx (only SSE):
BM_SumInt32Scalar 42390 us 42389 us 16
bytes_per_second=8.78824G/s items_per_second=2.35908G/s
BM_SumInt32ScalarRegister 28036 us 28035 us 25
bytes_per_second=13.2878G/s items_per_second=3.56691G/s
BM_SumInt32ScalarUnrolled 31660 us 31659 us 22
bytes_per_second=11.7668G/s items_per_second=3.15863G/s
BM_SumInt32ScalarSentinel 49541 us 49540 us 14
bytes_per_second=7.51971G/s items_per_second=2.01856G/s
BM_SumInt32ScalarSentinelRegister 55655 us 55654 us 12
bytes_per_second=6.69369G/s items_per_second=1.79683G/s
BM_SumInt64Scalar 65030 us 65030 us 11
bytes_per_second=11.4572G/s items_per_second=1.53776G/s
BM_SumInt64ScalarRegister 55370 us 55369 us 13
bytes_per_second=13.4563G/s items_per_second=1.80607G/s
BM_SumInt64ScalarUnrolled 61287 us 61286 us 11
bytes_per_second=12.1571G/s items_per_second=1.6317G/s
BM_SumInt64ScalarSentinel 69792 us 69792 us 10
bytes_per_second=10.6755G/s items_per_second=1.43284G/s
BM_SumInt64ScalarSentinelRegister 68418 us 68416 us 10
bytes_per_second=10.89G/s items_per_second=1.46164G/s
benchmark-native (AVX512):
BM_SumInt32Scalar 43981 us 43980 us 16
bytes_per_second=8.47046G/s items_per_second=2.27377G/s
BM_SumInt32ScalarRegister 28381 us 28380 us 24
bytes_per_second=13.1263G/s items_per_second=3.52355G/s
BM_SumInt32ScalarUnrolled 79035 us 79033 us 9
bytes_per_second=4.71361G/s items_per_second=1.2653G/s
BM_SumInt32ScalarSentinel 53430 us 53429 us 13
bytes_per_second=6.97235G/s items_per_second=1.87163G/s
BM_SumInt32ScalarSentinelRegister 32450 us 32450 us 22
bytes_per_second=11.4802G/s items_per_second=3.0817G/s
BM_SumInt64Scalar 68112 us 68111 us 10
bytes_per_second=10.9389G/s items_per_second=1.46819G/s
BM_SumInt64ScalarRegister 56188 us 56187 us 12
bytes_per_second=13.2603G/s items_per_second=1.77977G/s
BM_SumInt64ScalarUnrolled 65649 us 65648 us 11
bytes_per_second=11.3492G/s items_per_second=1.52327G/s
BM_SumInt64ScalarSentinel 69674 us 69673 us 10
bytes_per_second=10.6937G/s items_per_second=1.43528G/s
BM_SumInt64ScalarSentinelRegister 59200 us 59199 us 12
bytes_per_second=12.5857G/s items_per_second=1.68922G/s
The interesting part here is that the ScalarSentinelRegister version is
very close to the fully vectorized ScalarRegister version on AVX512.
Antoine, the compiler is doing exactly what you're suggesting:
374 │ vmovdqu 0x20(%rcx,%rdi,8),%ymm9
48 │ vmovdqu 0x40(%rcx,%rdi,8),%ymm10
230 │ vmovdqu 0x60(%rcx,%rdi,8),%ymm11
│ static inline bool NotNull(int64_t val) { return val != null_sentinel; }
│ vpcmpneqq %ymm13,%ymm8,%k2
25 │ vpcmpneqq %ymm13,%ymm9,%k3
20 │ vpcmpneqq %ymm13,%ymm10,%k4
72 │ vpcmpneqq %ymm13,%ymm11,%k1
│ if (Traits<T>::NotNull(*values)) {
│ vpbroadcastq %rbx,%ymm12{%k2}{z}
15 │ vpaddq %ymm12,%ymm3,%ymm3
9 │ vpbroadcastq %rbx,%ymm12{%k3}{z}
57 │ vpaddq %ymm12,%ymm5,%ymm5
│ vpbroadcastq %rbx,%ymm12{%k4}{z}
39 │ vpaddq %ymm12,%ymm6,%ymm6
22 │ vpbroadcastq %rbx,%ymm12{%k1}{z}
88 │ vpaddq %ymm12,%ymm7,%ymm7
See my fork, https://github.com/fsaintjacques/bitmaps-vs-sentinels . I
didn't have time to look at the other Sum (bitmap) implementation or to
clean the code. I'll try to improve this over the weekend.
On Tue, Oct 16, 2018 at 8:05 AM Wes McKinney <we...@gmail.com> wrote:
> hi folks,
>
> I explored a bit the performance implications of using validity
> bitmaps (like the Arrow columnar format) vs. sentinel values (like
> NaN, INT32_MIN) for nulls:
>
> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> The vectorization results may be of interest to those implementing
> analytic functions targeting the Arrow memory format. There's probably
> some other optimizations that can be employed, too.
>
> Caveat: it's entirely possible I made some mistakes in my code. I
> checked the various implementations for correctness only, and did not
> dig too deeply beyond that.
>
> - Wes
>
--
Sent from my jetpack.
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Antoine Pitrou <an...@python.org>.
On 16/10/2018 18:55, Ted Dunning wrote:
> It should be possible to unroll the sentinel version in many cases. For
> instance,
>
> sum += (data[i] != SENTINEL) * data[i]
>
> This doesn't work with NaN as a sentinel, since arithmetic involving NaN
> yields NaN, but it can work with other values.
One should also explore whether the multiplication pattern is really
optimal. Perhaps a condition would work better. The compiler may be
able to turn it into a cheap conditional move (to suppress branch
misprediction costs).
Regards
Antoine.
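One conditional form (a sketch, not measured here) uses a select that compilers typically lower to a cmov or a masked vector operation; unlike the multiplication trick, it also handles a NaN sentinel, since the null value is never fed into the arithmetic:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Conditional sentinel sum for doubles with NaN as the null marker.
// `x == x` is false only for NaN, and the ternary compiles to a select
// (conditional move / vector blend) rather than an unpredictable branch.
double SumConditional(const std::vector<double>& data) {
  double sum = 0.0;
  for (double x : data) {
    sum += (x == x) ? x : 0.0;
  }
  return sum;
}
```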
>
>
> On Tue, Oct 16, 2018 at 9:38 AM Antoine Pitrou <an...@python.org> wrote:
>
>>
>> Hi Wes,
>>
>> On 16/10/2018 14:05, Wes McKinney wrote:
>>> hi folks,
>>>
>>> I explored a bit the performance implications of using validity
>>> bitmaps (like the Arrow columnar format) vs. sentinel values (like
>>> NaN, INT32_MIN) for nulls:
>>>
>>> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>>>
>>> The vectorization results may be of interest to those implementing
>>> analytic functions targeting the Arrow memory format. There's probably
>>> some other optimizations that can be employed, too.
>>
>> This is a nice write-up. It may also be possible to further speed up
>> things using explicit SIMD operations.
>>
>> For the non-null case, it should be relatively doable, see e.g.
>> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/int/reduce_add.html
>> or
>> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/fp/reduce_add.html
>> .
>>
>> For the with-nulls case, it might be possible to do something with SIMD
>> masks, but I'm not competent to propose anything concrete :-)
>>
>> Regards
>>
>> Antoine.
>>
>>
>>>
>>> Caveat: it's entirely possible I made some mistakes in my code. I
>>> checked the various implementations for correctness only, and did not
>>> dig too deeply beyond that.
>>>
>>> - Wes
>>>
>>
>
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Ted Dunning <te...@gmail.com>.
It should be possible to unroll the sentinel version in many cases. For
instance,
sum += (data[i] != SENTINEL) * data[i]
This doesn't work with NaN as a sentinel, since arithmetic involving NaN
yields NaN, but it can work with other values.
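Concretely, with an integer sentinel, the branch-free loop might look like this (a sketch with INT32_MIN as the sentinel; the 0/1 comparison result replaces the branch):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Branch-free sentinel sum: (x != SENTINEL) evaluates to 0 or 1, so null
// entries contribute 0 without any branch. As noted above, this breaks
// when the sentinel is NaN, since arithmetic involving NaN stays NaN.
int64_t SumBranchless(const std::vector<int32_t>& data) {
  const int32_t SENTINEL = std::numeric_limits<int32_t>::min();
  int64_t sum = 0;
  for (int32_t x : data) {
    sum += static_cast<int64_t>(x != SENTINEL) * x;
  }
  return sum;
}
```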
On Tue, Oct 16, 2018 at 9:38 AM Antoine Pitrou <an...@python.org> wrote:
>
> Hi Wes,
>
> On 16/10/2018 14:05, Wes McKinney wrote:
> > hi folks,
> >
> > I explored a bit the performance implications of using validity
> > bitmaps (like the Arrow columnar format) vs. sentinel values (like
> > NaN, INT32_MIN) for nulls:
> >
> > http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >
> > The vectorization results may be of interest to those implementing
> > analytic functions targeting the Arrow memory format. There's probably
> > some other optimizations that can be employed, too.
>
> This is a nice write-up. It may also be possible to further speed up
> things using explicit SIMD operations.
>
> For the non-null case, it should be relatively doable, see e.g.
> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/int/reduce_add.html
> or
> https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/fp/reduce_add.html
> .
>
> For the with-nulls case, it might be possible to do something with SIMD
> masks, but I'm not competent to propose anything concrete :-)
>
> Regards
>
> Antoine.
>
>
> >
> > Caveat: it's entirely possible I made some mistakes in my code. I
> > checked the various implementations for correctness only, and did not
> > dig too deeply beyond that.
> >
> > - Wes
> >
>
Re: Algorithmic explorations of bitmaps vs. sentinel values
Posted by Antoine Pitrou <an...@python.org>.
Hi Wes,
On 16/10/2018 14:05, Wes McKinney wrote:
> hi folks,
>
> I explored a bit the performance implications of using validity
> bitmaps (like the Arrow columnar format) vs. sentinel values (like
> NaN, INT32_MIN) for nulls:
>
> http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>
> The vectorization results may be of interest to those implementing
> analytic functions targeting the Arrow memory format. There's probably
> some other optimizations that can be employed, too.
This is a nice write-up. It may also be possible to further speed up
things using explicit SIMD operations.
For the non-null case, it should be relatively doable, see e.g.
https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/int/reduce_add.html
or
https://p12tic.github.io/libsimdpp/v2.2-dev/libsimdpp/w/fp/reduce_add.html .
For the with-nulls case, it might be possible to do something with SIMD
masks, but I'm not competent to propose anything concrete :-)
Regards
Antoine.
>
> Caveat: it's entirely possible I made some mistakes in my code. I
> checked the various implementations for correctness only, and did not
> dig too deeply beyond that.
>
> - Wes
>