Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2020/06/09 09:34:59 UTC

[C++][Discuss] Approaches for SIMD optimizations

Hello,

As part of https://github.com/apache/arrow/pull/7314, a discussion
started about our strategy for adding SIMD optimizations to various
routines and kernels.

Currently, we have no defined strategy and we have been adding
hand-written SIMD-optimized functions for particular primitives and
instruction sets, thanks to the submissions of contributors.  For
example, the above PR adds ~500 lines of code for the purpose of
accelerating the SUM kernel, when the input has no nulls, on the SSE
instruction set.

However, it seems that this ad hoc approach may not scale very well.
There are several widely-used SIMD instruction sets out there (the most
common being SSE[2], AVX[2], AVX512, Neon... I suppose ARM SVE will come
into play at some point), and there will be many potential functions to
optimize once we start writing a comprehensive library of computation
kernels.  Adding hand-written implementations, using intrinsic
functions, for each {routine, instruction set} pair threatens to create
a large maintenance burden.

In that PR, I suggested that we instead take a look at the SIMD wrapper
libraries available in C++.  There are several available:
* MIPP (https://github.com/aff3ct/MIPP)
* Vc (https://github.com/VcDevel/Vc)
* libsimdpp (https://github.com/p12tic/libsimdpp)
* (others yet)

In the course of the discussion, an interesting paper was mentioned:
https://dl.acm.org/doi/pdf/10.1145/3178433.3178435
together with an implementation comparison of a simple function:
https://gitlab.inria.fr/acassagn/mandelbrot

The SIMD wrappers met skepticism from Frank, the PR submitter, on the
basis that performance may not be optimal and that not all desired
features may be provided (such as runtime dispatching).

However, we also have to take into account that, without a wrapper library, we
will probably only integrate and maintain a small fraction of the
optimized routines that would be otherwise possible with a more
abstracted approach.  So, while the hand-written approach can be better
on a single {routine, instruction set} pair, it may lead to a globally
suboptimal situation (that is, unless the number of full-time developers
and maintainers on Arrow C++ inflates significantly).

Personally, I would like interested developers and contributors (such as
Micah, Frank, Yibo Cai) to hash out the various possible approaches, and
propose a way forward (which may be hybrid).

Regards

Antoine.


Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Yibo Cai <yi...@arm.com>.
I did a quick investigation of libsimdpp, Google Highway[1], and Vc (std-simd)[2].

I tried rewriting the SIMD utf8 validation code by unifying the SSE and Neon intrinsics with libsimdpp wrappers. You can compare the simdpp/sse/neon code[3].
Utf8 validation is non-trivial, but the porting was straightforward, easier than I thought. And the SIMD wrappers unify mysterious names like `_mm_shuffle_epi8` and `vqtbl1q_u8` into friendlier ones like `simdpp::permute_bytes16` or `hwy::TableLookupBytes`.
But I ultimately failed due to some gaps between Neon and SSE4. Neon `tbl` supports looking up multiple tables at once and returns 0 if an index is out of bounds[4]. SSE4 `pshufb` looks up a single table, and out-of-bounds indices are handled in a much more complicated way[5]. In this specific code, the Neon behaviour is convenient. To unify the code, I would have to abandon the Neon feature and sacrifice performance on Arm. Of course, it's also possible to improve libsimdpp/highway instead.
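
To make the gap concrete, here is a minimal sketch (not taken from the
linked code) of the single-table byte lookup on both sides; the
multi-table Neon variants (vqtbl2q_u8..vqtbl4q_u8) have no
single-instruction SSE counterpart at all:

    #include <cstdint>

    #if defined(__SSSE3__)
    #include <tmmintrin.h>
    // SSE: a result byte is zeroed only when the index byte has its high
    // bit (0x80) set; otherwise only the low 4 bits are used, so an index
    // like 0x15 silently wraps to lane 5 instead of producing 0.
    __m128i LookupBytes(__m128i table, __m128i indices) {
      return _mm_shuffle_epi8(table, indices);
    }
    #elif defined(__aarch64__)
    #include <arm_neon.h>
    // Neon: any index >= 16 simply yields 0, which is exactly the
    // behaviour this utf8 range check wants.
    uint8x16_t LookupBytes(uint8x16_t table, uint8x16_t indices) {
      return vqtbl1q_u8(table, indices);
    }
    #endif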

I think this is common to all SIMD wrappers: they can unify most code when the vector length is the same, but there are always cases where arch-dependent code is necessary, such as the example above or advanced features like AVX-512 masks, which are not well supported by SIMD wrappers.

About performance: as described in the Google Highway design philosophy[6], it achieves portability, maintainability and readability by sacrificing 10-20% of performance. Sounds fair.

libsimdpp and Highway look like mature products; they claim to support gcc/clang/msvc, C++11, and x86/arm/ppc. std-simd only supports gcc-9+, and its arm/ppc support is poor for now.

That said, I don't think leveraging a SIMD wrapper alone will fix our problem, especially code size, both source and binary. They are just shallow wrappers over intrinsics with a friendlier API.

Personally, I prefer applying SIMD only to subroutines (e.g. the sum loop), not the whole kernel. It's simpler, though of course we need to keep the ifdefs from exploding. A SIMD kernel shares a lot of code with the base kernel, and moving that code to xxx_internal.h makes the base kernel harder to read.

Besides, as Wes commented[7], it's better to put SIMD code in a standalone shared lib. Binary size may explode quickly: we will need #{i8,i16,i32,i64,float,double} * #{sse,avx,avx512 | neon,sve} SIMD code instances for a simple sum operation, though a SIMD wrapper may help reduce source code size through its carefully designed templates.

[1] https://github.com/google/highway
[2] https://github.com/VcDevel/std-simd
[3] simdpp: https://github.com/cyb70289/utf8/blob/simdpp/range-simdpp.cc
     sse: https://github.com/cyb70289/utf8/blob/simdpp/range-sse.c
     neon: https://github.com/cyb70289/utf8/blob/simdpp/range-neon.c
[4] https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vqtbl2q_u8
[5] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_shuffle_epi8&expand=5153
[6] https://github.com/google/highway#design-philosophy
[7] https://github.com/apache/arrow/pull/7314#issuecomment-638972317

Yibo


Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Micah Kornfield <em...@gmail.com>.
>
> I agree that the compiler should generate good vectorized code for the
> non-null data part, but in fact it didn't. Jed Brown pointed out that one
> can force the compiler to SIMD using some additional pragmas, something
> like "#pragma omp simd reduction(+:sum)"

It is an interesting question why.  We won't always be able to rely on
the compiler, but when it does something unexpected, I'm not sure the best
response is to jump to intrinsics.

In this case I think most of the gain could be had by adjusting
"constexpr int64_t kRoundFactor = 8;" [1] to be:

constexpr int64_t kRoundFactor = SIMD_REGISTER_SIZE / sizeof(Type);

[1]
https://github.com/apache/arrow/blob/efb707a5438380dcef78418668b57c3f60592a23/cpp/src/arrow/compute/kernels/aggregate_basic.cc#L143
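
A minimal sketch of the idea (SIMD_REGISTER_SIZE is a placeholder for the
vector width in bytes, e.g. 16 for SSE; this is not the actual kernel
code):

    #include <cstdint>

    template <typename Type>
    Type UnrolledSum(const Type* values, int64_t length) {
      // One register's worth of independent partial sums, so the compiler
      // can keep the inner loop inside a single SIMD register.
      constexpr int64_t kRoundFactor = SIMD_REGISTER_SIZE / sizeof(Type);
      Type sum_rounded[kRoundFactor] = {0};
      const int64_t length_rounded = (length / kRoundFactor) * kRoundFactor;
      for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
        for (int64_t k = 0; k < kRoundFactor; k++) {
          sum_rounded[k] += values[i + k];
        }
      }
      Type sum = 0;
      for (int64_t k = 0; k < kRoundFactor; k++) sum += sum_rounded[k];
      for (int64_t i = length_rounded; i < length; i++) sum += values[i];
      return sum;
    }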


Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Yibo Cai <yi...@arm.com>.
On 6/12/20 2:30 PM, Micah Kornfield wrote:
> Hi Frank,
> Are the performance numbers you published for the baseline directly from
> master?  I'd like to look at this over the next few days to see if I can
> figure out what is going on.
> 
> To all:
> I'd like to make sure we flesh out things to consider in general, for a
> path forward.
> 
> My take on this is we should still prefer writing code in this order:
> 1.  Plain-old C++
> 2.  SIMD Wrapper library (my preference would be towards something that is
> going to be standardized eventually to limit 3P dependencies.  I think the
> counterargument here is whether any of the libraries mentioned above have
> much better feature coverage on advanced instruction sets).  Please chime in if
> there are other things to consider.  We should have some rubrics for when
> to make use of the library (i.e. what performance gain do we get on a
> workload).
> 3.  Native CPU intrinsics.  We should develop a rubric for when to accept
> PRs for this.  This should include:
>         1.  Performance gain.
>         2.  General popularity of the architecture.
> 
> For dynamic dispatch:
> I think we should probably continue down the path of building our own.  I
> looked more at libsimdpp's implementation and it might be something we can
> use for guidance, but as it stands, it doesn't seem to have hooks based on
> CPU manufacturer, which for BMI2 intrinsics would be a requirement.  The
> alternative would be to ban BMI2 intrinsics from the code (this might not
> be a bad idea to limit complexity in general).
> 
> Thoughts?
> 

I think it's a good path forward. Thanks Micah.

> Thanks,
> Micah

Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Micah Kornfield <em...@gmail.com>.
I looked into this a little bit more, and it appears GCC 7.5 is a lot more
dependent on the unrolling parameter [1].

When looking at int64_t, the assembly changes dramatically for GCC 7.5
between unrolling by 16 values and unrolling by 32 values. If I read the
assembly correctly, the former keeps everything in registers, while the
latter tries to write results back to "sum_rounded" as part of the
unrolling. Clang consistently keeps everything in registers (GCC 9 appears
to generate code similar to Clang 8).

This might be an argument for using a SIMD library even for simpler code,
so as not to be so reliant on magic numbers.

[1] https://godbolt.org/z/DZamCk

Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Micah Kornfield <em...@gmail.com>.
I just wanted to follow up on this.  I played around with the existing
master branch to do two things [1]:
1.  Inform the compiler of the vector's alignment (the code can't be merged
as is because it doesn't handle unaligned vectors and needs a macro to
handle the MSVC and GCC/Clang differences in defining attributes).
2.  Adjust the size of batched operations to be 256/sizeof(T).

I did both at once, but tweaked the batch size more. Alignment might not
be necessary for the results below (in which case making a PR would be
much simpler).

Using these changes, I compared against the PR that started this
discussion [2].

TL;DR (detailed results posted below):
1.  For SSE4.2, the version without intrinsics is faster (I had to
hand-disable runtime dispatch in the intrinsic PR by commenting out bits of
the CMake file to make sure AVX2 and AVX512 were not used).  I didn't look
closely at the PR to see if anything was specialized for SSE4.2.
2.  The intrinsic version is generally better (from 5% to 27%) on GCC 7.5
using AVX2 (in the past I've found GCC 7.5 doesn't auto-vectorize as well
as GCC 8 or Clang 8).
3.  Clang 8 has equivalent performance except for the Int8 case, where the
intrinsic version is 29% faster.

Conclusions:
1.  I think we should avoid intrinsics for the PR in question (or at least
split them into a separate PR so that we can focus on runtime dispatch).  I
don't think the int8 use-case warrants the complexity of intrinsics (but we
should potentially investigate other methods to speed it up if we do think
it is important).  I'm happy to re-assess if people think it is important.
2.  We should define which compiler to standardize on for reporting
performance benchmarks.  Probably this should relate to either our wheel or
conda toolchains.

Thanks,
Micah

Detailed performance deltas.  This is run on a GCE instance reporting
(model name      : Intel(R) Xeon(R) CPU @ 2.20GHz).


GCC 7.5 (SSE4.2)    (% reduction in runtime with intrinsics; + indicates
intrinsics are worse)

SumKernelFloat/32768/0       +0.1950         +0.1950
SumKernelDouble/32768/0      +0.1946         +0.1946
SumKernelInt8/32768/0        -0.0037         -0.0037
SumKernelInt16/32768/0       +0.2803         +0.2804
SumKernelInt32/32768/0       +0.6422         +0.6422
SumKernelInt64/32768/0       +0.2221         +0.2221

GCC 7.5 (AVX2)      (reduction in runtime for intrinsics)

SumKernelFloat/32768/0       -0.1222         -0.1222
SumKernelDouble/32768/0      -0.2736         -0.2736
SumKernelInt8/32768/0        -0.2745         -0.2745
SumKernelInt16/32768/0       -0.1047         -0.1047
SumKernelInt32/32768/0       -0.0518         -0.0518
SumKernelInt64/32768/0       -0.2715         -0.2715

Clang 8 (AVX2)

SumKernelFloat/32768/0       +0.0134         +0.0134
SumKernelDouble/32768/0      +0.0186         +0.0186
SumKernelInt8/32768/0        -0.2883         -0.2883
SumKernelInt16/32768/0       -0.0243         -0.0243
SumKernelInt32/32768/0       +0.0120         +0.0120
SumKernelInt64/32768/0       -0.0023         -0.0022


[1]
https://github.com/emkornfield/arrow/commit/f8755a046055c0a24736793fac02e1e428854dd8
[2] https://github.com/apache/arrow/pull/7314/


Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Antoine Pitrou <an...@python.org>.
On 12/06/2020 08:30, Micah Kornfield wrote:
> Hi Frank,
> Are the performance numbers you published for the baseline directly from
> master?  I'd like to look at this over the next few days to see if I can
> figure out what is going on.
> 
> To all:
> I'd like to make sure we flesh out things to consider in general, for a
> path forward.
> 
> My take on this is we should still prefer writing code in this order:
> 1.  Plain-old C++

I don't know if that is implicitly included in your 1), but there's also:
1.5) C++ with hand-unrolled loops

> 3.  Native CPU intrinsics.  We should develop a rubric for when to accept
> PRs for this.  This should include:
>        1.  Performance gain.
>        2.  General popularity of the architecture.

This should also include 3) criticality of the optimization.  For
example, improving operations on null bitmaps sounds more critical for
Arrow than optimizing e.g. the "ascii_lower" kernel.

> For dynamic dispatch:
> I think we should probably continue down the path of building our own.

Agreed.  There seem to be two possible approaches with this:
1) a hardwired list of tests for CPU flags
2) a lazily-cached function pointer
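
For 2), a minimal sketch of what I mean (the function names are
illustrative, and GCC/Clang's __builtin_cpu_supports is assumed for the
flag test):

    #include <cstdint>

    int64_t SumInt64Scalar(const int64_t* values, int64_t n);  // plain C++
    int64_t SumInt64Avx2(const int64_t* values, int64_t n);    // built with -mavx2

    using SumFn = int64_t (*)(const int64_t*, int64_t);

    static SumFn ResolveSumInt64() {
    #if defined(__x86_64__)
      if (__builtin_cpu_supports("avx2")) return SumInt64Avx2;
    #endif
      return SumInt64Scalar;
    }

    int64_t SumInt64(const int64_t* values, int64_t n) {
      // Lazily-cached function pointer: the CPU query runs exactly once.
      static const SumFn fn = ResolveSumInt64();
      return fn(values, n);
    }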

Regards

Antoine.

RE: [C++][Discuss] Approaches for SIMD optimizations

Posted by "Du, Frank" <fr...@intel.com>.
Hi Micah,

Yes, on the latest baseline.

One thing for the SIMD library: can we take into account the masked-op feature[1]? MIPP[2] supports it already. It is very useful, as most Arrow functions come with a validity bitmap.

[1] https://en.wikipedia.org/wiki/AVX-512#Opmask_registers
[2] https://github.com/aff3ct/MIPP#masked-instructions
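
As an illustration of why the masks help (a sketch, not code from my PR):
with AVX-512 the validity bits can drive the load directly, so the null
check disappears from the inner loop.

    #include <cstdint>
    #include <immintrin.h>

    // Accumulate 8 int64 lanes; bit i of 'valid' is 1 when lane i is
    // non-null. Masked-off lanes are loaded as zero and add nothing.
    __m512i MaskedAccumulate(__m512i acc, const int64_t* values,
                             __mmask8 valid) {
      __m512i vals = _mm512_maskz_loadu_epi64(valid, values);
      return _mm512_add_epi64(acc, vals);
    }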

Thanks,
Frank


Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Micah Kornfield <em...@gmail.com>.
Hi Frank,
Are the performance numbers you published for the baseline directly from
master?  I'd like to look at this over the next few days to see if I can
figure out what is going on.

To all:
I'd like to make sure we flesh out things to consider in general, for a
path forward.

My take on this is we should still prefer writing code in this order:
1.  Plain-old C++
2.  SIMD Wrapper library (my preference would be towards something that is
going to be standardized eventually to limit 3P dependencies.  I think the
counterargument here is whether any of the libraries mentioned above have
much better feature coverage on advanced instruction sets).  Please chime in if
there are other things to consider.  We should have some rubrics for when
to make use of the library (i.e. what performance gain do we get on a
workload).
3.  Native CPU intrinsics.  We should develop a rubric for when to accept
PRs for this.  This should include:
       1.  Performance gain.
       2.  General popularity of the architecture.

For dynamic dispatch:
I think we should probably continue down the path of building our own.  I
looked more at libsimdpp's implementation and it might be something we can
use for guidance, but as it stands, it doesn't seem to have hooks based on
CPU manufacturer, which for BMI2 intrinsics would be a requirement.  The
alternative would be to ban BMI2 intrinsics from the code (this might not
be a bad idea to limit complexity in general).
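
(The manufacturer hook matters because pdep/pext, while fast on Intel
chips, are microcoded and very slow on AMD chips before Zen 3, so the
CPUID feature flag alone is misleading.)  A rough sketch of the extra
check, assuming GCC's builtins and with an illustrative name:

    bool UseBmi2Path() {
      // The "bmi2" flag is set on AMD Zen/Zen2 even though pdep/pext are
      // microcoded there, so also require an Intel CPU (a real
      // implementation might whitelist known-good AMD families instead).
      return __builtin_cpu_supports("bmi2") && __builtin_cpu_is("intel");
    }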

Thoughts?

Thanks,
Micah
>

RE: [C++][Discuss] Approaches for SIMD optimizations

Posted by "Du, Frank" <fr...@intel.com>.
Thanks Jed.

I collected some data on my setup: gcc 7.5.0, Ubuntu 18.04.4 LTS, SSE build (-msse4.2).

[Unroll baseline]
    for (int64_t i = 0; i < length_rounded; i += kRoundFactor) {
      for (int64_t k = 0; k < kRoundFactor; k++) {
        sum_rounded[k] += values[i + k];
      }
    }
SumKernelFloat/32768/0        2.91 us         2.90 us       239992 bytes_per_second=10.5063G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0       1.89 us         1.89 us       374470 bytes_per_second=16.1847G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0         11.6 us         11.6 us        60329 bytes_per_second=2.63274G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0        6.98 us         6.98 us       100293 bytes_per_second=4.3737G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0        3.89 us         3.88 us       180423 bytes_per_second=7.85862G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0        1.86 us         1.85 us       380477 bytes_per_second=16.4536G/s null_percent=0 size=32.768k

[#pragma omp simd reduction(+:sum)]
#pragma omp simd reduction(+:sum)
    for (int64_t i = 0; i < n; i++)
        sum += values[i];
SumKernelFloat/32768/0        2.97 us         2.96 us       235686 bytes_per_second=10.294G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0       2.97 us         2.97 us       236456 bytes_per_second=10.2875G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0         11.7 us         11.7 us        60006 bytes_per_second=2.61643G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0        5.47 us         5.47 us       127999 bytes_per_second=5.58002G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0        2.42 us         2.41 us       290635 bytes_per_second=12.6485G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0        1.82 us         1.82 us       386749 bytes_per_second=16.7733G/s null_percent=0 size=32.768k

[SSE intrinsic]
SumKernelFloat/32768/0        2.24 us         2.24 us       310914 bytes_per_second=13.6335G/s null_percent=0 size=32.768k
SumKernelDouble/32768/0       1.43 us         1.43 us       486642 bytes_per_second=21.3266G/s null_percent=0 size=32.768k
SumKernelInt8/32768/0         6.93 us         6.92 us       100720 bytes_per_second=4.41046G/s null_percent=0 size=32.768k
SumKernelInt16/32768/0        3.14 us         3.14 us       222803 bytes_per_second=9.72931G/s null_percent=0 size=32.768k
SumKernelInt32/32768/0        2.11 us         2.11 us       331388 bytes_per_second=14.4907G/s null_percent=0 size=32.768k
SumKernelInt64/32768/0        1.32 us         1.32 us       532964 bytes_per_second=23.0728G/s null_percent=0 size=32.768k

I tried tweaking kRoundFactor, using some unroll-based omp simd, and building with clang-8, but unluckily I could never get the results up to the intrinsic version. The generated ASM code all uses SIMD instructions, with only small differences such as instruction sequences or which xmm registers are used. What the compiler does under the hood is really a mystery to me.

Thanks,
Frank


RE: [C++][Discuss] Approaches for SIMD optimizations

Posted by Jed Brown <je...@jedbrown.org>.
"Du, Frank" <fr...@intel.com> writes:

> The PR I committed provides basic support for runtime dispatching. I
> agree that the compiler should generate good vectorized code for the
> non-null data part, but in fact it didn't. Jed Brown pointed out that one
> can force the compiler to SIMD using some additional pragmas, something
> like "#pragma omp simd reduction(+:sum)". I will try this pragma later,
> but I need to figure out whether it needs linking against OpenMP.

It does not require linking OpenMP.  You just compile with -fopenmp-simd
(gcc/clang) or -qopenmp-simd (icc) so that it interprets the "omp simd"
pragmas.  (These can be captured in macros using _Pragma.)
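
For example, a minimal sketch of the macro capture (the macro names are
made up):

    #include <cstdint>

    // #x stringizes the pragma text so it can be parametrized; the pragma
    // is only honored under -fopenmp-simd / -qopenmp-simd.
    #define PRAGMA(x) _Pragma(#x)
    #define OMP_SIMD_REDUCTION(op, var) PRAGMA(omp simd reduction(op : var))

    double Sum(const double* values, int64_t n) {
      double sum = 0;
      OMP_SIMD_REDUCTION(+, sum)
      for (int64_t i = 0; i < n; i++) sum += values[i];
      return sum;
    }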

Note that you get automatic vectorization for this sort of thing without
any OpenMP if you add -funsafe-math-optimizations (included in
-ffast-math).

  https://gcc.godbolt.org/z/8thgru

Many projects don't want -funsafe-math-optimizations because there are
places where it can hurt numerical stability.  ICC includes unsafe math
in normal optimization levels while GCC and Clang are more conservative.

RE: [C++][Discuss] Approaches for SIMD optimizations

Posted by "Du, Frank" <fr...@intel.com>.
The PR I committed provides basic support for runtime dispatching. I agree that the compiler should generate good vectorized code for the non-null data part, but in fact it didn't. Jed Brown pointed out that one can force the compiler to SIMD using some additional pragmas, something like "#pragma omp simd reduction(+:sum)". I will try this pragma later, but I need to figure out whether it needs linking against OpenMP. As I said in the PR, the next step is to provide acceleration for the nullable data part, which is more typical in the real world and hard for the compiler to vectorize. The nullable path with manual intrinsics is very easy for AVX512 thanks to the native mask support[1]. I did some initial trials on the SSE path locally and concluded that not much gain can be achieved, but I would expect it to be totally different for AVX2, given the extra computation bandwidth AVX2 provides. Considering that most recent x86 hardware already supports AVX2, I can remove the SSE intrinsic path to reduce the maintenance burden.

For the SIMD wrapper: it seems popular compute libraries (NumPy, OpenBLAS, etc.) also use intrinsics directly. I heard NumPy is trying to unify around a single interface but is still struggling for many reasons; the hardware provides similar interfaces, but there are still too many differences in the details.

[1] https://en.wikipedia.org/wiki/AVX-512#Opmask_registers

Thanks,
Frank


RE: [C++][Discuss] Approaches for SIMD optimizations

Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
> My initial thought is that, in the short term, we should focus on
> the dynamic dispatch question (continue to build our own vs. adopt an
> existing library) and lean on the compiler for most vectorization. Using
> intrinsics should be limited to complex numerical functions and places
> where the compiler fails to vectorize/translate well (e.g. bit
> manipulations).

It looks like a good direction for reducing the number of binaries on one
CPU architecture (e.g. x86).

Kazuaki Ishizaki

Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Micah Kornfield <em...@gmail.com>.
A few thoughts on this at a high level:
1.  Most of the libraries don't support runtime dispatch (libsimdpp seems
to be the exception here), so we should decide if we want to roll our own
dynamic dispatch mechanism.
2.  It isn't clear to me from the linked PR what the performance delta is
between the hand-written SIMD code and what the compiler would generate.
For simple aggregates of non-null data I would expect pretty good
auto-vectorization.  Compiler auto-vectorization seems to get better over
time.  For instance the scalar example linked in the paper seems to get
vectorized somewhat under Clang 10 (https://godbolt.org/z/oPopQL).
3.  It appears there are some efforts to make a standardized C++ library
[1] which might be based on Vc.
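
For reference, a small sketch of what [1]'s interface looks like for the
sum case (this assumes a toolchain that actually ships <experimental/simd>,
e.g. a recent libstdc++; illustrative only):

    #include <cstdint>
    #include <experimental/simd>
    namespace stdx = std::experimental;

    int64_t Sum(const int64_t* values, int64_t n) {
      using V = stdx::native_simd<int64_t>;  // widest vector for the target
      V acc = 0;
      int64_t i = 0;
      for (; i + int64_t(V::size()) <= n; i += V::size()) {
        acc += V(values + i, stdx::element_aligned);  // vector load
      }
      int64_t sum = stdx::reduce(acc);                // horizontal add
      for (; i < n; i++) sum += values[i];            // scalar tail
      return sum;
    }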

My initial thought is that, in the short term, we should focus on
the dynamic dispatch question (continue to build our own vs. adopt an
existing library) and lean on the compiler for most vectorization. Using
intrinsics should be limited to complex numerical functions and places
where the compiler fails to vectorize/translate well (e.g. bit
manipulations).

If we do find the need for a dedicated library, I would lean towards
something that will converge to a standard, to reduce additional
dependencies in the long run. That being said, most of these libraries
seem to be header-only, so the dependency is fairly lightweight and we can
vendor them if need be.

[1] https://en.cppreference.com/w/cpp/experimental/simd

Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Antoine Pitrou <an...@python.org>.
Thank you.  xsimd used to require C++14, but apparently they have
demoted it to C++11.  Good!

Regards

Antoine.

Re: [C++][Discuss] Approaches for SIMD optimizations

Posted by Maarten Breddels <ma...@gmail.com>.
Hi Antoine,

Adding xsimd to the list of options:
 * https://github.com/xtensor-stack/xsimd
Not sure how it compares to the rest though.

cheers,

Maarten