Posted to dev@arrow.apache.org by Sasha Krassovsky <kr...@gmail.com> on 2022/04/01 06:43:33 UTC

Re: [C++] Replacing xsimd with compiler autovectorization

I agree that a potential inconsistent experience is a problem, but I
disagree that SIMD would be the root of the problem, or even be a
significant contributor to it.
The problem is essentially: "How can we be sure that all compilers will
generate good code on all platforms?" As you said, we have a lot of
platforms so that's not really practical.
I think that since we already try to use autovectorization in the kernels
subsystem, none of these problems are new. Not enabling AVX2 when it
would be simple enough to do so is akin to disabling compiler optimizations
on the grounds that they may only make the program better for some people. On the other
hand, rewriting everything to be explicitly vectorized is also much more
work than just enabling more instruction sets for autovectorization. And
lastly, who can say that xsimd will be compiled properly and perform better
than autovectorization?

So overall, no matter what, we'll have to rewrite the kernel system to be
more SIMD amenable, and enable the relevant instruction sets in the build
system. I don't see writing everything without xsimd introducing any
problems that wouldn't exist with xsimd. I would be fine keeping xsimd
around to give us opportunities to further tune performance. At the very
least, for an initial PR, I would like to keep everything simpler. We can
then evaluate xsimd-fying the kernels separately.
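For concreteness, "enabling the relevant instruction sets in the build system" could look roughly like the following CMake fragment, which compiles the same kernel source once per instruction set (target names and the file name are illustrative, not Arrow's actual build configuration):

```cmake
# Hypothetical sketch: build the same kernel source twice, once with the
# baseline ISA and once with AVX2 enabled for autovectorization; a runtime
# dispatcher would then pick the right object at startup.
add_library(kernels_baseline OBJECT vector_add_kernel.cc)

add_library(kernels_avx2 OBJECT vector_add_kernel.cc)
target_compile_options(kernels_avx2 PRIVATE -mavx2)
```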

Sasha

On Thu, Mar 31, 2022 at 12:36 AM Antoine Pitrou <an...@python.org> wrote:

>
> On 31/03/2022 at 09:19, Sasha Krassovsky wrote:
> >> As I showed, those auto-vectorized kernels may be vectorized only in
> some situations, depending on the compiler version, the input datatypes...
> >
> > I would more than anything interpret the fact that that code was
> vectorized at all as an amazing win for compiler technology, as it’s a very
> abstract way of gluing together different pieces of code using templates
> and lambda expressions.
>
> That's a possible interpretation, but it doesn't really help the bottom
> line :-)
>
> > A lot of the kernels that we would be writing are probably basic unit
> tests [1] for the compiler’s vectorizer, and I’ve hopefully shown that even
> very old versions do just fine.
> >
> > Anyway, in the worst case we will eventually write every kernel with
> xsimd, and have the autovectorized kernels temporarily there. If we find
> that performance is good on our platforms, then we can skip the “rewrite in
> xsimd” step.
>
> "Our platforms" are rather broad however. We have binary packages for
> Windows, macOS, Linux, using several compilers and toolchains (because
> there are R packages, Python packages and sometimes C++ packages). For
> example, on Windows the R packages are built with different versions of
> MinGW/gcc depending on the R version, while the Python packages are
> built with some version of MSVC (which might be of a different version
> depending on whether it's a conda package or a Python wheel, I'm not sure).
>
> And there are of course the different architectures: we support x86 and
> arm64 for both macOS and Linux, for example; we might even have ppc64
> packages of some sort (?).
>
> Regards
>
> Antoine.
>

Re: [C++] Replacing xsimd with compiler autovectorization

Posted by Antoine Pitrou <an...@python.org>.
On 03/04/2022 at 21:38, Sasha Krassovsky wrote:
> 
>> There is concrete proof that autovectorization produces very flimsy results (even on the same compiler, simply by varying the datatypes).
> 
> As I’ve shown, the Vector-Vector Add kernel example is consistently vectorized well across compilers if written in a simple way.

Does it handle a validity bitmap efficiently? Does it handle an entire 
range of datatypes? Does it handle both array and scalar inputs? If not, 
how would you propose to handle all these? Chances are, you'll end up 
rewriting another array of template abstractions.

> Until I’ve seen a poorly-vectorized scalar kernel written as a simple for loop, I consider these arguments theoretical as well.

This makes little sense. The Arrow C++ codebase is not "theoretical", 
it's what you are presently working on.

> It seems that we’re in agreement at least on the concrete action for an initial PR: make the kernels system more SIMD-amenable and enable compiling source files several times so that, at a minimum, the relevant instruction sets are enabled. Next, we can evaluate which kernels are worth rewriting with xsimd. Does that sound right?

Indeed you can have an initial stab at that.

Regards

Antoine.


> 
> Sasha
> 
> 
>> On Apr 3, 2022, at 11:47, Antoine Pitrou <an...@python.org> wrote:
>>
>> It would be a very significant contributor, as the inconsistency can manifest under the form of up to 8-fold differences in performance (or perhaps more).
> 

Re: [C++] Replacing xsimd with compiler autovectorization

Posted by Sasha Krassovsky <kr...@gmail.com>.
> It would be a very significant contributor, as the inconsistency can manifest under the form of up to 8-fold differences in performance (or perhaps more).

This is on a micro benchmark. For a user workload, the kernel will account for maybe 20% of the runtime, so even if the kernel gets 10x faster the user workload will only be 18% faster (or in the ballpark, I didn’t math it rigorously). 

> There is concrete proof that autovectorization produces very flimsy results (even on the same compiler, simply by varying the datatypes).

There is concrete proof of flimsy results for large template monsters, hidden behind layers of indirection across several source files. As I’ve shown, the Vector-Vector Add kernel example is consistently vectorized well across compilers if written in a simple way. Until I’ve seen a poorly-vectorized scalar kernel written as a simple for loop, I consider these arguments theoretical as well. 

> There is a far cry, however, between the proposal of leveraging autovectorization as a first step towards better performance

Yes, I did amend my proposal earlier in this thread, saying that leaving xsimd in and using it for kernels that don’t autovectorize well would work. 

It seems that we’re in agreement at least on the concrete action for an initial PR: make the kernels system more SIMD-amenable and enable compiling source files several times so that, at a minimum, the relevant instruction sets are enabled. Next, we can evaluate which kernels are worth rewriting with xsimd. Does that sound right?

Sasha


> On Apr 3, 2022, at 11:47, Antoine Pitrou <an...@python.org> wrote:
> 
> It would be a very significant contributor, as the inconsistency can manifest under the form of up to 8-fold differences in performance (or perhaps more).

Re: [C++] Replacing xsimd with compiler autovectorization

Posted by Antoine Pitrou <an...@python.org>.
On 01/04/2022 at 08:43, Sasha Krassovsky wrote:
> I agree that a potential inconsistent experience is a problem, but I
> disagree that SIMD would be the root of the problem, or even be a
> significant contributor to it.

It would be a very significant contributor, as the inconsistency can 
manifest under the form of up to 8-fold differences in performance (or 
perhaps more). Usually, compiler differences can produce differences on 
the order of a few tens of percent, and that is understood by users, but 
an order of magnitude is unexpected and makes it more difficult to 
advertise Arrow performance.

> And
> lastly, who can say that xsimd will be compiled properly and perform better
> than autovectorization?

I don't think such theoretical statements help a lot. There is concrete 
proof that autovectorization produces very flimsy results (even on the 
same compiler, simply by varying the datatypes). Is there any proof that 
xsimd-using code produces such fragile results? I haven't seen any, and 
it is unlikely to be the case (why would a compiler deoptimize SIMD 
intrinsics into plain scalar code?).

> At the very
> least, for an initial PR, I would like to keep everything simpler.

We can certainly do that. There is a far cry, however, between the 
proposal of leveraging autovectorization as a first step towards better 
performance, and the original proposal of removing the xsimd dependency 
and only relying on autovectorization for future efforts :-)

Regards

Antoine.