Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/05/13 01:46:26 UTC

[C++] Runtime SIMD dispatching for Arrow

hi,

We've started to receive a number of patches providing SIMD operations
for both x86 and ARM architectures. Most of these patches make use of
compiler definitions to toggle between code paths at compile time.

This is problematic for a few reasons:

* Binaries that are shipped (e.g. in Python) must generally be
compiled for a broad set of supported compilers. That means that AVX2
/ AVX512 optimizations won't be available in these builds for
processors that have them
* Poses a maintainability and testing problem (hard to test every
combination, and it is not practical for local development to compile
every combination, which may cause drawn out test/CI/fix cycles)

Other projects (e.g. NumPy) have taken the approach of building
binaries that contain multiple variants of a function with different
levels of SIMD, and then choosing at runtime which one to execute
based on what features the CPU supports. This seems like what we
ultimately need to do in Apache Arrow, and if we continue to accept
patches that do not do this, it will be much more work later when we
have to refactor things to runtime dispatching.

We have some PRs in the queue related to SIMD. Without taking a heavy
handed approach like starting to veto PRs, how would everyone like to
begin to address the runtime dispatching problem?

Note that the Kernels revamp project I am working on right now will
also facilitate runtime SIMD kernel dispatching for array expression
evaluation.

Thanks,
Wes

RE: [C++] Runtime SIMD dispatching for Arrow

Posted by "Du, Frank" <fr...@intel.com>.
Yes, it would be best to have a dedicated AVX512 device. Great news that you are working on the machine 😊

Thanks,
Frank

-----Original Message-----
From: Wes McKinney <we...@gmail.com> 
Sent: Monday, September 7, 2020 12:41 AM
To: dev <de...@arrow.apache.org>
Subject: Re: [C++] Runtime SIMD dispatching for Arrow

I might be able to contribute an AVX-512 capable machine for testing / benchmarking via Buildkite or similar in the next 6 months. It seems like dedicated hardware would be the best approach to get consistency there. If someone else would be able to contribute a reliable machine that would also be useful to know.

On Thu, Sep 3, 2020 at 10:29 PM Du, Frank <fr...@intel.com> wrote:
>
> Just want to give some updates on the dispatching.
>
> Now we have workable runtime dispatch functionality, including the dispatch mechanism[1][2] and the build framework, for both the compute kernels and other parts of the C++ code. There is some remaining static SIMD compiler code in the code base that I will work on later to convert to the runtime path.
>
> The last issue I see is the CI part. There is an environment variable, ARROW_RUNTIME_SIMD_LEVEL[3], that can already be leveraged to test each SIMD level, but we lack a CI device that always supports AVX512. I did some tests to check which CI machines have AVX512 capability and found that the four tasks below are indeed capable, but unfortunately it's not 100%: there is roughly a 70%~80% chance a task is scheduled to an AVX512 device.
>         C++ / AMD64 Windows 2019 C++
>         Python / AMD64 Conda Python 3.6 Pandas latest
>         Python / AMD64 Conda Python 3.6 Pandas 0.23
>         C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN
> I plan to add SIMD test tasks at the AVX512/AVX2/SSE4_2/NONE levels on "C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN" and "C++ / AMD64 Windows 2019 C++", even though those tasks are not always scheduled to machines with AVX512. Any ideas or thoughts?
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/dispatch.h
> [2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L561
> [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu_info.cc#L451
>
> Thanks,
> Frank
>
> -----Original Message-----
> From: Wes McKinney <we...@gmail.com>
> Sent: Wednesday, May 13, 2020 9:39 PM
> To: dev <de...@arrow.apache.org>; Micah Kornfield 
> <em...@gmail.com>
> Subject: Re: [C++] Runtime SIMD dispatching for Arrow
>
> On Tue, May 12, 2020 at 11:12 PM Micah Kornfield <em...@gmail.com> wrote:
> >
> > >
> > > Since I develop on an AVX512-capable machine, if we have runtime 
> > > dispatching then it should be able to test all variants of a 
> > > function from a single executable / test run rather than having to 
> > > produce multiple builds and test them separately, right?
> >
> > Yes, but I think the same is true without runtime dispatching.  We 
> > might have different mental models for runtime dispatching so I'll 
> > put up a concrete example.  If we want optimized code for "some_function"
> > it would look like:
> >
> > #ifdef HAVE_AVX512
> > void some_function_512() {
> > ...
> > }
> > #endif
> >
> > void some_function_base() {
> > ...
> > }
> >
> > // static dispatching
> > void some_function() {
> > #ifdef HAVE_AVX512
> > some_function_512();
> > #else
> > some_function_base();
> > #endif
> > }
> >
> > // dynamic dispatch
> > void some_function() {
> >    static void (*chosen_function)() = Choose(cpu_info,
> > &some_function_512, &some_function_base);
> >    chosen_function();
> > }
> >
> > In both cases, we need to have tests which call into
> > some_function_512() and some_function_base().  It is possible with 
> > runtime dispatching we can write code in tests as something like:
> >
> > for (CpuInfo info : all_supported_architectures) {
> >     TEST(Choose(info, &some_function_512, &some_function_base));
> > }
> >
> > But I think there is likely something equivalent that we could do
> > with macro magic.
>
> That's one way. Or it could have a default configuration set external 
> to the binary, similar to things like OMP_NUM_THREADS
>
> ARROW_RUNTIME_SIMD_LEVEL=none ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=sse4.2 ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=avx2 ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=avx512 ctest -L unittest
>
> Either way it seems like a good idea to reduce the number of #ifdefs
> in the codebase and the need to recompile.
>
> > Did you have something different in mind?
> >
> > Micah
> >
> >
> >
> >
> >
> > On Tue, May 12, 2020 at 8:31 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yi...@arm.com> wrote:
> > > >
> > > > Thanks Wes, I'm glad to see this feature coming.
> > > >
> > > > From past discussions, the main concern is that a runtime
> > > > dispatcher may cause a performance issue.
> > > > Personally, I don't think it's a big problem. If we're using
> > > > SIMD, it must be targeting some time-consuming code.
> > > >
> > > > But we do need to take care of some issues. E.g., I see code like this:
> > > > for (int i = 0; i < n; ++i) {
> > > >    simd_code();
> > > > }
> > > > With a runtime dispatcher, this becomes an indirect function call
> > > > in each iteration.
> > > > We should change the code to move the loop inside simd_code().
> > >
> > > To be clear, I'm referring to SIMD-optimized code that operates on 
> > > batches of data. The overhead of choosing an implementation based 
> > > on a global settings object should not be meaningful. If there is 
> > > performance-sensitive code at inline call sites then I agree that 
> > > it is an issue. I don't think that characterizes most of the 
> > > anticipated work in Arrow, though, since functions generally will 
> > > process a chunk/array of data at a time (see, e.g., Parquet 
> > > encoding/decoding work recently).
> > >
> > > > It would be better if you can consider architectures other than
> > > > x86 (at the framework level).
> > > > Ignore it if it costs much effort. We can always improve later.
> > > >
> > > > Yibo
> > > >
> > > > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > > > hi,
> > > > >
> > > > > We've started to receive a number of patches providing SIMD 
> > > > > operations for both x86 and ARM architectures. Most of these 
> > > > > patches make use of compiler definitions to toggle between code paths at compile time.
> > > > >
> > > > > This is problematic for a few reasons:
> > > > >
> > > > > * Binaries that are shipped (e.g. in Python) must generally be 
> > > > > compiled for a broad set of supported compilers. That means 
> > > > > that
> > > > > AVX2 / AVX512 optimizations won't be available in these builds 
> > > > > for processors that have them
> > > > > * Poses a maintainability and testing problem (hard to test 
> > > > > every combination, and it is not practical for local 
> > > > > development to compile every combination, which may cause 
> > > > > drawn out test/CI/fix cycles)
> > > > >
> > > > > Other projects (e.g. NumPy) have taken the approach of 
> > > > > building binaries that contain multiple variants of a function 
> > > > > with different levels of SIMD, and then choosing at runtime 
> > > > > which one to execute based on what features the CPU supports. 
> > > > > This seems like what we ultimately need to do in Apache Arrow, 
> > > > > and if we continue to accept patches that do not do this, it 
> > > > > will be much more work later when we have to refactor things to runtime dispatching.
> > > > >
> > > > > We have some PRs in the queue related to SIMD. Without taking 
> > > > > a heavy handed approach like starting to veto PRs, how would 
> > > > > everyone like to begin to address the runtime dispatching problem?
> > > > >
> > > > > Note that the Kernels revamp project I am working on right now 
> > > > > will also facilitate runtime SIMD kernel dispatching for array 
> > > > > expression evaluation.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > >

Re: [C++] Runtime SIMD dispatching for Arrow

Posted by Wes McKinney <we...@gmail.com>.
I might be able to contribute an AVX-512 capable machine for testing /
benchmarking via Buildkite or similar in the next 6 months. It seems
like dedicated hardware would be the best approach to get consistency
there. If someone else would be able to contribute a reliable machine
that would also be useful to know.

On Thu, Sep 3, 2020 at 10:29 PM Du, Frank <fr...@intel.com> wrote:
>
> Just want to give some updates on the dispatching.
>
> Now we have workable runtime dispatch functionality, including the dispatch mechanism[1][2] and the build framework, for both the compute kernels and other parts of the C++ code. There is some remaining static SIMD compiler code in the code base that I will work on later to convert to the runtime path.
>
> The last issue I see is the CI part. There is an environment variable, ARROW_RUNTIME_SIMD_LEVEL[3], that can already be leveraged to test each SIMD level, but we lack a CI device that always supports AVX512. I did some tests to check which CI machines have AVX512 capability and found that the four tasks below are indeed capable, but unfortunately it's not 100%: there is roughly a 70%~80% chance a task is scheduled to an AVX512 device.
>         C++ / AMD64 Windows 2019 C++
>         Python / AMD64 Conda Python 3.6 Pandas latest
>         Python / AMD64 Conda Python 3.6 Pandas 0.23
>         C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN
> I plan to add SIMD test tasks at the AVX512/AVX2/SSE4_2/NONE levels on "C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN" and "C++ / AMD64 Windows 2019 C++", even though those tasks are not always scheduled to machines with AVX512. Any ideas or thoughts?
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/dispatch.h
> [2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L561
> [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu_info.cc#L451
>
> Thanks,
> Frank
>
> -----Original Message-----
> From: Wes McKinney <we...@gmail.com>
> Sent: Wednesday, May 13, 2020 9:39 PM
> To: dev <de...@arrow.apache.org>; Micah Kornfield <em...@gmail.com>
> Subject: Re: [C++] Runtime SIMD dispatching for Arrow
>
> On Tue, May 12, 2020 at 11:12 PM Micah Kornfield <em...@gmail.com> wrote:
> >
> > >
> > > Since I develop on an AVX512-capable machine, if we have runtime
> > > dispatching then it should be able to test all variants of a
> > > function from a single executable / test run rather than having to
> > > produce multiple builds and test them separately, right?
> >
> > Yes, but I think the same is true without runtime dispatching.  We
> > might have different mental models for runtime dispatching so I'll put
> > up a concrete example.  If we want optimized code for "some_function"
> > it would look like:
> >
> > #ifdef HAVE_AVX512
> > void some_function_512() {
> > ...
> > }
> > #endif
> >
> > void some_function_base() {
> > ...
> > }
> >
> > // static dispatching
> > void some_function() {
> > #ifdef HAVE_AVX512
> > some_function_512();
> > #else
> > some_function_base();
> > #endif
> > }
> >
> > // dynamic dispatch
> > void some_function() {
> >    static void (*chosen_function)() = Choose(cpu_info,
> > &some_function_512, &some_function_base);
> >    chosen_function();
> > }
> >
> > In both cases, we need to have tests which call into
> > some_function_512() and some_function_base().  It is possible with
> > runtime dispatching we can write code in tests as something like:
> >
> > for (CpuInfo info : all_supported_architectures) {
> >     TEST(Choose(info, &some_function_512, &some_function_base));
> > }
> >
> > But I think there is likely something equivalent that we could do
> > with macro magic.
>
> That's one way. Or it could have a default configuration set external to the binary, similar to things like OMP_NUM_THREADS
>
> ARROW_RUNTIME_SIMD_LEVEL=none ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=sse4.2 ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=avx2 ctest -L unittest
> ARROW_RUNTIME_SIMD_LEVEL=avx512 ctest -L unittest
>
> Either way it seems like a good idea to reduce the number of #ifdefs in the codebase and the need to recompile.
>
> > Did you have something different in mind?
> >
> > Micah
> >
> >
> >
> >
> >
> > On Tue, May 12, 2020 at 8:31 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yi...@arm.com> wrote:
> > > >
> > > > Thanks Wes, I'm glad to see this feature coming.
> > > >
> > > > From past discussions, the main concern is that a runtime
> > > > dispatcher may cause a performance issue.
> > > > Personally, I don't think it's a big problem. If we're using
> > > > SIMD, it must be targeting some time-consuming code.
> > > >
> > > > But we do need to take care of some issues. E.g., I see code like this:
> > > > for (int i = 0; i < n; ++i) {
> > > >    simd_code();
> > > > }
> > > > With a runtime dispatcher, this becomes an indirect function call
> > > > in each iteration.
> > > > We should change the code to move the loop inside simd_code().
> > >
> > > To be clear, I'm referring to SIMD-optimized code that operates on
> > > batches of data. The overhead of choosing an implementation based on
> > > a global settings object should not be meaningful. If there is
> > > performance-sensitive code at inline call sites then I agree that it
> > > is an issue. I don't think that characterizes most of the
> > > anticipated work in Arrow, though, since functions generally will
> > > process a chunk/array of data at a time (see, e.g., Parquet
> > > encoding/decoding work recently).
> > >
> > > > It would be better if you can consider architectures other than
> > > > x86 (at the framework level).
> > > > Ignore it if it costs much effort. We can always improve later.
> > > >
> > > > Yibo
> > > >
> > > > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > > > hi,
> > > > >
> > > > > We've started to receive a number of patches providing SIMD
> > > > > operations for both x86 and ARM architectures. Most of these
> > > > > patches make use of compiler definitions to toggle between code paths at compile time.
> > > > >
> > > > > This is problematic for a few reasons:
> > > > >
> > > > > * Binaries that are shipped (e.g. in Python) must generally be
> > > > > compiled for a broad set of supported compilers. That means that
> > > > > AVX2 / AVX512 optimizations won't be available in these builds
> > > > > for processors that have them
> > > > > * Poses a maintainability and testing problem (hard to test
> > > > > every combination, and it is not practical for local development
> > > > > to compile every combination, which may cause drawn out
> > > > > test/CI/fix cycles)
> > > > >
> > > > > Other projects (e.g. NumPy) have taken the approach of building
> > > > > binaries that contain multiple variants of a function with
> > > > > different levels of SIMD, and then choosing at runtime which one
> > > > > to execute based on what features the CPU supports. This seems
> > > > > like what we ultimately need to do in Apache Arrow, and if we
> > > > > continue to accept patches that do not do this, it will be much
> > > > > more work later when we have to refactor things to runtime dispatching.
> > > > >
> > > > > We have some PRs in the queue related to SIMD. Without taking a
> > > > > heavy handed approach like starting to veto PRs, how would
> > > > > everyone like to begin to address the runtime dispatching problem?
> > > > >
> > > > > Note that the Kernels revamp project I am working on right now
> > > > > will also facilitate runtime SIMD kernel dispatching for array
> > > > > expression evaluation.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > >

RE: [C++] Runtime SIMD dispatching for Arrow

Posted by "Du, Frank" <fr...@intel.com>.
Just want to give some updates on the dispatching.

Now we have workable runtime dispatch functionality, including the dispatch mechanism[1][2] and the build framework, for both the compute kernels and other parts of the C++ code. There is some remaining static SIMD compiler code in the code base that I will work on later to convert to the runtime path.

The last issue I see is the CI part. There is an environment variable, ARROW_RUNTIME_SIMD_LEVEL[3], that can already be leveraged to test each SIMD level, but we lack a CI device that always supports AVX512. I did some tests to check which CI machines have AVX512 capability and found that the four tasks below are indeed capable, but unfortunately it's not 100%: there is roughly a 70%~80% chance a task is scheduled to an AVX512 device.
	C++ / AMD64 Windows 2019 C++
	Python / AMD64 Conda Python 3.6 Pandas latest
	Python / AMD64 Conda Python 3.6 Pandas 0.23
	C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN
I plan to add SIMD test tasks at the AVX512/AVX2/SSE4_2/NONE levels on "C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN" and "C++ / AMD64 Windows 2019 C++", even though those tasks are not always scheduled to machines with AVX512. Any ideas or thoughts?

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/dispatch.h
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L561
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/cpu_info.cc#L451

Thanks,
Frank

-----Original Message-----
From: Wes McKinney <we...@gmail.com> 
Sent: Wednesday, May 13, 2020 9:39 PM
To: dev <de...@arrow.apache.org>; Micah Kornfield <em...@gmail.com>
Subject: Re: [C++] Runtime SIMD dispatching for Arrow

On Tue, May 12, 2020 at 11:12 PM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > Since I develop on an AVX512-capable machine, if we have runtime 
> > dispatching then it should be able to test all variants of a 
> > function from a single executable / test run rather than having to 
> > produce multiple builds and test them separately, right?
>
> Yes, but I think the same is true without runtime dispatching.  We 
> might have different mental models for runtime dispatching so I'll put 
> up a concrete example.  If we want optimized code for "some_function" 
> it would look like:
>
> #ifdef HAVE_AVX512
> void some_function_512() {
> ...
> }
> #endif
>
> void some_function_base() {
> ...
> }
>
> // static dispatching
> void some_function() {
> #ifdef HAVE_AVX512
> some_function_512();
> #else
> some_function_base();
> #endif
> }
>
> // dynamic dispatch
> void some_function() {
>    static void (*chosen_function)() = Choose(cpu_info,
> &some_function_512, &some_function_base);
>    chosen_function();
> }
>
> In both cases, we need to have tests which call into 
> some_function_512() and some_function_base().  It is possible with 
> runtime dispatching we can write code in tests as something like:
>
> for (CpuInfo info : all_supported_architectures) {
>     TEST(Choose(info, &some_function_512, &some_function_base));
> }
>
> But I think there is likely something equivalent that we could do 
> with macro magic.

That's one way. Or it could have a default configuration set external to the binary, similar to things like OMP_NUM_THREADS

ARROW_RUNTIME_SIMD_LEVEL=none ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=sse4.2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx512 ctest -L unittest

> Either way it seems like a good idea to reduce the number of #ifdefs in the codebase and the need to recompile.

> Did you have something different in mind?
>
> Micah
>
>
>
>
>
> On Tue, May 12, 2020 at 8:31 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yi...@arm.com> wrote:
> > >
> > > Thanks Wes, I'm glad to see this feature coming.
> > >
> > > From past discussions, the main concern is that a runtime
> > > dispatcher may cause a performance issue.
> > > Personally, I don't think it's a big problem. If we're using SIMD,
> > > it must be targeting some time-consuming code.
> > >
> > > But we do need to take care of some issues. E.g., I see code like this:
> > > for (int i = 0; i < n; ++i) {
> > >    simd_code();
> > > }
> > > With a runtime dispatcher, this becomes an indirect function call
> > > in each iteration.
> > > We should change the code to move the loop inside simd_code().
> >
> > To be clear, I'm referring to SIMD-optimized code that operates on 
> > batches of data. The overhead of choosing an implementation based on 
> > a global settings object should not be meaningful. If there is 
> > performance-sensitive code at inline call sites then I agree that it 
> > is an issue. I don't think that characterizes most of the 
> > anticipated work in Arrow, though, since functions generally will 
> > process a chunk/array of data at a time (see, e.g., Parquet 
> > encoding/decoding work recently).
> >
> > > It would be better if you can consider architectures other than
> > > x86 (at the framework level).
> > > Ignore it if it costs much effort. We can always improve later.
> > >
> > > Yibo
> > >
> > > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > > hi,
> > > >
> > > > We've started to receive a number of patches providing SIMD 
> > > > operations for both x86 and ARM architectures. Most of these 
> > > > patches make use of compiler definitions to toggle between code paths at compile time.
> > > >
> > > > This is problematic for a few reasons:
> > > >
> > > > * Binaries that are shipped (e.g. in Python) must generally be 
> > > > compiled for a broad set of supported compilers. That means that 
> > > > AVX2 / AVX512 optimizations won't be available in these builds 
> > > > for processors that have them
> > > > * Poses a maintainability and testing problem (hard to test 
> > > > every combination, and it is not practical for local development 
> > > > to compile every combination, which may cause drawn out 
> > > > test/CI/fix cycles)
> > > >
> > > > Other projects (e.g. NumPy) have taken the approach of building 
> > > > binaries that contain multiple variants of a function with 
> > > > different levels of SIMD, and then choosing at runtime which one 
> > > > to execute based on what features the CPU supports. This seems 
> > > > like what we ultimately need to do in Apache Arrow, and if we 
> > > > continue to accept patches that do not do this, it will be much 
> > > > more work later when we have to refactor things to runtime dispatching.
> > > >
> > > > We have some PRs in the queue related to SIMD. Without taking a 
> > > > heavy handed approach like starting to veto PRs, how would 
> > > > everyone like to begin to address the runtime dispatching problem?
> > > >
> > > > Note that the Kernels revamp project I am working on right now 
> > > > will also facilitate runtime SIMD kernel dispatching for array 
> > > > expression evaluation.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> >

Re: [C++] Runtime SIMD dispatching for Arrow

Posted by Wes McKinney <we...@gmail.com>.
On Tue, May 12, 2020 at 11:12 PM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > Since I develop on an AVX512-capable machine, if we have runtime
> > dispatching then it should be able to test all variants of a function
> > from a single executable / test run rather than having to produce
> > multiple builds and test them separately, right?
>
> Yes, but I think the same is true without runtime dispatching.  We might
> have different mental models for runtime dispatching so I'll put up a
> concrete example.  If we want optimized code for "some_function" it would
> look like:
>
> #ifdef HAVE_AVX512
> void some_function_512() {
> ...
> }
> #endif
>
> void some_function_base() {
> ...
> }
>
> // static dispatching
> void some_function() {
> #ifdef HAVE_AVX512
> some_function_512();
> #else
> some_function_base();
> #endif
> }
>
> // dynamic dispatch
> void some_function() {
>    static void (*chosen_function)() = Choose(cpu_info, &some_function_512,
> &some_function_base);
>    chosen_function();
> }
>
> In both cases, we need to have tests which call into some_function_512()
> and some_function_base().  It is possible with runtime dispatching we can
> write code in tests as something like:
>
> for (CpuInfo info : all_supported_architectures) {
>     TEST(Choose(info, &some_function_512, &some_function_base));
> }
>
> But I think there is likely something equivalent that we could do with
> macro magic.

That's one way. Or it could have a default configuration set external
to the binary, similar to things like OMP_NUM_THREADS

ARROW_RUNTIME_SIMD_LEVEL=none ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=sse4.2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx2 ctest -L unittest
ARROW_RUNTIME_SIMD_LEVEL=avx512 ctest -L unittest

Either way it seems like a good idea to reduce the number of #ifdefs in the
codebase and the need to recompile.

> Did you have something different in mind?
>
> Micah
>
>
>
>
>
> On Tue, May 12, 2020 at 8:31 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yi...@arm.com> wrote:
> > >
> > > Thanks Wes, I'm glad to see this feature coming.
> > >
> > >  From history talks, the main concern is runtime dispatcher may cause
> > performance issue.
> > > Personally, I don't think it's a big problem. If we're using SIMD, it
> > must be targeting some time consuming code.
> > >
> > > But we do need to take care some issues. E.g, I see code like this:
> > > for (int i = 0; i < n; ++i) {
> > >    simd_code();
> > > }
> > > With runtime dispatcher, it becomes an indirect function call in each
> > iteration.
> > > We should change the code to move the loop inside simd_code().
> >
> > To be clear, I'm referring to SIMD-optimized code that operates on
> > batches of data. The overhead of choosing an implementation based on a
> > global settings object should not be meaningful. If there is
> > performance-sensitive code at inline call sites then I agree that it
> > is an issue. I don't think that characterizes most of the anticipated
> > work in Arrow, though, since functions generally will process a
> > chunk/array of data at time (see, e.g. Parquet encoding/decoding work
> > recently).
> >
> > > It would be better if you can consider architectures other than x86(at
> > framework level).
> > > Ignore it if it costs much effort. We can always improve later.
> > >
> > > Yibo
> > >
> > > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > > hi,
> > > >
> > > > We've started to receive a number of patches providing SIMD operations
> > > > for both x86 and ARM architectures. Most of these patches make use of
> > > > compiler definitions to toggle between code paths at compile time.
> > > >
> > > > This is problematic for a few reasons:
> > > >
> > > > * Binaries that are shipped (e.g. in Python) must generally be
> > > > compiled for a broad set of supported compilers. That means that AVX2
> > > > / AVX512 optimizations won't be available in these builds for
> > > > processors that have them
> > > > * Poses a maintainability and testing problem (hard to test every
> > > > combination, and it is not practical for local development to compile
> > > > every combination, which may cause drawn out test/CI/fix cycles)
> > > >
> > > > Other projects (e.g. NumPy) have taken the approach of building
> > > > binaries that contain multiple variants of a function with different
> > > > levels of SIMD, and then choosing at runtime which one to execute
> > > > based on what features the CPU supports. This seems like what we
> > > > ultimately need to do in Apache Arrow, and if we continue to accept
> > > > patches that do not do this, it will be much more work later when we
> > > > have to refactor things to runtime dispatching.
> > > >
> > > > We have some PRs in the queue related to SIMD. Without taking a heavy
> > > > handed approach like starting to veto PRs, how would everyone like to
> > > > begin to address the runtime dispatching problem?
> > > >
> > > > Note that the Kernels revamp project I am working on right now will
> > > > also facilitate runtime SIMD kernel dispatching for array expression
> > > > evaluation.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> >

Re: [C++] Runtime SIMD dispatching for Arrow

Posted by Micah Kornfield <em...@gmail.com>.
>
> Since I develop on an AVX512-capable machine, if we have runtime
> dispatching then it should be able to test all variants of a function
> from a single executable / test run rather than having to produce
> multiple builds and test them separately, right?

Yes, but I think the same is true without runtime dispatching.  We might
have different mental models for runtime dispatching so I'll put up a
concrete example.  If we want optimized code for "some_function" it would
look like:

#ifdef HAVE_AVX512
void some_function_512() {
...
}
#endif

void some_function_base() {
...
}

// static dispatching
void some_function() {
#ifdef HAVE_AVX512
some_function_512();
#else
some_function_base();
#endif
}

// dynamic dispatch
void some_function() {
   static void (*chosen_function)() = Choose(cpu_info, &some_function_512,
&some_function_base);
   chosen_function();
}

In both cases, we need to have tests which call into some_function_512()
and some_function_base().  It is possible with runtime dispatching we can
write code in tests as something like:

for (CpuInfo info : all_supported_architectures) {
    TEST(Choose(info, &some_function_512, &some_function_base));
}

But I think there is likely something equivalent that we could do with
macro magic.

Did you have something different in mind?

Micah





On Tue, May 12, 2020 at 8:31 PM Wes McKinney <we...@gmail.com> wrote:

> On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yi...@arm.com> wrote:
> >
> > Thanks Wes, I'm glad to see this feature coming.
> >
> > From past discussions, the main concern is that a runtime dispatcher
> > may cause a performance issue.
> > Personally, I don't think it's a big problem. If we're using SIMD, it
> > must be targeting some time-consuming code.
> >
> > But we do need to take care of some issues. E.g., I see code like this:
> > for (int i = 0; i < n; ++i) {
> >    simd_code();
> > }
> > With a runtime dispatcher, this becomes an indirect function call in
> > each iteration.
> > We should change the code to move the loop inside simd_code().
>
> To be clear, I'm referring to SIMD-optimized code that operates on
> batches of data. The overhead of choosing an implementation based on a
> global settings object should not be meaningful. If there is
> performance-sensitive code at inline call sites then I agree that it
> is an issue. I don't think that characterizes most of the anticipated
> work in Arrow, though, since functions generally will process a
> chunk/array of data at time (see, e.g. Parquet encoding/decoding work
> recently).
>
> > It would be better if you can consider architectures other than x86(at
> framework level).
> > Ignore it if it costs much effort. We can always improve later.
> >
> > Yibo
> >
> > On 5/13/20 9:46 AM, Wes McKinney wrote:
> > > hi,
> > >
> > > We've started to receive a number of patches providing SIMD operations
> > > for both x86 and ARM architectures. Most of these patches make use of
> > > compiler definitions to toggle between code paths at compile time.
> > >
> > > This is problematic for a few reasons:
> > >
> > > * Binaries that are shipped (e.g. in Python) must generally be
> > > compiled for a broad set of supported compilers. That means that AVX2
> > > / AVX512 optimizations won't be available in these builds for
> > > processors that have them
> > > * Poses a maintainability and testing problem (hard to test every
> > > combination, and it is not practical for local development to compile
> > > every combination, which may cause drawn out test/CI/fix cycles)
> > >
> > > Other projects (e.g. NumPy) have taken the approach of building
> > > binaries that contain multiple variants of a function with different
> > > levels of SIMD, and then choosing at runtime which one to execute
> > > based on what features the CPU supports. This seems like what we
> > > ultimately need to do in Apache Arrow, and if we continue to accept
> > > patches that do not do this, it will be much more work later when we
> > > have to refactor things to runtime dispatching.
> > >
> > > We have some PRs in the queue related to SIMD. Without taking a heavy
> > > handed approach like starting to veto PRs, how would everyone like to
> > > begin to address the runtime dispatching problem?
> > >
> > > Note that the Kernels revamp project I am working on right now will
> > > also facilitate runtime SIMD kernel dispatching for array expression
> > > evaluation.
> > >
> > > Thanks,
> > > Wes
> > >
>

Re: [C++] Runtime SIMD dispatching for Arrow

Posted by Wes McKinney <we...@gmail.com>.
On Tue, May 12, 2020 at 9:47 PM Yibo Cai <yi...@arm.com> wrote:
>
> Thanks Wes, I'm glad to see this feature coming.
>
>  From history talks, the main concern is runtime dispatcher may cause performance issue.
> Personally, I don't think it's a big problem. If we're using SIMD, it must be targeting some time consuming code.
>
> But we do need to take care some issues. E.g, I see code like this:
> for (int i = 0; i < n; ++i) {
>    simd_code();
> }
> With runtime dispatcher, it becomes an indirect function call in each iteration.
> We should change the code to move the loop inside simd_code().

To be clear, I'm referring to SIMD-optimized code that operates on
batches of data. The overhead of choosing an implementation based on a
global settings object should not be meaningful. If there is
performance-sensitive code at inline call sites then I agree that it
is an issue. I don't think that characterizes most of the anticipated
work in Arrow, though, since functions generally will process a
chunk/array of data at a time (see, e.g., the recent Parquet
encoding/decoding work).

> It would be better if you can consider architectures other than x86(at framework level).
> Ignore it if it costs much effort. We can always improve later.
>
> Yibo
>
> On 5/13/20 9:46 AM, Wes McKinney wrote:
> > hi,
> >
> > We've started to receive a number of patches providing SIMD operations
> > for both x86 and ARM architectures. Most of these patches make use of
> > compiler definitions to toggle between code paths at compile time.
> >
> > This is problematic for a few reasons:
> >
> > * Binaries that are shipped (e.g. in Python) must generally be
> > compiled for a broad set of supported compilers. That means that AVX2
> > / AVX512 optimizations won't be available in these builds for
> > processors that have them
> > * Poses a maintainability and testing problem (hard to test every
> > combination, and it is not practical for local development to compile
> > every combination, which may cause drawn out test/CI/fix cycles)
> >
> > Other projects (e.g. NumPy) have taken the approach of building
> > binaries that contain multiple variants of a function with different
> > levels of SIMD, and then choosing at runtime which one to execute
> > based on what features the CPU supports. This seems like what we
> > ultimately need to do in Apache Arrow, and if we continue to accept
> > patches that do not do this, it will be much more work later when we
> > have to refactor things to runtime dispatching.
> >
> > We have some PRs in the queue related to SIMD. Without taking a heavy
> > handed approach like starting to veto PRs, how would everyone like to
> > begin to address the runtime dispatching problem?
> >
> > Note that the Kernels revamp project I am working on right now will
> > also facilitate runtime SIMD kernel dispatching for array expression
> > evaluation.
> >
> > Thanks,
> > Wes
> >

Re: [C++] Runtime SIMD dispatching for Arrow

Posted by Yibo Cai <yi...@arm.com>.
Thanks Wes, I'm glad to see this feature coming.

 From past discussions, the main concern is that a runtime dispatcher may cause performance issues.
Personally, I don't think it's a big problem. If we're using SIMD, it must be targeting some time-consuming code.

But we do need to take care of some issues. E.g., I see code like this:
for (int i = 0; i < n; ++i) {
   simd_code();
}
With runtime dispatcher, it becomes an indirect function call in each iteration.
We should change the code to move the loop inside simd_code().
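A minimal sketch of that restructuring (names are illustrative; the "SIMD" variant is a stub): resolve the dispatch once per batch, and keep the per-element loop inside the chosen variant so the indirect-call cost is amortized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Two variants of a per-chunk operation; the loop lives inside each one.
void ScaleBase(int32_t* data, size_t n, int32_t k) {
  for (size_t i = 0; i < n; ++i) data[i] *= k;
}
// Stub: a real variant would use SIMD intrinsics.
void ScaleSimd(int32_t* data, size_t n, int32_t k) { ScaleBase(data, n, k); }

using ScaleFn = void (*)(int32_t*, size_t, int32_t);

ScaleFn ChooseScale(bool has_simd) {
  return has_simd ? ScaleSimd : ScaleBase;
}

// Good pattern: one dispatch per buffer, not one per element. The
// anti-pattern would be calling ChooseScale (or an indirect pointer)
// inside a per-element loop.
void ScaleBuffer(int32_t* data, size_t n, int32_t k, bool has_simd) {
  ScaleFn fn = ChooseScale(has_simd);  // resolved once
  fn(data, n, k);                      // loop runs inside the variant
}
```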

It would be better if you can consider architectures other than x86 (at the framework level).
Ignore it if it costs too much effort. We can always improve later.

Yibo

On 5/13/20 9:46 AM, Wes McKinney wrote:
> hi,
> 
> We've started to receive a number of patches providing SIMD operations
> for both x86 and ARM architectures. Most of these patches make use of
> compiler definitions to toggle between code paths at compile time.
> 
> This is problematic for a few reasons:
> 
> * Binaries that are shipped (e.g. in Python) must generally be
> compiled for a broad set of supported compilers. That means that AVX2
> / AVX512 optimizations won't be available in these builds for
> processors that have them
> * Poses a maintainability and testing problem (hard to test every
> combination, and it is not practical for local development to compile
> every combination, which may cause drawn out test/CI/fix cycles)
> 
> Other projects (e.g. NumPy) have taken the approach of building
> binaries that contain multiple variants of a function with different
> levels of SIMD, and then choosing at runtime which one to execute
> based on what features the CPU supports. This seems like what we
> ultimately need to do in Apache Arrow, and if we continue to accept
> patches that do not do this, it will be much more work later when we
> have to refactor things to runtime dispatching.
> 
> We have some PRs in the queue related to SIMD. Without taking a heavy
> handed approach like starting to veto PRs, how would everyone like to
> begin to address the runtime dispatching problem?
> 
> Note that the Kernels revamp project I am working on right now will
> also facilitate runtime SIMD kernel dispatching for array expression
> evaluation.
> 
> Thanks,
> Wes
> 

Re: [C++] Runtime SIMD dispatching for Arrow

Posted by Wes McKinney <we...@gmail.com>.
On Tue, May 12, 2020 at 10:19 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Wes,
> I think you highlighted the two issues well, but I think they are somewhat
> orthogonal and runtime dispatching only addresses the binary availability
> of the optimizations (but actually makes testing harder because it can
> potentially hide untested code paths).

Since I develop on an AVX512-capable machine, if we have runtime
dispatching then we should be able to test all variants of a function
from a single executable / test run rather than having to produce
multiple builds and test them separately, right?

Presumably the SIMD-level at runtime would be configurable, so you
could either let it be automatically selected based on your CPU
capabilities or set manually (e.g. if you want to do perf testing with
SIMD vs. no SIMD at runtime).
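One possible shape for such a runtime override (purely illustrative; these names are not actual Arrow APIs): a settings object caps the dispatch level, and the effective level is the minimum of the CPU's best capability and the cap. A single test binary can then iterate over levels.

```cpp
#include <cassert>
#include <string>

// Hypothetical SIMD levels, ordered from least to most capable.
enum class SimdLevel { kNone, kAvx2, kAvx512 };

// Hypothetical settings object; max_level might be read from an
// environment variable or set programmatically for perf testing.
struct DispatchOptions {
  SimdLevel max_level = SimdLevel::kAvx512;
};

// Returns the name of the kernel variant that would be dispatched:
// the lesser of what the CPU supports and what the user allows.
std::string KernelName(SimdLevel cpu_best, const DispatchOptions& opts) {
  SimdLevel chosen = cpu_best < opts.max_level ? cpu_best : opts.max_level;
  switch (chosen) {
    case SimdLevel::kAvx512: return "sum_avx512";
    case SimdLevel::kAvx2:   return "sum_avx2";
    default:                 return "sum_base";
  }
}
```

With this shape, a test on an AVX512 machine can lower max_level step by step and exercise every variant in one run.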

> Personally, I think it is valuable to have SIMD optimization in the code
> base even if our binaries aren't shipped with them as long as we have
> sufficient regression testing.
>
> For testability, I think there are two issues:
> A.  Resources available to test architecture specific code -  To solve this
> issue I think we choose a "latest" architecture to target.  Community
> members that want to target a more modern architecture than the community
> agreed upon architecture  would have the onus to augment testing resources
> with that architecture.  The recent Big-Endian CI coverage is a good
> example of this.  I don't think it is heavy handed to reject PRs if we
> don't have sufficient CI coverage.
>
> B.  Ensuring we have a sufficient test coverage for all code paths.  I
> think this breaks down into how we structure our code.  I know I've
> submitted a recent PR that makes it difficult to test each path separately,
> I will try to address this before submission.  Note, that that structuring
> the code so that each path can be tested independently is a precursor to
> runtime dispatch.  Once we agree on a "latest" architecture, if the code is
> structured appropriately, we should get sufficient code coverage by
> targeting the community decided "latest" architecture for most builds (and
> not having to do a full matrix of architectural changes).
>
> Thanks,
> Micah
>
>
>
>
>
>
> On Tue, May 12, 2020 at 6:47 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi,
> >
> > We've started to receive a number of patches providing SIMD operations
> > for both x86 and ARM architectures. Most of these patches make use of
> > compiler definitions to toggle between code paths at compile time.
> >
> > This is problematic for a few reasons:
> >
> > * Binaries that are shipped (e.g. in Python) must generally be
> > compiled for a broad set of supported compilers. That means that AVX2
> > / AVX512 optimizations won't be available in these builds for
> > processors that have them
> > * Poses a maintainability and testing problem (hard to test every
> > combination, and it is not practical for local development to compile
> > every combination, which may cause drawn out test/CI/fix cycles)
> >
> > Other projects (e.g. NumPy) have taken the approach of building
> > binaries that contain multiple variants of a function with different
> > levels of SIMD, and then choosing at runtime which one to execute
> > based on what features the CPU supports. This seems like what we
> > ultimately need to do in Apache Arrow, and if we continue to accept
> > patches that do not do this, it will be much more work later when we
> > have to refactor things to runtime dispatching.
> >
> > We have some PRs in the queue related to SIMD. Without taking a heavy
> > handed approach like starting to veto PRs, how would everyone like to
> > begin to address the runtime dispatching problem?
> >
> > Note that the Kernels revamp project I am working on right now will
> > also facilitate runtime SIMD kernel dispatching for array expression
> > evaluation.
> >
> > Thanks,
> > Wes
> >

Re: [C++] Runtime SIMD dispatching for Arrow

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes,
I think you highlighted the two issues well, but I think they are somewhat
orthogonal and runtime dispatching only addresses the binary availability
of the optimizations (but actually makes testing harder because it can
potentially hide untested code paths).

Personally, I think it is valuable to have SIMD optimization in the code
base even if our binaries aren't shipped with them as long as we have
sufficient regression testing.

For testability, I think there are two issues:
A.  Resources available to test architecture-specific code - To solve this
issue I think we choose a "latest" architecture to target.  Community
members that want to target a more modern architecture than the community
agreed-upon architecture would have the onus to augment testing resources
with that architecture.  The recent Big-Endian CI coverage is a good
example of this.  I don't think it is heavy-handed to reject PRs if we
don't have sufficient CI coverage.

B.  Ensuring we have a sufficient test coverage for all code paths.  I
think this breaks down into how we structure our code.  I know I've
submitted a recent PR that makes it difficult to test each path separately;
I will try to address this before submission.  Note that structuring
the code so that each path can be tested independently is a precursor to
runtime dispatch.  Once we agree on a "latest" architecture, if the code is
structured appropriately, we should get sufficient code coverage by
targeting the community decided "latest" architecture for most builds (and
not having to do a full matrix of architectural changes).

Thanks,
Micah






On Tue, May 12, 2020 at 6:47 PM Wes McKinney <we...@gmail.com> wrote:

> hi,
>
> We've started to receive a number of patches providing SIMD operations
> for both x86 and ARM architectures. Most of these patches make use of
> compiler definitions to toggle between code paths at compile time.
>
> This is problematic for a few reasons:
>
> * Binaries that are shipped (e.g. in Python) must generally be
> compiled for a broad set of supported compilers. That means that AVX2
> / AVX512 optimizations won't be available in these builds for
> processors that have them
> * Poses a maintainability and testing problem (hard to test every
> combination, and it is not practical for local development to compile
> every combination, which may cause drawn out test/CI/fix cycles)
>
> Other projects (e.g. NumPy) have taken the approach of building
> binaries that contain multiple variants of a function with different
> levels of SIMD, and then choosing at runtime which one to execute
> based on what features the CPU supports. This seems like what we
> ultimately need to do in Apache Arrow, and if we continue to accept
> patches that do not do this, it will be much more work later when we
> have to refactor things to runtime dispatching.
>
> We have some PRs in the queue related to SIMD. Without taking a heavy
> handed approach like starting to veto PRs, how would everyone like to
> begin to address the runtime dispatching problem?
>
> Note that the Kernels revamp project I am working on right now will
> also facilitate runtime SIMD kernel dispatching for array expression
> evaluation.
>
> Thanks,
> Wes
>

RE: [C++] Runtime SIMD dispatching for Arrow

Posted by "Du, Frank" <fr...@intel.com>.
Hi,

I totally agree that Arrow should have built-in support for runtime dispatching facilities, just like other popular computing libraries, to fully utilize modern hardware capacity. We feel Arrow has great potential for performance gains with advanced CPU SIMD features.

It's OK for me to stop the current SIMD PR; my only concern is how long until a basic runtime policy is ready to leverage. Does the kernel refactoring include runtime dispatching already?

Thanks,
Frank

-----Original Message-----
From: Wes McKinney <we...@gmail.com> 
Sent: Wednesday, May 13, 2020 9:46 AM
To: dev <de...@arrow.apache.org>
Subject: [C++] Runtime SIMD dispatching for Arrow

hi,

We've started to receive a number of patches providing SIMD operations for both x86 and ARM architectures. Most of these patches make use of compiler definitions to toggle between code paths at compile time.

This is problematic for a few reasons:

* Binaries that are shipped (e.g. in Python) must generally be compiled for a broad set of supported compilers. That means that AVX2 / AVX512 optimizations won't be available in these builds for processors that have them
* Poses a maintainability and testing problem (hard to test every combination, and it is not practical for local development to compile every combination, which may cause drawn out test/CI/fix cycles)

Other projects (e.g. NumPy) have taken the approach of building binaries that contain multiple variants of a function with different levels of SIMD, and then choosing at runtime which one to execute based on what features the CPU supports. This seems like what we ultimately need to do in Apache Arrow, and if we continue to accept patches that do not do this, it will be much more work later when we have to refactor things to runtime dispatching.

We have some PRs in the queue related to SIMD. Without taking a heavy handed approach like starting to veto PRs, how would everyone like to begin to address the runtime dispatching problem?

Note that the Kernels revamp project I am working on right now will also facilitate runtime SIMD kernel dispatching for array expression evaluation.

Thanks,
Wes