Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2021/07/16 16:08:26 UTC

[DISCUSS][C++] Strategies for SIMD cross-compilation?

hi folks,

I had a conversation with the developers of xsimd last week in Paris
and was made aware that they are working on a substantial refactor of
xsimd to improve its usability for cross-compilation and
dynamic-dispatch based on runtime processor capabilities. The branch
with the refactor is located here:

https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring

In particular, the simd batch API is changing from

template <class T, size_t N>
class batch;

to

template <class T, class arch>
class batch;

So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch,
you would do xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g.
neon/neon64 for ARM ISAs) and then access the batch size through the
batch::size static property.
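
To illustrate the shape of the change, here is a toy, self-contained sketch of
the arch-tagged batch pattern. The tag types and bit widths below are
stand-ins for illustration, not the real xsimd headers:

```cpp
#include <cstddef>
#include <cstdint>

// Toy arch tags standing in for xsimd::arch::* (illustrative only).
struct avx2_tag   { static constexpr std::size_t bits = 256; };
struct avx512_tag { static constexpr std::size_t bits = 512; };
struct neon64_tag { static constexpr std::size_t bits = 128; };

// Arch-tagged batch: the lane count is derived from the arch tag
// rather than spelled out by the caller as batch<T, N>.
template <class T, class Arch>
struct batch {
    static constexpr std::size_t size = Arch::bits / (8 * sizeof(T));
};

// What batch<uint32_t, 16> expressed under the old API becomes, in spirit:
static_assert(batch<std::uint32_t, avx512_tag>::size == 16, "16 lanes");
static_assert(batch<std::uint32_t, avx2_tag>::size == 8, "8 lanes");
static_assert(batch<std::uint32_t, neon64_tag>::size == 4, "4 lanes");
```

The point of the pattern is that the caller names the ISA, and the lane
count follows from it, instead of the other way around.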

A few comments for discussion / investigation:

* Firstly, we will have to prepare ourselves to migrate to this new
API in the future

* At some point, we will likely want to generate SIMD-variants of our
C++ math kernels usable via dynamic dispatch for each different CPU
support level. It would be beneficial to author as much code in an
ISA-independent fashion that can be cross-compiled to generate binary
code for each ISA. We should investigate whether the new approach in
xsimd will provide what we need or if we need to take a different
approach.
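
As a rough sketch of what cross-compiling a single ISA-independent kernel
body might look like: the arch tags and lane counts below are hypothetical,
and the scalar inner loop stands in for real batch loads/adds/stores.

```cpp
#include <cstddef>

// Hypothetical arch tags; in the refactored xsimd these would be
// xsimd::arch::* types with a real batch<T, Arch> behind them.
struct sse2_arch   { static constexpr std::size_t lanes = 4; };
struct avx2_arch   { static constexpr std::size_t lanes = 8; };
struct avx512_arch { static constexpr std::size_t lanes = 16; };

// One kernel body, written once; instantiating it per arch tag (with
// each translation unit built under the matching compiler flags)
// yields one binary variant per ISA. The scalar inner loop stands in
// for batch<float, Arch> loads, adds, and stores.
template <class Arch>
void add_kernel(const float* a, const float* b, float* out, std::size_t n) {
    constexpr std::size_t step = Arch::lanes;
    std::size_t i = 0;
    for (; i + step <= n; i += step)      // full-width "vector" iterations
        for (std::size_t j = 0; j < step; ++j)
            out[i + j] = a[i + j] + b[i + j];
    for (; i < n; ++i)                    // scalar tail
        out[i] = a[i] + b[i];
}
```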

* We have some of our own dynamic dispatch code to enable runtime
function pointer selection based on available SIMD levels. Can we
benefit from any of the work that is happening in this xsimd refactor?
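
For concreteness, a minimal sketch of the function-pointer pattern in
question; the feature probe and the names here are hypothetical, not
Arrow's actual helpers:

```cpp
#include <cstddef>

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Baseline variant, always available.
void AddScalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// AVX2 variant; a real build would compile this translation unit with
// -mavx2 and use intrinsics or xsimd batches. Scalar here so the
// sketch compiles anywhere.
void AddAvx2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Hypothetical runtime probe; a real one would use cpuid on x86 or
// getauxval on Linux/ARM.
bool CpuHasAvx2() { return false; }

// Resolve once at startup; all call sites then go through the pointer.
AddFn ResolveAdd() { return CpuHasAvx2() ? AddAvx2 : AddScalar; }
```

The question is whether the per-arch instantiations that the xsimd
refactor produces can feed directly into a resolver like this, rather
than us maintaining the variants by hand.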

* We have some compute code (e.g. hash tables for aggregation / joins)
that uses explicit AVX2 intrinsics — can some of this code be ported
to use generic xsimd APIs or will we need to use a different
fundamental algorithm design to yield maximum efficiency for each SIMD
ISA?

Thanks,
Wes

Re: [DISCUSS][C++] Strategies for SIMD cross-compilation?

Posted by Wes McKinney <we...@gmail.com>.
Since there hasn't been much discussion, and we aren't ready to port
our existing use of xsimd to the new API, I suggest we return to this
topic when there is a push to develop more SIMD-enabled variants of
functions in the C++ library. I wanted to raise it while it was on my
mind to get people thinking about it. It seems that a lot of the
recent compute work has been about enabling essential feature
coverage.

On Sun, Jul 18, 2021 at 11:41 PM Yuqi Gu <gu...@apache.org> wrote:
>
> > So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch,
> > you would do xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g.
> > neon/neon64 for ARM ISAs) and then access the batch size through the
> > batch::size static property.
>
> Glad to see xsimd use 'Arch' as a parameter of 'batch'.
> For ARROW-11502 <https://github.com/apache/arrow/pull/9424>, I've
> submitted several PRs to xsimd to hide arch-dependent code from Arrow
> to avoid a large maintenance burden.
> But I found that it's hard to design an arch-independent API for a
> specific feature that covers all the different ISAs.
> Some features exist on x86 but not on Arm64, and vice versa;
> unifying these differences would add to the maintenance burden.
>
> I agree with Yibo that we should use the new xsimd approach for
> dynamic runtime dispatch for each different CPU support level.
>
> BRs,
> Yuqi
>
> Yibo Cai <yi...@arm.com> wrote on Monday, July 19, 2021 at 10:55 AM:
>
> >
> >
> > On 7/17/21 12:08 AM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > I had a conversation with the developers of xsimd last week in Paris
> > > and was made aware that they are working on a substantial refactor of
> > > xsimd to improve its usability for cross-compilation and
> > > dynamic-dispatch based on runtime processor capabilities. The branch
> > > with the refactor is located here:
> > >
> > > https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring
> > >
> > > In particular, the simd batch API is changing from
> > >
> > > template <class T, size_t N>
> > > class batch;
> > >
> > > to
> > >
> > > template <class T, class arch>
> > > class batch;
> > >
> > > So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch,
> > > you would do xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g.
> > > neon/neon64 for ARM ISAs) and then access the batch size through the
> > > batch::size static property.
> >
> > Adding this 'arch' parameter is a bit strange at first glance, given that
> > the purpose of a SIMD wrapper is to hide arch-dependent code.
> > But as the latest SIMD ISAs (SVE, AVX512) have much richer features than
> > simply widening the data width, it looks like arch-specific code is a must.
> > I think this change won't cause trouble for existing xsimd client code.
> >
> > >
> > > A few comments for discussion / investigation:
> > >
> > > * Firstly, we will have to prepare ourselves to migrate to this new
> > > API in the future
> > >
> > > * At some point, we will likely want to generate SIMD-variants of our
> > > C++ math kernels usable via dynamic dispatch for each different CPU
> > > support level. It would be beneficial to author as much code in an
> > > ISA-independent fashion that can be cross-compiled to generate binary
> > > code for each ISA. We should investigate whether the new approach in
> > > xsimd will provide what we need or if we need to take a different
> > > approach.
> > >
> > > * We have some of our own dynamic dispatch code to enable runtime
> > > function pointer selection based on available SIMD levels. Can we
> > > benefit from any of the work that is happening in this xsimd refactor?
> >
> > I think they have some overlap. Runtime dispatch at the xsimd level (SIMD
> > code block) looks better than at the kernel dispatch level, IIUC.
> >
> > >
> > > * We have some compute code (e.g. hash tables for aggregation / joins)
> > > that uses explicit AVX2 intrinsics — can some of this code be ported
> > > to use generic xsimd APIs or will we need to use a different
> > > fundamental algorithm design to yield maximum efficiency for each SIMD
> > > ISA?
> > >
> > > Thanks,
> > > Wes
> > >
> >

Re: [DISCUSS][C++] Strategies for SIMD cross-compilation?

Posted by Yuqi Gu <gu...@apache.org>.
> So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch,
> you would do xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g.
> neon/neon64 for ARM ISAs) and then access the batch size through the
> batch::size static property.

Glad to see xsimd use 'Arch' as a parameter of 'batch'.
For ARROW-11502 <https://github.com/apache/arrow/pull/9424>, I've
submitted several PRs to xsimd to hide arch-dependent code from Arrow
to avoid a large maintenance burden.
But I found that it's hard to design an arch-independent API for a
specific feature that covers all the different ISAs.
Some features exist on x86 but not on Arm64, and vice versa;
unifying these differences would add to the maintenance burden.

I agree with Yibo that we should use the new xsimd approach for
dynamic runtime dispatch for each different CPU support level.

BRs,
Yuqi

Yibo Cai <yi...@arm.com> wrote on Monday, July 19, 2021 at 10:55 AM:

>
>
> On 7/17/21 12:08 AM, Wes McKinney wrote:
> > hi folks,
> >
> > I had a conversation with the developers of xsimd last week in Paris
> > and was made aware that they are working on a substantial refactor of
> > xsimd to improve its usability for cross-compilation and
> > dynamic-dispatch based on runtime processor capabilities. The branch
> > with the refactor is located here:
> >
> > https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring
> >
> > In particular, the simd batch API is changing from
> >
> > template <class T, size_t N>
> > class batch;
> >
> > to
> >
> > template <class T, class arch>
> > class batch;
> >
> > So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch,
> > you would do xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g.
> > neon/neon64 for ARM ISAs) and then access the batch size through the
> > batch::size static property.
>
> Adding this 'arch' parameter is a bit strange at first glance, given that
> the purpose of a SIMD wrapper is to hide arch-dependent code.
> But as the latest SIMD ISAs (SVE, AVX512) have much richer features than
> simply widening the data width, it looks like arch-specific code is a must.
> I think this change won't cause trouble for existing xsimd client code.
>
> >
> > A few comments for discussion / investigation:
> >
> > * Firstly, we will have to prepare ourselves to migrate to this new
> > API in the future
> >
> > * At some point, we will likely want to generate SIMD-variants of our
> > C++ math kernels usable via dynamic dispatch for each different CPU
> > support level. It would be beneficial to author as much code in an
> > ISA-independent fashion that can be cross-compiled to generate binary
> > code for each ISA. We should investigate whether the new approach in
> > xsimd will provide what we need or if we need to take a different
> > approach.
> >
> > * We have some of our own dynamic dispatch code to enable runtime
> > function pointer selection based on available SIMD levels. Can we
> > benefit from any of the work that is happening in this xsimd refactor?
>
> I think they have some overlap. Runtime dispatch at the xsimd level (SIMD
> code block) looks better than at the kernel dispatch level, IIUC.
>
> >
> > * We have some compute code (e.g. hash tables for aggregation / joins)
> > that uses explicit AVX2 intrinsics — can some of this code be ported
> > to use generic xsimd APIs or will we need to use a different
> > fundamental algorithm design to yield maximum efficiency for each SIMD
> > ISA?
> >
> > Thanks,
> > Wes
> >
>

Re: [DISCUSS][C++] Strategies for SIMD cross-compilation?

Posted by Yibo Cai <yi...@arm.com>.

On 7/17/21 12:08 AM, Wes McKinney wrote:
> hi folks,
> 
> I had a conversation with the developers of xsimd last week in Paris
> and was made aware that they are working on a substantial refactor of
> xsimd to improve its usability for cross-compilation and
> dynamic-dispatch based on runtime processor capabilities. The branch
> with the refactor is located here:
> 
> https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring
> 
> In particular, the simd batch API is changing from
> 
> template <class T, size_t N>
> class batch;
> 
> to
> 
> template <class T, class arch>
> class batch;
> 
> So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch,
> you would do xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g.
> neon/neon64 for ARM ISAs) and then access the batch size through the
> batch::size static property.

Adding this 'arch' parameter is a bit strange at first glance, given that
the purpose of a SIMD wrapper is to hide arch-dependent code.
But as the latest SIMD ISAs (SVE, AVX512) have much richer features than
simply widening the data width, it looks like arch-specific code is a must.
I think this change won't cause trouble for existing xsimd client code.

> 
> A few comments for discussion / investigation:
> 
> * Firstly, we will have to prepare ourselves to migrate to this new
> API in the future
> 
> * At some point, we will likely want to generate SIMD-variants of our
> C++ math kernels usable via dynamic dispatch for each different CPU
> support level. It would be beneficial to author as much code in an
> ISA-independent fashion that can be cross-compiled to generate binary
> code for each ISA. We should investigate whether the new approach in
> xsimd will provide what we need or if we need to take a different
> approach.
> 
> * We have some of our own dynamic dispatch code to enable runtime
> function pointer selection based on available SIMD levels. Can we
> benefit from any of the work that is happening in this xsimd refactor?

I think they have some overlap. Runtime dispatch at the xsimd level (SIMD
code block) looks better than at the kernel dispatch level, IIUC.

> 
> * We have some compute code (e.g. hash tables for aggregation / joins)
> that uses explicit AVX2 intrinsics — can some of this code be ported
> to use generic xsimd APIs or will we need to use a different
> fundamental algorithm design to yield maximum efficiency for each SIMD
> ISA?
> 
> Thanks,
> Wes
>