Posted to user@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2022/07/06 15:34:17 UTC

Re: Arrow compute/dataset design doc missing

I definitely think that adding runtime SIMD dispatching for arithmetic
functions (and using xsimd to write generic kernels that can be
cross-compiled for different SIMD targets, e.g. AVX2, AVX-512, NEON)
is a good idea, and hopefully it will be pretty low-hanging fruit (~a
day or two of work) to make applications faster across the board.
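
To sketch the idea (purely illustrative, not code from the Arrow tree;
the function and struct names here are made up): with xsimd the kernel
body is written once against an architecture-templated batch type, and
xsimd::dispatch picks among whichever architecture variants were
compiled into the binary based on the CPU at runtime:

```
#include <cstddef>
#include <cstdint>

#include "xsimd/xsimd.hpp"

// Hypothetical element-wise add kernel: the body is written once and
// instantiated per SIMD architecture via the Arch template parameter.
struct AddInt64 {
  template <class Arch>
  void operator()(Arch, const std::int64_t* a, const std::int64_t* b,
                  std::int64_t* out, std::size_t n) const {
    using batch = xsimd::batch<std::int64_t, Arch>;
    constexpr std::size_t lanes = batch::size;
    std::size_t i = 0;
    // Main loop: one SIMD add per `lanes` elements.
    for (; i + lanes <= n; i += lanes) {
      auto va = batch::load_unaligned(a + i);
      auto vb = batch::load_unaligned(b + i);
      (va + vb).store_unaligned(out + i);
    }
    // Scalar tail for the leftover elements.
    for (; i < n; ++i) out[i] = a[i] + b[i];
  }
};

void AddInt64Dispatch(const std::int64_t* a, const std::int64_t* b,
                      std::int64_t* out, std::size_t n) {
  // Selects the best variant among the architectures this binary was
  // built for (which variants exist depends on the compiler flags).
  auto impl = xsimd::dispatch(AddInt64{});
  impl(a, b, out, n);
}
```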

On Thu, Jun 9, 2022 at 1:45 PM Sasha Krassovsky
<kr...@gmail.com> wrote:
>
> To add on to Weston’s response, the only SIMD that compilers will currently generate for the kernels is SSE4.2; they will not generate AVX2 because we have not set up the compiler flags for that. Also, the way the code is written doesn’t seem super easy for the compiler to vectorize and needs a bit of massaging. I do plan to tackle this at some point after the per-kernel overhead work that Weston mentioned is complete.
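>
> As a purely illustrative sketch (not the actual kernel code, just the
> general kind of massaging involved): a data-dependent branch on the
> validity bitmap inside the hot loop tends to defeat auto-vectorization,
> whereas an unconditional loop over the values, with the validity bitmap
> carried along separately, is usually vectorized by compilers at -O2/-O3:
>
> ```
> #include <cstddef>
> #include <cstdint>
>
> // Hard for the compiler: a per-element branch on the validity bitmap.
> void AddWithBranch(const std::int64_t* a, const std::int64_t* b,
>                    const std::uint8_t* validity, std::int64_t* out,
>                    std::size_t n) {
>   for (std::size_t i = 0; i < n; ++i) {
>     if (validity[i / 8] & (1 << (i % 8))) {
>       out[i] = a[i] + b[i];
>     }
>   }
> }
>
> // Easy for the compiler: compute every slot unconditionally and keep
> // the validity bitmap as-is; consumers that check the bitmap simply
> // ignore the values at null positions.
> void AddBranchFree(const std::int64_t* a, const std::int64_t* b,
>                    std::int64_t* out, std::size_t n) {
>   for (std::size_t i = 0; i < n; ++i) {
>     out[i] = a[i] + b[i];
>   }
> }
> ```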
>
> Sasha Krassovsky
>
> > On Jun 9, 2022, at 8:42 AM, Shawn Yang <sh...@gmail.com> wrote:
> >
> > I see, thanks. I'll do more tests and dive into more arrow compute code.
> >
> > Sent from my iPhone
> >
> >> On Jun 9, 2022, at 5:30 PM, Weston Pace <we...@gmail.com> wrote:
> >>
> >>
> >>>
> >>> Hi, do you guys know which functions support vectorized SIMD in arrow compute?
> >>
> >> I don't know that anyone has done a fully systematic analysis of which
> >> kernels support and do not support SIMD at the moment.  The kernels
> >> are still in flux.  There is an active effort to reduce overhead[1]
> >> which is the top priority as this could possibly have more impact on
> >> performance than SIMD when running expressions involving multiple
> >> kernels across multiple threads.
> >>
> >>> I only found that very few functions support vectorized SIMD:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Do scalar operations support vectorized SIMD?
> >>
> >> A lack of explicit vectorization instructions does not mean a lack of
> >> SIMD support.  For many kernels we expect modern compilers to be smart
> >> enough to automatically implement vectorization as long as the data is
> >> provided in a vectorized fashion (e.g. columnar) and the kernel is
> >> simple enough.  For more complex kernels there are options such as
> >> xsimd but this hasn't yet been very thoroughly explored.  At the
> >> moment I'm not aware of anyone writing explicitly vectorized kernels
> >> as this tends to be rather hardware-specific and to have a small return
> >> on investment.  Instead, we benchmark regularly and have
> >> micro-optimized certain critical sections (e.g. some of the hash
> >> stuff).
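> >>
> >> To make that concrete (a generic illustration, not Arrow code): a
> >> simple loop over a contiguous column is exactly the shape compilers
> >> auto-vectorize well, while the same arithmetic over a row-oriented
> >> layout is much harder for them to vectorize efficiently:
> >>
> >> ```
> >> #include <cstddef>
> >> #include <cstdint>
> >>
> >> // Columnar layout: contiguous values, a loop that modern compilers
> >> // auto-vectorize at -O2/-O3.
> >> void ScaleColumn(std::int64_t* values, std::size_t n, std::int64_t factor) {
> >>   for (std::size_t i = 0; i < n; ++i) {
> >>     values[i] *= factor;
> >>   }
> >> }
> >>
> >> // Row-oriented layout: the same arithmetic, but the stride between
> >> // successive x fields makes efficient vectorization much harder.
> >> struct Row { std::int64_t x; std::int64_t y; };
> >> void ScaleRows(Row* rows, std::size_t n, std::int64_t factor) {
> >>   for (std::size_t i = 0; i < n; ++i) {
> >>     rows[i].x *= factor;
> >>   }
> >> }
> >> ```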
> >>
> >>> I tested with numpy and found arrow is ten times slower:
> >>
> >> That result you posted appears to be 3.5x slower.  You might want to
> >> double check and ensure that Arrow was compiled with the appropriate
> >> architecture (the cmake files are generally good at figuring this out)
> >> but I wouldn't be too surprised if this was the case.  Some of this
> >> might be unavoidable.  For example, does numpy support null values (I
> >> don't know for sure but I seem to recall it does not)?  Some of this
> >> might be an inefficiency or overhead problem in Arrow-C++.  It is
> >> possible that the add kernel is not being vectorized correctly by the
> >> compiler but I don't think those numbers alone are enough proof of
> >> that.
> >>
> >> Performance can be quite tricky.  It is important for us but Arrow's
> >> compute functionality is still relatively new compared to numpy and
> >> work on performance is balanced with work on features.
> >>
> >> [1] https://lists.apache.org/thread/rh10ykcolt0gxydhgv4vxk2m7ktwx5mh
> >>
> >>> On Wed, Jun 8, 2022 at 11:08 PM Shawn Yang <sh...@gmail.com> wrote:
> >>>
> >>> Hi, do you guys know which functions support vectorized SIMD in arrow compute? After a quick look at the arrow compute cpp code, I only found that very few functions support vectorized SIMD:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Do scalar operations support vectorized SIMD?
> >>>
> >>> I tested with numpy and found arrow is ten times slower:
> >>>
> >>> import timeit
> >>>
> >>> import numpy as np
> >>> import pyarrow as pa
> >>>
> >>> def test_multiply(rows=5000000):
> >>>   a = pa.array(list(range(rows, 0, -1)))
> >>>   b = pa.array(list(range(rows)))
> >>>   import pyarrow.compute as pc
> >>>
> >>>   print("arrow multiply took", timeit.timeit(
> >>>       lambda: pc.multiply(a, b), number=3))
> >>>   a = np.array(list(range(rows, 0, -1)))
> >>>   b = np.array(list(range(rows)))
> >>>   print("numpy multiply took", timeit.timeit(
> >>>       lambda: a * b, number=3))
> >>>   # arrow multiply took 0.14826057000000015
> >>>   # numpy multiply took 0.04047051300000071
> >>>
> >>>
> >>>> On Wed, May 25, 2022 at 10:09 PM Shawn Yang <sh...@gmail.com> wrote:
> >>>>
> >>>> I see, the key to the multiple loop levels is to ensure the data can be held in the L2 cache, so that later
> >>>> calculations can process the batch without reading from main memory, and we can record the exec stats for every batch and do better local task scheduling based on those stats.  Thanks a lot. Morsels are new to me; very interesting idea.
> >>>> Sent from my iPhone
> >>>>
> >>>>> On May 25, 2022, at 7:23 AM, Weston Pace <we...@gmail.com> wrote:
> >>>>>
> >>>>> There are a few levels of loops.  Two at the moment and three in the
> >>>>> future.  Some are fused and some are not.  What we have right now is
> >>>>> at an early stage and not ideal, and there are people investigating and
> >>>>> working on improvements.  I can speak a little bit about where we want
> >>>>> to go.  An example may be helpful.
> >>>>>
> >>>>> For example, given a filter "x < 100 && x > 0" we have something like
> >>>>> (this is an approximation of the work done by
> >>>>> arrow::compute::ExecuteScalarExpression and not actual code):
> >>>>>
> >>>>> ```
> >>>>> for batch_of_128k_rows in data:
> >>>>>  auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
> >>>>>  auto gt_zero = greater_than(batch_of_128k_rows, 0)
> >>>>>  auto filter_pred = and(lt_one_hundred, gt_zero)
> >>>>>  consume(filter(batch_of_128k_rows, filter_pred))
> >>>>> ```
> >>>>>
> >>>>> There are two big things we need to fix here.  First,
> >>>>> `batch_of_128k_rows` is meant to be some percentage of one thread's
> >>>>> portion of the L3 cache.  This is a good unit of parallelism but it is
> >>>>> not ideal for processing because we'd rather use the L2 cache since we
> >>>>> are making three passes across `batch_of_128k_rows`.  Second, each of
> >>>>> those `auto ... =` lines is allocating new memory.  This is not ideal
> >>>>> because we'd like to avoid excess allocation if possible.
> >>>>>
> >>>>> To solve the first problem we are moving towards the "morsel/batch"
> >>>>> model[1].  This means we have two "batch" sizes.  The outer batch
> >>>>> (ironically, the morsel) is the largest and is the one used for
> >>>>> determining parallelism.  The inner batch should be smaller (size
> >>>>> based on L2).
> >>>>>
> >>>>> To solve the second problem a number of solutions have been proposed
> >>>>> (thread-local buffer pools, thread-local buffer stacks, etc.) and we
> >>>>> will hopefully adopt one at some point.  So the above code snippet
> >>>>> would hopefully become something like:
> >>>>>
> >>>>> ```
> >>>>> thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
> >>>>> thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
> >>>>> thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
> >>>>> for batch_of_128k_rows in data:
> >>>>>  for l2_sized_batch in batch_of_128k_rows:
> >>>>>      less_than(l2_sized_batch, 100, &lt_one_hundred)
> >>>>>      greater_than(l2_sized_batch, 0, &gt_zero)
> >>>>>      and(lt_one_hundred, gt_zero, &filter_pred)
> >>>>>      consume(filter(l2_sized_batch, filter_pred))
> >>>>> ```
> >>>>>
> >>>>> There is still a fair amount of work to do before we get here but I
> >>>>> hope this gives you some idea of the direction we are headed.
> >>>>>
> >>>>> [1] https://db.in.tum.de/~leis/papers/morsels.pdf
> >>>>>
> >>>>>> On Tue, May 24, 2022 at 6:27 AM Shawn Yang <sh...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Ian, thank you for your reply, which recaps the history of arrow compute. Those links are very valuable for understanding arrow compute internals. I took a quick look at those documents and will take a deeper look later. I have another question: does arrow compute support loop fusion, which executes multiple vectorized operands in one loop? This is very common in dataframe computing; our engine can extract those expressions into a dag/tree. If arrow compute supports loop fusion, the performance would be very promising.
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On May 24, 2022, at 4:49 AM, Ian Cook <ia...@ursacomputing.com> wrote:
> >>>>>>>
> >>>>>>> Hi Shawn,
> >>>>>>>
> >>>>>>> In March of 2021, when major work on the C++ query execution machinery
> >>>>>>> in Arrow was beginning, Wes sent a message [1] to the dev list and
> >>>>>>> linked to a doc [2] with some details about the planned design. A few
> >>>>>>> months later Neal sent an update [3] about this work. However those
> >>>>>>> documents are now somewhat out of date. More recently, Wes shared
> >>>>>>> another update [4] and linked to a doc [5] regarding task execution /
> >>>>>>> control flow / scheduling. However I think the best source of
> >>>>>>> information is the doc you linked to. The query execution work has
> >>>>>>> proceeded organically with many contributors, and efforts to document
> >>>>>>> the overall design in sufficient detail have not kept pace.
> >>>>>>>
> >>>>>>> Regarding benchmarks: There has been extensive work done using
> >>>>>>> Conbench [6] as part of the Arrow CI infrastructure to benchmark
> >>>>>>> commits, for purposes of avoiding / identifying performance
> >>>>>>> regressions and measuring efforts to improve performance. However I am
> >>>>>>> not aware of any efforts to produce and publicly share benchmarks for
> >>>>>>> the purpose of comparing performance vs. other query engines.
> >>>>>>>
> >>>>>>> There is a proposal [7] to give the name "Acero" to the Arrow C++
> >>>>>>> compute engine, so in the future you will likely see it referred to by
> >>>>>>> that name. I think that having a clearer name for this will motivate
> >>>>>>> more efforts to write and share more about it.
> >>>>>>>
> >>>>>>> Ian
> >>>>>>>
> >>>>>>> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
> >>>>>>> [2] https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
> >>>>>>> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
> >>>>>>> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
> >>>>>>> [5] https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
> >>>>>>> [6] https://conbench.ursa.dev
> >>>>>>> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang <sh...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi, I'm considering using arrow compute as an execution kernel for our distributed dataframe framework. I already read the great doc https://arrow.apache.org/docs/cpp/compute.html, but it is a usage doc. Is there any design doc, internals introduction, or benchmarks for arrow compute so I can quickly understand how arrow compute works, what it can do, and what it should be used for? Or should I just read the code in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> >>>>>>>>
> >>>>>>>>
>