Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/01 07:39:00 UTC

[GitHub] [arrow] jianxind opened a new pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

jianxind opened a new pull request #7314:
URL: https://github.com/apache/arrow/pull/7314


   1. Add an SSE intrinsic path for the aggregate sum dense kernel; the sparse part will follow in a later PR.
   2. Add build support to append the required compiler flag for SIMD code files.
   3. Register the SSE version at runtime based on the detected CPU features.
   4. Add a test case to cover overflow on the data types that need a right shift
      on the accumulator type.
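
   The overflow concern in item 4 can be illustrated with a small standalone sketch. The function names here are illustrative, not the kernel's actual code; the point is only that a narrow accumulator wraps around while a widened one stays exact:

```cpp
#include <cassert>
#include <cstdint>

// Summing int8 values into an int8 accumulator wraps around once the
// running total exceeds 127, so the kernel must widen the accumulator.
int8_t SumInt8Narrow(const int8_t* values, int64_t n) {
  int8_t sum = 0;
  for (int64_t i = 0; i < n; ++i) {
    sum = static_cast<int8_t>(sum + values[i]);  // narrowing: may wrap
  }
  return sum;
}

// Widened accumulator: exact for any realistic input length.
int64_t SumInt8Wide(const int8_t* values, int64_t n) {
  int64_t sum = 0;
  for (int64_t i = 0; i < n; ++i) sum += values[i];
  return sum;
}
```

   For example, 100 values of 100 sum to 10000, which is far outside the int8 range, so only the widened accumulator returns the correct total.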
   
   Signed-off-by: Frank Du <fr...@intel.com>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-647081849


   @jianxind please see my e-mail.  I think we should try to split this PR up into runtime dispatch using a slightly tweaked version of the current code, without intrinsics. I will take a closer look at what is being done for sparse, but if you aren't using BitBlockCounter we should incorporate that as a first step, so that most sums can delegate to Dense.
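
   For readers unfamiliar with the BitBlockCounter strategy mentioned here: the idea is to scan the validity bitmap in 64-bit blocks so that all-valid blocks can take the dense, null-check-free path. The sketch below is a standalone mimic of that strategy, not Arrow's actual BitBlockCounter API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sum doubles under a validity bitmap, delegating all-valid 64-value
// blocks to a dense loop (easy for the compiler to vectorize) and
// skipping all-null blocks entirely.
double SumWithValidity(const double* values, const uint8_t* validity,
                       int64_t length) {
  double sum = 0;
  int64_t i = 0;
  for (; i + 64 <= length; i += 64) {
    uint64_t word;
    std::memcpy(&word, validity + i / 8, sizeof(word));  // 64 validity bits
    if (word == ~uint64_t{0}) {
      for (int j = 0; j < 64; ++j) sum += values[i + j];  // dense block
    } else if (word != 0) {
      for (int j = 0; j < 64; ++j) {                      // mixed block
        if (word & (uint64_t{1} << j)) sum += values[i + j];
      }
    }  // all-null block: nothing to add
  }
  for (; i < length; ++i) {  // tail, bit by bit
    if (validity[i >> 3] & (1 << (i & 7))) sum += values[i];
  }
  return sum;
}
```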





[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643959192


   [AMD64 Ubuntu 18.04 C++ Benchmark (#111968)](https://ci.ursalabs.org/#builders/73/builds/74) builder has succeeded.
   
   Revision: f0f0fc1c44a1e79970afea97a540bc44ffbc9d66
   
   ```diff
     ===========================  ==============  ===============  ========
     benchmark                    baseline        contender        change
     ===========================  ==============  ===============  ========
     SumKernelDouble/32768/10     8.514 GiB/sec   11.180 GiB/sec   31.311%
     SumKernelFloat/32768/100     4.818 GiB/sec   6.487 GiB/sec    34.653%
     SumKernelInt64/32768/10000   11.216 GiB/sec  17.700 GiB/sec   57.811%
   - SumKernelInt8/32768/10       1.516 GiB/sec   647.895 MiB/sec  -58.274%
     SumKernelDouble/32768/0      10.584 GiB/sec  16.200 GiB/sec   53.064%
     SumKernelDouble/32768/100    9.574 GiB/sec   11.165 GiB/sec   16.611%
     SumKernelInt64/32768/2       8.408 GiB/sec   11.962 GiB/sec   42.275%
     SumKernelInt64/32768/100     10.523 GiB/sec  11.937 GiB/sec   13.440%
   - SumKernelInt16/32768/10000   4.707 GiB/sec   1.679 GiB/sec    -64.334%
   - SumKernelInt16/32768/2       3.029 GiB/sec   1.681 GiB/sec    -44.494%
     SumKernelInt32/32768/2       5.340 GiB/sec   6.837 GiB/sec    28.049%
   - SumKernelInt8/32768/100      2.296 GiB/sec   648.265 MiB/sec  -72.424%
     SumKernelInt32/32768/10000   5.242 GiB/sec   12.554 GiB/sec   139.507%
   - SumKernelInt8/32768/1        1.670 GiB/sec   650.501 MiB/sec  -61.956%
     SumKernelInt16/32768/0       2.917 GiB/sec   8.116 GiB/sec    178.227%
   - SumKernelInt8/32768/2        1.633 GiB/sec   647.953 MiB/sec  -61.258%
   - SumKernelInt8/32768/10000    2.647 GiB/sec   648.361 MiB/sec  -76.083%
     SumKernelInt32/32768/0       5.237 GiB/sec   12.586 GiB/sec   140.334%
     SumKernelInt32/32768/10      6.134 GiB/sec   6.844 GiB/sec    11.577%
   - SumKernelInt16/32768/100     3.937 GiB/sec   1.677 GiB/sec    -57.406%
     SumKernelInt8/32768/0        1.772 GiB/sec   4.985 GiB/sec    181.275%
     SumKernelInt32/32768/1       5.466 GiB/sec   7.039 GiB/sec    28.774%
     SumKernelInt32/32768/100     7.199 GiB/sec   6.858 GiB/sec    -4.747%
     SumKernelInt64/32768/1       8.591 GiB/sec   12.356 GiB/sec   43.822%
     SumKernelFloat/32768/10000   6.387 GiB/sec   11.708 GiB/sec   83.320%
     SumKernelDouble/32768/1      8.292 GiB/sec   11.524 GiB/sec   38.977%
     SumKernelFloat/32768/0       6.367 GiB/sec   11.597 GiB/sec   82.144%
     SumKernelInt64/32768/10      9.388 GiB/sec   11.948 GiB/sec   27.258%
     SumKernelDouble/32768/2      7.300 GiB/sec   11.174 GiB/sec   53.070%
     SumKernelFloat/32768/10      4.043 GiB/sec   6.488 GiB/sec    60.492%
     SumKernelInt64/32768/0       11.229 GiB/sec  17.780 GiB/sec   58.336%
   - SumKernelInt16/32768/1       3.098 GiB/sec   1.696 GiB/sec    -45.248%
   - SumKernelInt16/32768/10      3.364 GiB/sec   1.678 GiB/sec    -50.122%
     SumKernelFloat/32768/2       2.474 GiB/sec   6.453 GiB/sec    160.789%
     SumKernelDouble/32768/10000  10.535 GiB/sec  16.165 GiB/sec   53.444%
     SumKernelFloat/32768/1       3.863 GiB/sec   6.663 GiB/sec    72.458%
     ===========================  ==============  ===============  ========
   ```





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-647334741


   > I think we also need a way of setting max runtime instruction set for runtime dispatch (apologies if there is one and I missed it)
   
   Thanks, currently there is no such way. But it can easily be supported by hooking into CpuInfo::Init() and tweaking hardware_flags_ based on a user environment variable, e.g. "ARROW_RUNTIME_SIMD_LEVEL=none make unittest".
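
   A minimal sketch of that suggestion, assuming hypothetical flag constants and a hypothetical helper name (Arrow's actual CpuInfo API differs):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical feature bits, standing in for whatever CpuInfo detects.
constexpr int64_t kFlagSse42 = 1 << 0;
constexpr int64_t kFlagAvx2 = 1 << 1;
constexpr int64_t kFlagAvx512 = 1 << 2;

// Cap the detected hardware flags according to an environment variable,
// as could be done from inside CpuInfo::Init().
int64_t CapRuntimeSimdLevel(int64_t detected_flags) {
  const char* env = std::getenv("ARROW_RUNTIME_SIMD_LEVEL");
  if (env == nullptr) return detected_flags;  // no cap requested
  const std::string level(env);
  if (level == "none") return 0;  // force the scalar paths
  if (level == "sse4_2") return detected_flags & kFlagSse42;
  if (level == "avx2") return detected_flags & (kFlagSse42 | kFlagAvx2);
  return detected_flags;  // unrecognized value: leave flags unchanged
}
```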
   
   For the split PR without intrinsics, I will try to work on this. Currently I use compiler pragmas (clang attribute push or GCC push_options) to support the intrinsic APIs; the macro is defined to nothing for MSVC, since MSVC can use intrinsic APIs directly without a compiler flag. I need to find a new method for the intrinsic-free approach; maybe something like set_source_files_properties can help here.





[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645135876


   [AMD64 Ubuntu 18.04 C++ Benchmark (#112762)](https://ci.ursalabs.org/#builders/73/builds/79) builder has succeeded.
   
   Revision: 525caea882fe49c0248932fff77df6bcd3f2f477
   
   ```diff
     ===========================  =============  ===============  ========
     benchmark                    baseline       contender        change
     ===========================  =============  ===============  ========
     SumKernelFloat/32768/10000   5.958 GiB/sec  10.208 GiB/sec   71.337%
     SumKernelInt64/32768/1       7.938 GiB/sec  10.783 GiB/sec   35.851%
     SumKernelInt32/32768/0       4.958 GiB/sec  10.832 GiB/sec   118.486%
     SumKernelInt64/32768/0       9.704 GiB/sec  14.764 GiB/sec   52.146%
     SumKernelFloat/32768/2       2.389 GiB/sec  5.767 GiB/sec    141.401%
   - SumKernelInt16/32768/2       2.953 GiB/sec  1.696 GiB/sec    -42.565%
     SumKernelDouble/32768/0      9.462 GiB/sec  13.540 GiB/sec   43.102%
     SumKernelInt32/32768/2       5.074 GiB/sec  6.403 GiB/sec    26.201%
   - SumKernelInt16/32768/10000   4.520 GiB/sec  1.693 GiB/sec    -62.552%
     SumKernelFloat/32768/100     4.644 GiB/sec  5.807 GiB/sec    25.051%
     SumKernelInt64/32768/10000   9.778 GiB/sec  14.784 GiB/sec   51.203%
     SumKernelDouble/32768/1      7.595 GiB/sec  10.131 GiB/sec   33.400%
     SumKernelFloat/32768/1       3.920 GiB/sec  5.938 GiB/sec    51.490%
     SumKernelInt64/32768/10      8.461 GiB/sec  10.548 GiB/sec   24.655%
   - SumKernelInt8/32768/10       1.680 GiB/sec  674.552 MiB/sec  -60.783%
     SumKernelDouble/32768/10     7.748 GiB/sec  9.813 GiB/sec    26.645%
   - SumKernelInt8/32768/1        1.643 GiB/sec  678.806 MiB/sec  -59.643%
     SumKernelInt8/32768/0        1.745 GiB/sec  4.709 GiB/sec    169.817%
     SumKernelInt32/32768/10000   4.959 GiB/sec  10.884 GiB/sec   119.497%
     SumKernelInt64/32768/100     9.294 GiB/sec  10.545 GiB/sec   13.466%
   - SumKernelInt8/32768/100      2.242 GiB/sec  674.145 MiB/sec  -70.638%
     SumKernelInt64/32768/2       7.674 GiB/sec  10.553 GiB/sec   37.529%
     SumKernelInt16/32768/0       2.850 GiB/sec  7.599 GiB/sec    166.670%
     SumKernelInt32/32768/10      5.754 GiB/sec  6.423 GiB/sec    11.627%
     SumKernelInt32/32768/1       5.213 GiB/sec  6.577 GiB/sec    26.172%
     SumKernelDouble/32768/100    8.731 GiB/sec  9.829 GiB/sec    12.578%
     SumKernelFloat/32768/0       5.960 GiB/sec  10.156 GiB/sec   70.394%
   - SumKernelInt16/32768/100     4.099 GiB/sec  1.694 GiB/sec    -58.674%
   - SumKernelInt8/32768/10000    2.580 GiB/sec  674.639 MiB/sec  -74.465%
     SumKernelDouble/32768/2      6.864 GiB/sec  9.912 GiB/sec    44.416%
     SumKernelInt32/32768/100     6.731 GiB/sec  6.439 GiB/sec    -4.334%
   - SumKernelInt8/32768/2        1.604 GiB/sec  674.265 MiB/sec  -58.959%
   - SumKernelInt16/32768/10      3.491 GiB/sec  1.694 GiB/sec    -51.473%
     SumKernelFloat/32768/10      3.924 GiB/sec  5.794 GiB/sec    47.673%
     SumKernelDouble/32768/10000  9.531 GiB/sec  13.404 GiB/sec   40.636%
   - SumKernelInt16/32768/1       3.021 GiB/sec  1.710 GiB/sec    -43.392%
     ===========================  =============  ===============  ========
   ```





[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643115702


   [AMD64 Ubuntu 18.04 C++ Benchmark (#111432)](https://ci.ursalabs.org/#builders/73/builds/73) builder has succeeded.
   
   Revision: 23aee3e5f6fc4c3fac7a859df7c52f79033c1c3c
   
   ```diff
     ===========================  ==============  ==============  ========
     benchmark                    baseline        contender       change
     ===========================  ==============  ==============  ========
     SumKernelFloat/32768/0       6.432 GiB/sec   11.724 GiB/sec  82.266%
     SumKernelFloat/32768/2       2.437 GiB/sec   5.917 GiB/sec   142.784%
   - SumKernelInt32/32768/100     7.333 GiB/sec   6.729 GiB/sec   -8.233%
     SumKernelInt8/32768/10       1.473 GiB/sec   1.904 GiB/sec   29.274%
     SumKernelDouble/32768/1      8.371 GiB/sec   9.245 GiB/sec   10.443%
     SumKernelInt16/32768/10000   4.770 GiB/sec   4.791 GiB/sec   0.442%
     SumKernelInt16/32768/1       3.122 GiB/sec   3.160 GiB/sec   1.234%
     SumKernelInt64/32768/10000   11.233 GiB/sec  17.690 GiB/sec  57.480%
   - SumKernelDouble/32768/100    9.800 GiB/sec   8.990 GiB/sec   -8.265%
     SumKernelInt32/32768/1       5.552 GiB/sec   6.909 GiB/sec   24.431%
   - SumKernelInt64/32768/100     10.468 GiB/sec  9.236 GiB/sec   -11.768%
     SumKernelInt64/32768/1       8.737 GiB/sec   9.467 GiB/sec   8.361%
     SumKernelFloat/32768/1       3.921 GiB/sec   6.064 GiB/sec   54.637%
     SumKernelFloat/32768/10      4.075 GiB/sec   5.926 GiB/sec   45.401%
     SumKernelFloat/32768/100     4.894 GiB/sec   5.928 GiB/sec   21.126%
     SumKernelInt16/32768/100     4.284 GiB/sec   4.452 GiB/sec   3.937%
     SumKernelInt32/32768/10000   5.280 GiB/sec   12.466 GiB/sec  136.108%
     SumKernelDouble/32768/0      10.742 GiB/sec  16.079 GiB/sec  49.674%
     SumKernelInt8/32768/10000    2.670 GiB/sec   2.675 GiB/sec   0.166%
     SumKernelInt64/32768/0       11.205 GiB/sec  17.720 GiB/sec  58.141%
     SumKernelInt64/32768/2       8.471 GiB/sec   9.229 GiB/sec   8.946%
     SumKernelInt16/32768/0       2.931 GiB/sec   8.214 GiB/sec   180.229%
     SumKernelInt8/32768/0        1.782 GiB/sec   5.000 GiB/sec   180.532%
     SumKernelDouble/32768/2      7.364 GiB/sec   8.942 GiB/sec   21.436%
     SumKernelInt8/32768/100      2.308 GiB/sec   2.424 GiB/sec   5.042%
     SumKernelDouble/32768/10     8.647 GiB/sec   9.001 GiB/sec   4.090%
     SumKernelInt16/32768/10      3.547 GiB/sec   3.610 GiB/sec   1.781%
     SumKernelInt8/32768/1        1.678 GiB/sec   1.686 GiB/sec   0.483%
     SumKernelFloat/32768/10000   6.445 GiB/sec   11.534 GiB/sec  78.960%
     SumKernelDouble/32768/10000  10.762 GiB/sec  16.131 GiB/sec  49.884%
     SumKernelInt64/32768/10      9.395 GiB/sec   9.240 GiB/sec   -1.652%
     SumKernelInt8/32768/2        1.640 GiB/sec   1.648 GiB/sec   0.468%
     SumKernelInt16/32768/2       3.052 GiB/sec   3.088 GiB/sec   1.177%
     SumKernelInt32/32768/0       5.283 GiB/sec   12.487 GiB/sec  136.345%
     SumKernelInt32/32768/10      6.192 GiB/sec   6.697 GiB/sec   8.159%
     SumKernelInt32/32768/2       5.438 GiB/sec   6.727 GiB/sec   23.720%
     ===========================  ==============  ==============  ========
   ```





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-636711082


   @wesm Can you help review the approach for the runtime kernel path?
   
   For the build part, I introduced some compiler options (GCC push_options or clang attribute push) to build the SIMD code without defining a common SIMD flag in the CXX flags.
   
   For the SSE implementation, currently only the dense part is implemented with intrinsics; I still need to work out how to skip the invalid (null) values.
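
   A hedged sketch of that pragma mechanism. The macro name mirrors the TARGET_CODE_START_SSE4_2 marker visible in the patch, but the exact definitions here are illustrative, not Arrow's:

```cpp
#include <cassert>

// Per-function/per-region target options: GCC and clang need them to
// compile SSE4.2 intrinsics when CXXFLAGS do not enable SSE4.2; MSVC
// (and non-x86 targets, for this sketch) need nothing.
#if defined(_MSC_VER) || (!defined(__x86_64__) && !defined(__i386__))
#define TARGET_CODE_START_SSE4_2
#define TARGET_CODE_STOP
#elif defined(__clang__)
#define TARGET_CODE_START_SSE4_2 \
  _Pragma("clang attribute push(__attribute__((target(\"sse4.2\"))), apply_to=function)")
#define TARGET_CODE_STOP _Pragma("clang attribute pop")
#else  // GCC
#define TARGET_CODE_START_SSE4_2 \
  _Pragma("GCC push_options") _Pragma("GCC target(\"sse4.2\")")
#define TARGET_CODE_STOP _Pragma("GCC pop_options")
#endif

TARGET_CODE_START_SSE4_2
// Functions defined in this region may use SSE4.2 intrinsics even though
// the global build flags do not enable SSE4.2. (Trivial body here, just
// to show the mechanism compiles.)
int AddInts(int a, int b) { return a + b; }
TARGET_CODE_STOP
```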





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-650865605


   Closing this one as a reference for the intrinsic approach. I will try to work out a new approach without intrinsics.





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643950633


   @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark





[GitHub] [arrow] github-actions[bot] commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-636673371


   https://issues.apache.org/jira/browse/ARROW-8996





[GitHub] [arrow] pitrou commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639445690


   Certainly someone (or several people) would have to evaluate those libraries before deciding whether to integrate one. Preferably someone with an interest in SIMD optimizations (for example you and/or @cyb70289).
   
   But saying we don't want to use a helper library because the code might be suboptimal performance-wise and it's better to rewrite it from scratch doesn't sound like good engineering practice to me. Daily Arrow maintenance is done by a small set of developers and we should refrain from creating too much burden.
   





[GitHub] [arrow] wesm commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-638972317


   I agree that it becomes problematic (from a code size, testing, and maintenance perspective) to have SSE/AVX versions of every kernel. Having optimized versions of some important kernels that are known to use a lot of CPU cycles in applications is not a bad idea, though.
   
   SIMD versions of kernels could also be moved to a "plugin" shared library so that they can be packaged separately.





[GitHub] [arrow] pitrou commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-641136788


   > Just found a document (https://dl.acm.org/doi/pdf/10.1145/3178433.3178435, page 7, Table 6: Comparison of various SIMD wrappers). Some SIMD helpers (simdpp/xsimd) have performance issues, at least on some workloads.
   
   Interesting comparison, thank you. But that is a single workload and we don't know what the issue was precisely.
   
   By the way, this is what the paper says about the approach you're advocating:
   
   > The intrinsics or assembly approaches are non-portable, low-level solutions which target specific architectures. They offer maximum control to take advantage of instruction set specificities, and to fine tune register usage. However, it is quite difficult to develop and maintain a low-level code in the long run.
   





[GitHub] [arrow] jedbrown commented on a change in pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jedbrown commented on a change in pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#discussion_r437844193



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_sse.cc
##########
@@ -0,0 +1,355 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_aggregate.h"
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+#include "arrow/compute/kernels/aggregate_internal.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/align_util.h"
+#include "arrow/util/simd.h"
+
+TARGET_CODE_START_SSE4_2
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+// ----------------------------------------------------------------------
+// Sum implementation for SSE
+
+// Each __m128 stream handles 2 double/int64 accumulator values; one batch has 4 streams.
+static constexpr int kSseBatchStreams = 4;
+static constexpr int kSseStreamSize = sizeof(__m128d) / sizeof(double);
+static constexpr int kSseBatchSize = kSseBatchStreams * kSseStreamSize;
+
+// Default scalar version
+template <typename T, typename SumT>
+inline SumResult<SumT> SumDenseBatchSse(const T* values, int64_t num_batch) {
+  SumResult<SumT> sum_result;
+  SumT sum_streams[kSseBatchSize] = {0};
+
+  // Add the results by streams
+  for (int64_t batch = 0; batch < num_batch; batch++) {
+    for (int i = 0; i < kSseBatchSize; i++) {
+      sum_streams[i] += values[(batch * kSseBatchSize) + i];
+    }
+  }
+
+  // Aggregate the result streams
+  for (int i = 0; i < kSseBatchSize; i++) {
+    sum_result.sum += sum_streams[i];
+  }
+  sum_result.count = num_batch * kSseBatchSize;
+  return sum_result;
+}
+
+// Dense helper for the case where the accumulator type is the same as the data type
+#define SUM_DENSE_BATCH_SSE_DIRECT(Type, SumSimdType, SimdZeroFn, SimdLoadFn, SimdAddFn) \
+  template <>                                                                            \
+  inline SumResult<Type> SumDenseBatchSse(const Type* values, int64_t num_batch) {       \
+    SumResult<Type> sum_result;                                                          \
+    SumSimdType results_simd[kSseBatchStreams];                                          \
+    for (int i = 0; i < kSseBatchStreams; i++) {                                         \
+      results_simd[i] = SimdZeroFn();                                                    \
+    }                                                                                    \
+                                                                                         \
+    /* Add the values to result streams */                                               \
+    for (int64_t batch = 0; batch < num_batch; batch++) {                                \
+      for (int i = 0; i < kSseBatchStreams; i++) {                                       \
+        const auto src_simd =                                                            \
+            SimdLoadFn(&values[batch * kSseBatchSize + kSseStreamSize * i]);             \
+        results_simd[i] = SimdAddFn(src_simd, results_simd[i]);                          \
+      }                                                                                  \
+    }                                                                                    \

Review comment:
       You can use GCC pragmas to force more unrolling, but AVX512 vectorization gets you a ways despite the adder instruction latency. (My experience has been that GCC does an excellent job with intrinsics, often better than ICC, and tends to favor smaller code at comparable optimization levels. Clang tends to be more aggressive than GCC at unrolling, though I've often seen that not pay off.)
   
   The masked case is harder and I'm not aware of an idiomatic way to get GCC to vectorize it. We get some masked vectorized code with clang, though it isn't as clean as I'd hope for.
   ```c
   double sum1(int64_t n, const unsigned char *mask, const double *values) {
       double sum = 0;
       #pragma omp simd reduction(+:sum) collapse(2)
       for (int64_t i = 0; i < n/8; i++) {
           for (int64_t j = 0; j < 8; j++) {
               sum += mask[i] & (1 << j) ? values[i*8+j] : 0;
           }
       }
       return sum;
   }
   ```
   
   https://gcc.godbolt.org/z/84anmr







[GitHub] [arrow] wesm commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637218125


   I'll try to review this in the next couple of days.





[GitHub] [arrow] jianxind removed a comment on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind removed a comment on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-636677517


   Benchmark data:
   
   Before:
   ```
   SumKernelFloat/32768/0         2.96 us         2.96 us       236912 bytes_per_second=10.3227G/s null_percent=0 size=32.768k
   SumKernelFloat/32768/1         4.88 us         4.88 us       143527 bytes_per_second=6.25439G/s null_percent=1 size=32.768k
   SumKernelFloat/32768/10        5.13 us         5.13 us       136839 bytes_per_second=5.95117G/s null_percent=10 size=32.768k
   SumKernelFloat/32768/50        7.82 us         7.81 us        87129 bytes_per_second=3.9054G/s null_percent=50 size=32.768k
   SumKernelDouble/32768/0        1.97 us         1.97 us       356786 bytes_per_second=15.4906G/s null_percent=0 size=32.768k
   SumKernelDouble/32768/1        2.11 us         2.11 us       331511 bytes_per_second=14.4975G/s null_percent=1 size=32.768k
   SumKernelDouble/32768/10       2.39 us         2.38 us       291292 bytes_per_second=12.7966G/s null_percent=10 size=32.768k
   SumKernelDouble/32768/50       2.60 us         2.60 us       268800 bytes_per_second=11.7462G/s null_percent=50 size=32.768k
   SumKernelInt8/32768/0          11.7 us         11.7 us        59926 bytes_per_second=2.61569G/s null_percent=0 size=32.768k
   SumKernelInt8/32768/1          11.0 us         10.9 us        63640 bytes_per_second=2.78831G/s null_percent=1 size=32.768k
   SumKernelInt8/32768/10         14.8 us         14.8 us        46573 bytes_per_second=2.05848G/s null_percent=10 size=32.768k
   SumKernelInt8/32768/50         14.6 us         14.6 us        47840 bytes_per_second=2.08905G/s null_percent=50 size=32.768k
   SumKernelInt16/32768/0         7.06 us         7.06 us        99354 bytes_per_second=4.3245G/s null_percent=0 size=32.768k
   SumKernelInt16/32768/1         4.76 us         4.75 us       147305 bytes_per_second=6.41928G/s null_percent=1 size=32.768k
   SumKernelInt16/32768/10        5.64 us         5.63 us       122737 bytes_per_second=5.42002G/s null_percent=10 size=32.768k
   SumKernelInt16/32768/50        6.71 us         6.70 us       104192 bytes_per_second=4.55206G/s null_percent=50 size=32.768k
   SumKernelInt32/32768/0         3.92 us         3.92 us       178798 bytes_per_second=7.79042G/s null_percent=0 size=32.768k

[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645130206


   @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-636677517


   Benchmark data:
   
   Before:
   ```
   SumKernelFloat/32768/0         2.96 us         2.96 us       236912 bytes_per_second=10.3227G/s null_percent=0 size=32.768k
   SumKernelFloat/32768/1         4.88 us         4.88 us       143527 bytes_per_second=6.25439G/s null_percent=1 size=32.768k
   SumKernelFloat/32768/10        5.13 us         5.13 us       136839 bytes_per_second=5.95117G/s null_percent=10 size=32.768k
   SumKernelFloat/32768/50        7.82 us         7.81 us        87129 bytes_per_second=3.9054G/s null_percent=50 size=32.768k
   SumKernelDouble/32768/0        1.97 us         1.97 us       356786 bytes_per_second=15.4906G/s null_percent=0 size=32.768k
   SumKernelDouble/32768/1        2.11 us         2.11 us       331511 bytes_per_second=14.4975G/s null_percent=1 size=32.768k
   SumKernelDouble/32768/10       2.39 us         2.38 us       291292 bytes_per_second=12.7966G/s null_percent=10 size=32.768k
   SumKernelDouble/32768/50       2.60 us         2.60 us       268800 bytes_per_second=11.7462G/s null_percent=50 size=32.768k
   SumKernelInt8/32768/0          11.7 us         11.7 us        59926 bytes_per_second=2.61569G/s null_percent=0 size=32.768k
   SumKernelInt8/32768/1          11.0 us         10.9 us        63640 bytes_per_second=2.78831G/s null_percent=1 size=32.768k
   SumKernelInt8/32768/10         14.8 us         14.8 us        46573 bytes_per_second=2.05848G/s null_percent=10 size=32.768k
   SumKernelInt8/32768/50         14.6 us         14.6 us        47840 bytes_per_second=2.08905G/s null_percent=50 size=32.768k
   SumKernelInt16/32768/0         7.06 us         7.06 us        99354 bytes_per_second=4.3245G/s null_percent=0 size=32.768k
   SumKernelInt16/32768/1         4.76 us         4.75 us       147305 bytes_per_second=6.41928G/s null_percent=1 size=32.768k
   SumKernelInt16/32768/10        5.64 us         5.63 us       122737 bytes_per_second=5.42002G/s null_percent=10 size=32.768k
   SumKernelInt16/32768/50        6.71 us         6.70 us       104192 bytes_per_second=4.55206G/s null_percent=50 size=32.768k
   SumKernelInt32/32768/0         3.92 us         3.92 us       178798 bytes_per_second=7.79042G/s null_percent=0 size=32.768k
   SumKernelInt32/32768/1         3.27 us         3.27 us       214296 bytes_per_second=9.332G/s null_percent=1 size=32.768k
   SumKernelInt32/32768/10        3.41 us         3.40 us       204944 bytes_per_second=8.9683G/s null_percent=10 size=32.768k
   SumKernelInt32/32768/50        3.69 us         3.69 us       190248 bytes_per_second=8.27705G/s null_percent=50 size=32.768k
   SumKernelInt64/32768/0         1.92 us         1.91 us       368662 bytes_per_second=15.9508G/s null_percent=0 size=32.768k
   SumKernelInt64/32768/1         2.05 us         2.05 us       340168 bytes_per_second=14.8684G/s null_percent=1 size=32.768k
   SumKernelInt64/32768/10        2.16 us         2.16 us       323585 bytes_per_second=14.1164G/s null_percent=10 size=32.768k
   SumKernelInt64/32768/50        2.41 us         2.41 us       291073 bytes_per_second=12.6873G/s null_percent=50 size=32.768k
   ```
   
   After:
   ```
   SumKernelFloat/32768/0         2.27 us         2.27 us       307928 bytes_per_second=13.438G/s null_percent=0 size=32.768k
   SumKernelFloat/32768/1         4.59 us         4.59 us       152827 bytes_per_second=6.6508G/s null_percent=1 size=32.768k
   SumKernelFloat/32768/10        5.30 us         5.29 us       132106 bytes_per_second=5.76658G/s null_percent=10 size=32.768k
   SumKernelFloat/32768/50        5.80 us         5.80 us       114378 bytes_per_second=5.26584G/s null_percent=50 size=32.768k
   SumKernelDouble/32768/0        1.42 us         1.42 us       494426 bytes_per_second=21.5265G/s null_percent=0 size=32.768k
   SumKernelDouble/32768/1        2.12 us         2.12 us       330890 bytes_per_second=14.4268G/s null_percent=1 size=32.768k
   SumKernelDouble/32768/10       2.44 us         2.43 us       286310 bytes_per_second=12.5441G/s null_percent=10 size=32.768k
   SumKernelDouble/32768/50       2.72 us         2.71 us       257105 bytes_per_second=11.2507G/s null_percent=50 size=32.768k
   SumKernelInt8/32768/0          5.35 us         5.34 us       130751 bytes_per_second=5.71315G/s null_percent=0 size=32.768k
   SumKernelInt8/32768/1          9.80 us         9.79 us        71384 bytes_per_second=3.11589G/s null_percent=1 size=32.768k
   SumKernelInt8/32768/10         13.9 us         13.9 us        49729 bytes_per_second=2.19116G/s null_percent=10 size=32.768k
   SumKernelInt8/32768/50         12.5 us         12.5 us        55929 bytes_per_second=2.43479G/s null_percent=50 size=32.768k
   SumKernelInt16/32768/0         3.20 us         3.19 us       218923 bytes_per_second=9.55594G/s null_percent=0 size=32.768k
   SumKernelInt16/32768/1         5.31 us         5.31 us       131394 bytes_per_second=5.75174G/s null_percent=1 size=32.768k
   SumKernelInt16/32768/10        6.20 us         6.19 us       113037 bytes_per_second=4.92965G/s null_percent=10 size=32.768k
   SumKernelInt16/32768/50        7.25 us         7.24 us        96604 bytes_per_second=4.21535G/s null_percent=50 size=32.768k
   SumKernelInt32/32768/0         2.18 us         2.18 us       321572 bytes_per_second=14.0037G/s null_percent=0 size=32.768k
   SumKernelInt32/32768/1         3.32 us         3.32 us       209911 bytes_per_second=9.18857G/s null_percent=1 size=32.768k
   SumKernelInt32/32768/10        3.59 us         3.58 us       195106 bytes_per_second=8.51472G/s null_percent=10 size=32.768k
   SumKernelInt32/32768/50        3.83 us         3.82 us       182739 bytes_per_second=7.98056G/s null_percent=50 size=32.768k
   SumKernelInt64/32768/0         1.37 us         1.37 us       514237 bytes_per_second=22.3564G/s null_percent=0 size=32.768k
   SumKernelInt64/32768/1         2.09 us         2.09 us       333678 bytes_per_second=14.5962G/s null_percent=1 size=32.768k
   SumKernelInt64/32768/10        2.18 us         2.18 us       320094 bytes_per_second=13.9904G/s null_percent=10 size=32.768k
   SumKernelInt64/32768/50        2.41 us         2.40 us       289766 bytes_per_second=12.6907G/s null_percent=50 size=32.768k
   ```
   
   The dense path improves for all data types; for example, Double jumps from 15.4906G/s to 21.5265G/s.
   
   I will look into the sparse parts later, as they need an additional step to remove invalid values before passing the data to the SIMD add operations: a shuffle op that replaces each invalid value with zero.
   
   The dense part can also be sped up further with AVX2/AVX512, which is future work as well.
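   The zeroing idea above can be sketched in scalar C++ (an illustrative sketch only, not the PR's SIMD code; `MaskedSum` is a hypothetical helper):
   
   ```cpp
   #include <cstdint>
   #include <iostream>
   
   // Sum only the values whose bit is set in the validity bitmap, by
   // replacing invalid slots with zero before accumulating. A SIMD version
   // would do the same per lane with a shuffle/blend against a zero vector.
   double MaskedSum(const double* values, const uint8_t* valid_bitmap, int64_t n) {
     double sum = 0.0;
     for (int64_t i = 0; i < n; ++i) {
       const bool valid = (valid_bitmap[i / 8] >> (i % 8)) & 1;
       sum += valid ? values[i] : 0.0;  // invalid lanes contribute zero
     }
     return sum;
   }
   
   int main() {
     double values[8] = {1, 2, 3, 4, 5, 6, 7, 8};
     uint8_t bitmap[1] = {0x55};  // binary 01010101: slots 0, 2, 4, 6 valid
     std::cout << MaskedSum(values, bitmap, 8) << std::endl;  // prints 16
     return 0;
   }
   ```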





[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639009091


   ```
   no such option: --cc
   ```





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645294042


   The numbers now look good for the sparse part on the benchmark device as well, even though my local device gets an even better speedup thanks to its larger L2/L3 caches. I am not sure whether this is a good direction, given that the compiler can do less with the bitmap-based path.
   
   @emkornfield and others, thoughts?





[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637216678


   [AMD64 Ubuntu 18.04 C++ Benchmark (#108899)](https://ci.ursalabs.org/#builders/73/builds/70) builder has succeeded.
   
   Revision: d789801a5e83b9717b8f36d72189198f5527b757
   
   ```diff
     ========================  ==============  ==============  ========
     benchmark                 baseline        contender       change
     ========================  ==============  ==============  ========
     SumKernelInt64/32768/1    10.639 GiB/sec  10.196 GiB/sec  -4.168%
     SumKernelInt8/32768/10    1.490 GiB/sec   1.445 GiB/sec   -3.007%
     SumKernelFloat/32768/0    6.407 GiB/sec   8.436 GiB/sec   31.670%
     SumKernelInt32/32768/0    5.307 GiB/sec   9.227 GiB/sec   73.863%
     SumKernelDouble/32768/10  8.489 GiB/sec   8.436 GiB/sec   -0.616%
     SumKernelInt32/32768/50   5.451 GiB/sec   5.242 GiB/sec   -3.841%
     SumKernelDouble/32768/1   9.690 GiB/sec   9.568 GiB/sec   -1.255%
     SumKernelInt8/32768/0     1.782 GiB/sec   3.726 GiB/sec   109.147%
   - SumKernelFloat/32768/1    4.899 GiB/sec   4.583 GiB/sec   -6.447%
   - SumKernelInt8/32768/1     2.111 GiB/sec   1.917 GiB/sec   -9.222%
     SumKernelInt32/32768/1    7.389 GiB/sec   7.281 GiB/sec   -1.470%
   - SumKernelInt16/32768/1    4.350 GiB/sec   3.855 GiB/sec   -11.363%
     SumKernelDouble/32768/0   10.647 GiB/sec  14.048 GiB/sec  31.949%
     SumKernelInt8/32768/50    1.641 GiB/sec   1.580 GiB/sec   -3.726%
     SumKernelFloat/32768/50   2.518 GiB/sec   2.545 GiB/sec   1.052%
     SumKernelInt16/32768/0    2.953 GiB/sec   6.038 GiB/sec   104.455%
     SumKernelInt64/32768/50   8.469 GiB/sec   8.439 GiB/sec   -0.349%
     SumKernelInt64/32768/0    11.243 GiB/sec  15.118 GiB/sec  34.462%
     SumKernelInt16/32768/50   3.070 GiB/sec   2.960 GiB/sec   -3.604%
     SumKernelInt64/32768/10   9.242 GiB/sec   9.336 GiB/sec   1.020%
   - SumKernelInt16/32768/10   3.638 GiB/sec   3.390 GiB/sec   -6.834%
     SumKernelInt32/32768/10   6.253 GiB/sec   6.012 GiB/sec   -3.844%
     SumKernelFloat/32768/10   4.111 GiB/sec   3.935 GiB/sec   -4.279%
     SumKernelDouble/32768/50  7.446 GiB/sec   7.455 GiB/sec   0.133%
     ========================  ==============  ==============  ========
   ```





[GitHub] [arrow] pitrou commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-638970620


   This looks generally quite complicated. If we need 500 additional lines of code to micro-optimize the Sum kernel for a single SIMD instruction set (nevermind that we may also want versions for AVX2, Neon, SVE, and whatnot), things will quickly get out of hand.
   
   If we want to go the way of per-kernel SIMD optimizations, it may be useful to investigate SIMD helper libraries (such as [libsimdpp](https://github.com/p12tic/libsimdpp), [xsimd](https://xsimd.readthedocs.io/en/latest/)...).





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-641109277


   Just found a document (https://dl.acm.org/doi/pdf/10.1145/3178433.3178435, page 7, Table 6: Comparison of various SIMD wrappers) showing that some SIMD helpers (simdpp/xsimd) have performance issues, at least on some workloads.
   
   Another issue is that most SIMD helpers have no runtime dispatch support, which means we would still have to build the same code (assuming a common code path even exists for a given function) multiple times in Arrow itself to provide the runtime capability.
   
   I am also working on the sparse part of aggregate sum; its data flow is totally different between AVX2 and AVX512. AVX512 provides _mm512_mask_add_pd(__m512d src, __mmask8 k, __m512d a, __m512d b), so it can SIMD-add the results directly under the validity bitmap. AVX2, by contrast, has to use a lookup-table mask and a SIMD AND to zero the invalid values before passing them to the SIMD add. The same difference will apply to other future SIMD functions as well, since all Arrow data is represented with a validity bitmap.
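   For reference, the lane semantics of that masked add can be modeled in scalar code (a behavioral sketch of the intrinsic, not actual AVX512 code):
   
   ```cpp
   #include <cstdint>
   #include <iostream>
   
   // Scalar model of _mm512_mask_add_pd(src, k, a, b): each of the 8 double
   // lanes gets a[i] + b[i] when mask bit i is set, otherwise src[i] unchanged.
   void MaskAddPd(double* result, const double* src, uint8_t k,
                  const double* a, const double* b) {
     for (int i = 0; i < 8; ++i) {
       result[i] = ((k >> i) & 1) ? a[i] + b[i] : src[i];
     }
   }
   
   int main() {
     double acc[8] = {0};  // running sums, all zero
     double vals[8] = {1, 2, 3, 4, 5, 6, 7, 8};
     double out[8];
     MaskAddPd(out, acc, 0x81, acc, vals);  // mask 10000001: lanes 0 and 7 valid
     double total = 0;
     for (int i = 0; i < 8; ++i) total += out[i];
     std::cout << total << std::endl;  // prints 9 (1 + 8)
     return 0;
   }
   ```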
   





[GitHub] [arrow] jianxind commented on a change in pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#discussion_r437804878



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_sse.cc
##########
@@ -0,0 +1,355 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_aggregate.h"
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+#include "arrow/compute/kernels/aggregate_internal.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/align_util.h"
+#include "arrow/util/simd.h"
+
+TARGET_CODE_START_SSE4_2
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+// ----------------------------------------------------------------------
+// Sum implementation for SSE
+
+// Each m128 stream handle 2 double/int64 accumulator type, one batch has 4 streams.
+static constexpr int kSseBatchStreams = 4;
+static constexpr int kSseStreamSize = sizeof(__m128d) / sizeof(double);
+static constexpr int kSseBatchSize = kSseBatchStreams * kSseStreamSize;
+
+// Default scalar version
+template <typename T, typename SumT>
+inline SumResult<SumT> SumDenseBatchSse(const T* values, int64_t num_batch) {
+  SumResult<SumT> sum_result;
+  SumT sum_streams[kSseBatchSize] = {0};
+
+  // Add the results by streams
+  for (int64_t batch = 0; batch < num_batch; batch++) {
+    for (int i = 0; i < kSseBatchSize; i++) {
+      sum_streams[i] += values[(batch * kSseBatchSize) + i];
+    }
+  }
+
+  // Aggregate the result streams
+  for (int i = 0; i < kSseBatchSize; i++) {
+    sum_result.sum += sum_streams[i];
+  }
+  sum_result.count = num_batch * kSseBatchSize;
+  return sum_result;
+}
+
+// Dense helper for accumulator type is same to data type
+#define SUM_DENSE_BATCH_SSE_DIRECT(Type, SumSimdType, SimdZeroFn, SimdLoadFn, SimdAddFn) \
+  template <>                                                                            \
+  inline SumResult<Type> SumDenseBatchSse(const Type* values, int64_t num_batch) {       \
+    SumResult<Type> sum_result;                                                          \
+    SumSimdType results_simd[kSseBatchStreams];                                          \
+    for (int i = 0; i < kSseBatchStreams; i++) {                                         \
+      results_simd[i] = SimdZeroFn();                                                    \
+    }                                                                                    \
+                                                                                         \
+    /* Add the values to result streams */                                               \
+    for (int64_t batch = 0; batch < num_batch; batch++) {                                \
+      for (int i = 0; i < kSseBatchStreams; i++) {                                       \
+        const auto src_simd =                                                            \
+            SimdLoadFn(&values[batch * kSseBatchSize + kSseStreamSize * i]);             \
+        results_simd[i] = SimdAddFn(src_simd, results_simd[i]);                          \
+      }                                                                                  \
+    }                                                                                    \

Review comment:
       Thanks, the code vectorized by clang is really good. But it seems the GCC output only uses zmm0, which may limit performance. This PR only targets the dense part; the final goal is to extend support to the sparse part as well. Do you know whether a similar approach can be applied to the sparse case? Example scalar code:
    
    ```c
    inline T MaskedValue(bool valid, T value) const { return valid ? value : 0; }
    
    for (size_t i = 0; i < 8; i++) {
      local.sum += MaskedValue(bits & (1U << i), values[i]);
    }
    local.count += BitUtil::kBytePopcount[bits];
    ```







[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643106609


   @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637212463


   @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639285090


   > This looks generally quite complicated. If we need 500 additional lines of code to micro-optimize the Sum kernel for a single SIMD instruction set (nevermind that we may also want versions for AVX2, Neon, SVE, and whatnot), things will quickly get out of hand.
   > 
   > If we want to go the way of per-kernel SIMD optimizations, it may be useful to investigate SIMD helper libraries (such as [libsimdpp](https://github.com/p12tic/libsimdpp), [xsimd](https://xsimd.readthedocs.io/en/latest/)...).
   
   libsimdpp says it supports runtime dispatch on a per-function basis. But I don't know whether the quality of these libraries is good enough, or whether they have been qualified on every target; any bug there would make debugging much harder for me. Another concern is performance: will the wrapper introduce extra cost? SIMD code and parameters are usually tuned carefully for each architecture to get the best performance, so I doubt a single common code path can fit all targets.
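   For context, function-level runtime dispatch of the kind libsimdpp advertises usually boils down to picking an implementation once via a CPU feature check; a simplified sketch (the feature check is stubbed out here, and Arrow's actual dispatch machinery differs):
   
   ```cpp
   #include <iostream>
   
   // Two candidate implementations with the same kernel signature.
   double SumScalar(const double* v, int n) {
     double s = 0;
     for (int i = 0; i < n; ++i) s += v[i];
     return s;
   }
   double SumSimd(const double* v, int n) {
     // Stand-in for a vectorized body; same result, different codegen.
     return SumScalar(v, n);
   }
   
   // Stubbed feature check; real code would query CPUID (e.g. for SSE4.2).
   bool HasSimdSupport() { return true; }
   
   // Resolve the function pointer once, then call through it.
   using SumFn = double (*)(const double*, int);
   SumFn ResolveSum() { return HasSimdSupport() ? SumSimd : SumScalar; }
   
   int main() {
     static const SumFn sum = ResolveSum();
     const double v[4] = {1.5, 2.5, 3.0, 3.0};
     std::cout << sum(v, 4) << std::endl;  // prints 10
     return 0;
   }
   ```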





[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645246111


   @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark





[GitHub] [arrow] fsaintjacques commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639009079


   @ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --cc=clang --cxx=clang++





[GitHub] [arrow] wesm edited a comment on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637218125


   I'll try to review this in the next couple days. @pitrou may be able to help also





[GitHub] [arrow] jedbrown commented on a change in pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
jedbrown commented on a change in pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#discussion_r437479652



##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_sse.cc
##########
@@ -0,0 +1,355 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_aggregate.h"
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+#include "arrow/compute/kernels/aggregate_internal.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/align_util.h"
+#include "arrow/util/simd.h"
+
+TARGET_CODE_START_SSE4_2
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+// ----------------------------------------------------------------------
+// Sum implementation for SSE
+
+// Each m128 stream handle 2 double/int64 accumulator type, one batch has 4 streams.
+static constexpr int kSseBatchStreams = 4;
+static constexpr int kSseStreamSize = sizeof(__m128d) / sizeof(double);
+static constexpr int kSseBatchSize = kSseBatchStreams * kSseStreamSize;
+
+// Default scalar version
+template <typename T, typename SumT>
+inline SumResult<SumT> SumDenseBatchSse(const T* values, int64_t num_batch) {
+  SumResult<SumT> sum_result;
+  SumT sum_streams[kSseBatchSize] = {0};
+
+  // Add the results by streams
+  for (int64_t batch = 0; batch < num_batch; batch++) {
+    for (int i = 0; i < kSseBatchSize; i++) {
+      sum_streams[i] += values[(batch * kSseBatchSize) + i];
+    }
+  }
+
+  // Aggregate the result streams
+  for (int i = 0; i < kSseBatchSize; i++) {
+    sum_result.sum += sum_streams[i];
+  }
+  sum_result.count = num_batch * kSseBatchSize;
+  return sum_result;
+}
+
+// Dense helper for accumulator type is same to data type
+#define SUM_DENSE_BATCH_SSE_DIRECT(Type, SumSimdType, SimdZeroFn, SimdLoadFn, SimdAddFn) \
+  template <>                                                                            \
+  inline SumResult<Type> SumDenseBatchSse(const Type* values, int64_t num_batch) {       \
+    SumResult<Type> sum_result;                                                          \
+    SumSimdType results_simd[kSseBatchStreams];                                          \
+    for (int i = 0; i < kSseBatchStreams; i++) {                                         \
+      results_simd[i] = SimdZeroFn();                                                    \
+    }                                                                                    \
+                                                                                         \
+    /* Add the values to result streams */                                               \
+    for (int64_t batch = 0; batch < num_batch; batch++) {                                \
+      for (int i = 0; i < kSseBatchStreams; i++) {                                       \
+        const auto src_simd =                                                            \
+            SimdLoadFn(&values[batch * kSseBatchSize + kSseStreamSize * i]);             \
+        results_simd[i] = SimdAddFn(src_simd, results_simd[i]);                          \
+      }                                                                                  \
+    }                                                                                    \

Review comment:
       Have you considered using OpenMP SIMD (i.e., `-fopenmp-simd`, which does not do anything with threads or require linking OpenMP)?
   ```c
   double sum1(int64_t n, const double *values) {
       double sum = 0;
       #pragma omp simd reduction(+:sum)
       for (int64_t i = 0; i < n; i++)
           sum += values[i];
       return sum;
   }
   ```
   produces good vectorized code (probably excessively unrolled by clang)
   ```
           vaddpd  zmm0, zmm0, zmmword ptr [rsi + 8*rdx]
           vaddpd  zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 64]
           vaddpd  zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 128]
           vaddpd  zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 192]
           vaddpd  zmm0, zmm0, zmmword ptr [rsi + 8*rdx + 256]
           vaddpd  zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 320]
           vaddpd  zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 384]
           vaddpd  zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 448]
           vaddpd  zmm0, zmm0, zmmword ptr [rsi + 8*rdx + 512]
           vaddpd  zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 576]
           vaddpd  zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 640]
           vaddpd  zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 704]
           vaddpd  zmm0, zmm0, zmmword ptr [rsi + 8*rdx + 768]
           vaddpd  zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 832]
           vaddpd  zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 896]
           vaddpd  zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 960]
           .loc    1 6 5                   # ./example.c:6:5
           sub     rdx, -128
           add     rcx, -4
           jne     .LBB0_7
   ```
   https://gcc.godbolt.org/z/6mD6oy




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-638885045


   I'll look at this today





[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645256638


   [AMD64 Ubuntu 18.04 C++ Benchmark (#112803)](https://ci.ursalabs.org/#builders/73/builds/80) builder has succeeded.
   
   Revision: 665c6cdb30f0ec372fa71de3e47bd5951183fe7d
   
   ```diff
     ===========================  =============  ==============  ========
     benchmark                    baseline       contender       change
     ===========================  =============  ==============  ========
     SumKernelDouble/32768/2      6.910 GiB/sec  10.005 GiB/sec  44.801%
     SumKernelInt16/32768/2       2.952 GiB/sec  6.441 GiB/sec   118.182%
     SumKernelInt32/32768/2       5.090 GiB/sec  7.097 GiB/sec   39.433%
     SumKernelDouble/32768/0      9.551 GiB/sec  13.401 GiB/sec  40.317%
     SumKernelDouble/32768/10000  9.572 GiB/sec  13.236 GiB/sec  38.276%
     SumKernelInt64/32768/1       7.936 GiB/sec  10.750 GiB/sec  35.453%
     SumKernelInt16/32768/10      3.493 GiB/sec  6.395 GiB/sec   83.081%
     SumKernelFloat/32768/0       5.966 GiB/sec  9.990 GiB/sec   67.445%
     SumKernelInt8/32768/1        1.641 GiB/sec  2.732 GiB/sec   66.423%
     SumKernelInt64/32768/0       9.878 GiB/sec  14.459 GiB/sec  46.376%
     SumKernelFloat/32768/100     4.616 GiB/sec  6.404 GiB/sec   38.746%
     SumKernelInt64/32768/10000   9.829 GiB/sec  14.577 GiB/sec  48.310%
     SumKernelInt16/32768/100     4.090 GiB/sec  6.389 GiB/sec   56.214%
     SumKernelDouble/32768/10     7.794 GiB/sec  9.882 GiB/sec   26.798%
     SumKernelInt16/32768/0       2.845 GiB/sec  7.555 GiB/sec   165.554%
     SumKernelFloat/32768/10      3.928 GiB/sec  6.381 GiB/sec   62.441%
     SumKernelFloat/32768/10000   5.946 GiB/sec  9.947 GiB/sec   67.286%
     SumKernelInt32/32768/1       5.208 GiB/sec  7.282 GiB/sec   39.809%
     SumKernelInt8/32768/100      2.252 GiB/sec  2.654 GiB/sec   17.841%
     SumKernelInt8/32768/2        1.607 GiB/sec  2.656 GiB/sec   65.290%
     SumKernelInt32/32768/10      5.778 GiB/sec  7.096 GiB/sec   22.813%
     SumKernelInt32/32768/100     6.732 GiB/sec  7.098 GiB/sec   5.427%
     SumKernelFloat/32768/2       2.386 GiB/sec  6.396 GiB/sec   168.068%
     SumKernelInt8/32768/10000    2.586 GiB/sec  2.652 GiB/sec   2.561%
     SumKernelInt64/32768/100     9.319 GiB/sec  10.627 GiB/sec  14.034%
     SumKernelDouble/32768/100    8.789 GiB/sec  9.860 GiB/sec   12.180%
     SumKernelInt16/32768/10000   4.519 GiB/sec  6.381 GiB/sec   41.217%
     SumKernelInt8/32768/0        1.745 GiB/sec  4.701 GiB/sec   169.458%
     SumKernelInt32/32768/10000   4.953 GiB/sec  10.825 GiB/sec  118.563%
     SumKernelInt64/32768/10      8.571 GiB/sec  10.629 GiB/sec  24.013%
     SumKernelInt8/32768/10       1.687 GiB/sec  2.656 GiB/sec   57.429%
     SumKernelInt64/32768/2       7.717 GiB/sec  10.658 GiB/sec  38.109%
     SumKernelInt32/32768/0       4.961 GiB/sec  10.791 GiB/sec  117.521%
     SumKernelDouble/32768/1      7.669 GiB/sec  10.079 GiB/sec  31.424%
     SumKernelInt16/32768/1       3.018 GiB/sec  6.589 GiB/sec   118.362%
     SumKernelFloat/32768/1       3.928 GiB/sec  6.600 GiB/sec   68.035%
     ===========================  =============  ==============  ========
   ```





[GitHub] [arrow] jianxind closed pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel

Posted by GitBox <gi...@apache.org>.
jianxind closed pull request #7314:
URL: https://github.com/apache/arrow/pull/7314


   





[GitHub] [arrow] emkornfield commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-647228634


   I think we also need a way of setting max runtime instruction set for runtime dispatch (apologies if there is one and I missed it)
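
   One possible shape for that — a sketch only; the environment-variable name, enum values, and function here are illustrative, not an existing Arrow API — is to clamp the hardware-detected feature level by a user-supplied cap read from the environment:

```cpp
#include <cstdlib>
#include <cstring>

// Illustrative dispatch levels, ordered from least to most capable.
enum class SimdLevel { kNone = 0, kSse42 = 1, kAvx2 = 2, kAvx512 = 3 };

// Clamp the detected level by an optional user cap, e.g.
// ARROW_USER_SIMD_LEVEL=AVX2 (the variable name is an assumption).
// With no cap set, the hardware-detected level is used unchanged.
SimdLevel EffectiveSimdLevel(SimdLevel detected) {
  const char* cap = std::getenv("ARROW_USER_SIMD_LEVEL");
  if (cap == nullptr) return detected;
  SimdLevel max_level = SimdLevel::kNone;
  if (std::strcmp(cap, "SSE4_2") == 0) {
    max_level = SimdLevel::kSse42;
  } else if (std::strcmp(cap, "AVX2") == 0) {
    max_level = SimdLevel::kAvx2;
  } else if (std::strcmp(cap, "AVX512") == 0) {
    max_level = SimdLevel::kAvx512;
  }
  return detected < max_level ? detected : max_level;
}
```

   Kernel registration would then consult the effective level rather than the raw CPU feature flags, so users (and CI) can force the scalar or SSE path on AVX-capable machines.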

