Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/01 07:39:00 UTC
[GitHub] [arrow] jianxind opened a new pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
jianxind opened a new pull request #7314:
URL: https://github.com/apache/arrow/pull/7314
1. Add an SSE intrinsics path for the dense aggregate sum; the sparse part will follow in a later PR.
2. Add build support to append the required compiler flag for SIMD source files.
3. Register the SSE version at runtime based on the detected CPU features.
4. Also add a test case to cover overflow on the data types which need a right shift on the accumulator type.
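Conceptually, the runtime registration in point 3 amounts to choosing a function pointer once at startup based on the detected CPU features. A minimal self-contained sketch of that idea (the names below are illustrative, not Arrow's actual API):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch only; these names are not Arrow's actual API.
using SumFn = int64_t (*)(const int64_t*, int64_t);

int64_t SumScalar(const int64_t* values, int64_t n) {
  int64_t sum = 0;
  for (int64_t i = 0; i < n; ++i) sum += values[i];
  return sum;
}

// Stand-in for the SSE intrinsics build; same contract as the scalar path.
int64_t SumSse(const int64_t* values, int64_t n) { return SumScalar(values, n); }

// Placeholder for a CpuInfo-style feature query.
bool CpuHasSse42() { return false; }

// The kernel registry would call this once and cache the result.
SumFn PickSumKernel() { return CpuHasSse42() ? SumSse : SumScalar; }
```

Either branch satisfies the same contract, so callers never observe which variant was selected.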
Signed-off-by: Frank Du <fr...@intel.com>
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel
emkornfield commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-647081849
@jianxind please see my e-mail. I think we should try to split this PR up into runtime dispatch using a slightly tweaked version of the current code without intrinsics. I will take a closer look at what is being done for sparse, but if you aren't using BitBlockCounter we should incorporate that as a first step, so that most sums can delegate to the dense path.
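For context, the idea behind delegating to the dense path is to scan the validity bitmap in blocks, so that all-valid blocks take the fast path. A rough self-contained sketch of the concept (this is not Arrow's actual BitBlockCounter API):

```cpp
#include <cassert>
#include <cstdint>

// Rough sketch of the block-scanning idea (not Arrow's BitBlockCounter API):
// process the validity bitmap one 64-bit word at a time. An all-ones word
// means 64 consecutive valid values, so that block can delegate to the dense
// sum; mixed words fall back to a per-bit check.
int64_t SumWithValidity(const int64_t* values, const uint64_t* validity,
                        int64_t num_words) {
  int64_t sum = 0;
  for (int64_t w = 0; w < num_words; ++w) {
    const int64_t* block = values + w * 64;
    if (validity[w] == ~uint64_t{0}) {
      for (int i = 0; i < 64; ++i) sum += block[i];  // dense fast path
    } else {
      for (int i = 0; i < 64; ++i) {
        if (validity[w] & (uint64_t{1} << i)) sum += block[i];
      }
    }
  }
  return sum;
}
```

With mostly-valid data nearly every word hits the dense branch, which is where a vectorized dense kernel would pay off.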
----------------------------------------------------------------
[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643959192
[AMD64 Ubuntu 18.04 C++ Benchmark (#111968)](https://ci.ursalabs.org/#builders/73/builds/74) builder has succeeded.
Revision: f0f0fc1c44a1e79970afea97a540bc44ffbc9d66
```diff
=========================== ============== =============== ========
benchmark baseline contender change
=========================== ============== =============== ========
SumKernelDouble/32768/10 8.514 GiB/sec 11.180 GiB/sec 31.311%
SumKernelFloat/32768/100 4.818 GiB/sec 6.487 GiB/sec 34.653%
SumKernelInt64/32768/10000 11.216 GiB/sec 17.700 GiB/sec 57.811%
- SumKernelInt8/32768/10 1.516 GiB/sec 647.895 MiB/sec -58.274%
SumKernelDouble/32768/0 10.584 GiB/sec 16.200 GiB/sec 53.064%
SumKernelDouble/32768/100 9.574 GiB/sec 11.165 GiB/sec 16.611%
SumKernelInt64/32768/2 8.408 GiB/sec 11.962 GiB/sec 42.275%
SumKernelInt64/32768/100 10.523 GiB/sec 11.937 GiB/sec 13.440%
- SumKernelInt16/32768/10000 4.707 GiB/sec 1.679 GiB/sec -64.334%
- SumKernelInt16/32768/2 3.029 GiB/sec 1.681 GiB/sec -44.494%
SumKernelInt32/32768/2 5.340 GiB/sec 6.837 GiB/sec 28.049%
- SumKernelInt8/32768/100 2.296 GiB/sec 648.265 MiB/sec -72.424%
SumKernelInt32/32768/10000 5.242 GiB/sec 12.554 GiB/sec 139.507%
- SumKernelInt8/32768/1 1.670 GiB/sec 650.501 MiB/sec -61.956%
SumKernelInt16/32768/0 2.917 GiB/sec 8.116 GiB/sec 178.227%
- SumKernelInt8/32768/2 1.633 GiB/sec 647.953 MiB/sec -61.258%
- SumKernelInt8/32768/10000 2.647 GiB/sec 648.361 MiB/sec -76.083%
SumKernelInt32/32768/0 5.237 GiB/sec 12.586 GiB/sec 140.334%
SumKernelInt32/32768/10 6.134 GiB/sec 6.844 GiB/sec 11.577%
- SumKernelInt16/32768/100 3.937 GiB/sec 1.677 GiB/sec -57.406%
SumKernelInt8/32768/0 1.772 GiB/sec 4.985 GiB/sec 181.275%
SumKernelInt32/32768/1 5.466 GiB/sec 7.039 GiB/sec 28.774%
SumKernelInt32/32768/100 7.199 GiB/sec 6.858 GiB/sec -4.747%
SumKernelInt64/32768/1 8.591 GiB/sec 12.356 GiB/sec 43.822%
SumKernelFloat/32768/10000 6.387 GiB/sec 11.708 GiB/sec 83.320%
SumKernelDouble/32768/1 8.292 GiB/sec 11.524 GiB/sec 38.977%
SumKernelFloat/32768/0 6.367 GiB/sec 11.597 GiB/sec 82.144%
SumKernelInt64/32768/10 9.388 GiB/sec 11.948 GiB/sec 27.258%
SumKernelDouble/32768/2 7.300 GiB/sec 11.174 GiB/sec 53.070%
SumKernelFloat/32768/10 4.043 GiB/sec 6.488 GiB/sec 60.492%
SumKernelInt64/32768/0 11.229 GiB/sec 17.780 GiB/sec 58.336%
- SumKernelInt16/32768/1 3.098 GiB/sec 1.696 GiB/sec -45.248%
- SumKernelInt16/32768/10 3.364 GiB/sec 1.678 GiB/sec -50.122%
SumKernelFloat/32768/2 2.474 GiB/sec 6.453 GiB/sec 160.789%
SumKernelDouble/32768/10000 10.535 GiB/sec 16.165 GiB/sec 53.444%
SumKernelFloat/32768/1 3.863 GiB/sec 6.663 GiB/sec 72.458%
=========================== ============== =============== ========
```
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-647334741
> I think we also need a way of setting max runtime instruction set for runtime dispatch (apologies if there is one and I missed it)
Thanks, currently there is not. But it can easily be supported by hooking into CpuInfo::Init() and tweaking hardware_flags_ according to a user environment setting, something like "ARROW_RUNTIME_SIMD_LEVEL=none make unittest".
For the split PR without intrinsics, I will try to work on this. Currently I use compiler pragmas (clang attribute push or GCC push_options) to support the intrinsic APIs; the macro is defined as empty for MSVC, since MSVC supports using intrinsics directly without a compiler flag. I need to find a new method for the approach without intrinsics; maybe something like set_source_files_properties can help here.
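A hedged sketch of that hook, assuming an illustrative flag encoding and reusing the ARROW_RUNTIME_SIMD_LEVEL variable name from above (this is not the actual CpuInfo implementation):

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Illustrative flag bits; the real hardware_flags_ encoding differs.
enum SimdFlag : int64_t { kSse42 = 1, kAvx2 = 2, kAvx512 = 4 };

// Clamp the detected hardware flags from the environment, as CpuInfo::Init()
// could do before any kernel registers its SIMD variants.
int64_t ClampSimdFlags(int64_t hardware_flags) {
  const char* level = std::getenv("ARROW_RUNTIME_SIMD_LEVEL");
  if (level == nullptr) return hardware_flags;      // no override requested
  if (std::strcmp(level, "none") == 0) return 0;    // scalar only
  if (std::strcmp(level, "sse4_2") == 0) return hardware_flags & kSse42;
  if (std::strcmp(level, "avx2") == 0) return hardware_flags & (kSse42 | kAvx2);
  return hardware_flags;  // unknown value: keep hardware detection as-is
}
```

Because kernels consult the clamped flags at registration time, "ARROW_RUNTIME_SIMD_LEVEL=none" would force the scalar paths everywhere.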
----------------------------------------------------------------
[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645135876
[AMD64 Ubuntu 18.04 C++ Benchmark (#112762)](https://ci.ursalabs.org/#builders/73/builds/79) builder has succeeded.
Revision: 525caea882fe49c0248932fff77df6bcd3f2f477
```diff
=========================== ============= =============== ========
benchmark baseline contender change
=========================== ============= =============== ========
SumKernelFloat/32768/10000 5.958 GiB/sec 10.208 GiB/sec 71.337%
SumKernelInt64/32768/1 7.938 GiB/sec 10.783 GiB/sec 35.851%
SumKernelInt32/32768/0 4.958 GiB/sec 10.832 GiB/sec 118.486%
SumKernelInt64/32768/0 9.704 GiB/sec 14.764 GiB/sec 52.146%
SumKernelFloat/32768/2 2.389 GiB/sec 5.767 GiB/sec 141.401%
- SumKernelInt16/32768/2 2.953 GiB/sec 1.696 GiB/sec -42.565%
SumKernelDouble/32768/0 9.462 GiB/sec 13.540 GiB/sec 43.102%
SumKernelInt32/32768/2 5.074 GiB/sec 6.403 GiB/sec 26.201%
- SumKernelInt16/32768/10000 4.520 GiB/sec 1.693 GiB/sec -62.552%
SumKernelFloat/32768/100 4.644 GiB/sec 5.807 GiB/sec 25.051%
SumKernelInt64/32768/10000 9.778 GiB/sec 14.784 GiB/sec 51.203%
SumKernelDouble/32768/1 7.595 GiB/sec 10.131 GiB/sec 33.400%
SumKernelFloat/32768/1 3.920 GiB/sec 5.938 GiB/sec 51.490%
SumKernelInt64/32768/10 8.461 GiB/sec 10.548 GiB/sec 24.655%
- SumKernelInt8/32768/10 1.680 GiB/sec 674.552 MiB/sec -60.783%
SumKernelDouble/32768/10 7.748 GiB/sec 9.813 GiB/sec 26.645%
- SumKernelInt8/32768/1 1.643 GiB/sec 678.806 MiB/sec -59.643%
SumKernelInt8/32768/0 1.745 GiB/sec 4.709 GiB/sec 169.817%
SumKernelInt32/32768/10000 4.959 GiB/sec 10.884 GiB/sec 119.497%
SumKernelInt64/32768/100 9.294 GiB/sec 10.545 GiB/sec 13.466%
- SumKernelInt8/32768/100 2.242 GiB/sec 674.145 MiB/sec -70.638%
SumKernelInt64/32768/2 7.674 GiB/sec 10.553 GiB/sec 37.529%
SumKernelInt16/32768/0 2.850 GiB/sec 7.599 GiB/sec 166.670%
SumKernelInt32/32768/10 5.754 GiB/sec 6.423 GiB/sec 11.627%
SumKernelInt32/32768/1 5.213 GiB/sec 6.577 GiB/sec 26.172%
SumKernelDouble/32768/100 8.731 GiB/sec 9.829 GiB/sec 12.578%
SumKernelFloat/32768/0 5.960 GiB/sec 10.156 GiB/sec 70.394%
- SumKernelInt16/32768/100 4.099 GiB/sec 1.694 GiB/sec -58.674%
- SumKernelInt8/32768/10000 2.580 GiB/sec 674.639 MiB/sec -74.465%
SumKernelDouble/32768/2 6.864 GiB/sec 9.912 GiB/sec 44.416%
SumKernelInt32/32768/100 6.731 GiB/sec 6.439 GiB/sec -4.334%
- SumKernelInt8/32768/2 1.604 GiB/sec 674.265 MiB/sec -58.959%
- SumKernelInt16/32768/10 3.491 GiB/sec 1.694 GiB/sec -51.473%
SumKernelFloat/32768/10 3.924 GiB/sec 5.794 GiB/sec 47.673%
SumKernelDouble/32768/10000 9.531 GiB/sec 13.404 GiB/sec 40.636%
- SumKernelInt16/32768/1 3.021 GiB/sec 1.710 GiB/sec -43.392%
=========================== ============= =============== ========
```
----------------------------------------------------------------
[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643115702
[AMD64 Ubuntu 18.04 C++ Benchmark (#111432)](https://ci.ursalabs.org/#builders/73/builds/73) builder has succeeded.
Revision: 23aee3e5f6fc4c3fac7a859df7c52f79033c1c3c
```diff
=========================== ============== ============== ========
benchmark baseline contender change
=========================== ============== ============== ========
SumKernelFloat/32768/0 6.432 GiB/sec 11.724 GiB/sec 82.266%
SumKernelFloat/32768/2 2.437 GiB/sec 5.917 GiB/sec 142.784%
- SumKernelInt32/32768/100 7.333 GiB/sec 6.729 GiB/sec -8.233%
SumKernelInt8/32768/10 1.473 GiB/sec 1.904 GiB/sec 29.274%
SumKernelDouble/32768/1 8.371 GiB/sec 9.245 GiB/sec 10.443%
SumKernelInt16/32768/10000 4.770 GiB/sec 4.791 GiB/sec 0.442%
SumKernelInt16/32768/1 3.122 GiB/sec 3.160 GiB/sec 1.234%
SumKernelInt64/32768/10000 11.233 GiB/sec 17.690 GiB/sec 57.480%
- SumKernelDouble/32768/100 9.800 GiB/sec 8.990 GiB/sec -8.265%
SumKernelInt32/32768/1 5.552 GiB/sec 6.909 GiB/sec 24.431%
- SumKernelInt64/32768/100 10.468 GiB/sec 9.236 GiB/sec -11.768%
SumKernelInt64/32768/1 8.737 GiB/sec 9.467 GiB/sec 8.361%
SumKernelFloat/32768/1 3.921 GiB/sec 6.064 GiB/sec 54.637%
SumKernelFloat/32768/10 4.075 GiB/sec 5.926 GiB/sec 45.401%
SumKernelFloat/32768/100 4.894 GiB/sec 5.928 GiB/sec 21.126%
SumKernelInt16/32768/100 4.284 GiB/sec 4.452 GiB/sec 3.937%
SumKernelInt32/32768/10000 5.280 GiB/sec 12.466 GiB/sec 136.108%
SumKernelDouble/32768/0 10.742 GiB/sec 16.079 GiB/sec 49.674%
SumKernelInt8/32768/10000 2.670 GiB/sec 2.675 GiB/sec 0.166%
SumKernelInt64/32768/0 11.205 GiB/sec 17.720 GiB/sec 58.141%
SumKernelInt64/32768/2 8.471 GiB/sec 9.229 GiB/sec 8.946%
SumKernelInt16/32768/0 2.931 GiB/sec 8.214 GiB/sec 180.229%
SumKernelInt8/32768/0 1.782 GiB/sec 5.000 GiB/sec 180.532%
SumKernelDouble/32768/2 7.364 GiB/sec 8.942 GiB/sec 21.436%
SumKernelInt8/32768/100 2.308 GiB/sec 2.424 GiB/sec 5.042%
SumKernelDouble/32768/10 8.647 GiB/sec 9.001 GiB/sec 4.090%
SumKernelInt16/32768/10 3.547 GiB/sec 3.610 GiB/sec 1.781%
SumKernelInt8/32768/1 1.678 GiB/sec 1.686 GiB/sec 0.483%
SumKernelFloat/32768/10000 6.445 GiB/sec 11.534 GiB/sec 78.960%
SumKernelDouble/32768/10000 10.762 GiB/sec 16.131 GiB/sec 49.884%
SumKernelInt64/32768/10 9.395 GiB/sec 9.240 GiB/sec -1.652%
SumKernelInt8/32768/2 1.640 GiB/sec 1.648 GiB/sec 0.468%
SumKernelInt16/32768/2 3.052 GiB/sec 3.088 GiB/sec 1.177%
SumKernelInt32/32768/0 5.283 GiB/sec 12.487 GiB/sec 136.345%
SumKernelInt32/32768/10 6.192 GiB/sec 6.697 GiB/sec 8.159%
SumKernelInt32/32768/2 5.438 GiB/sec 6.727 GiB/sec 23.720%
=========================== ============== ============== ========
```
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-636711082
@wesm Can you help review the approach for the runtime kernel path?
For the build part, I introduced compiler options (GCC push_options or clang attribute push) to help build the SIMD code without a common SIMD flag defined in the CXX flags.
For the SSE implementation, currently only the dense part is implemented with intrinsics; I need more work on how to remove the invalid values.
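The pragma technique can look roughly like this; a sketch modeled on (not copied from) the TARGET_CODE_START_SSE4_2 macro used in the patch, so the exact definitions may differ:

```cpp
// Sketch of per-function target pragmas (modeled on, not copied from, the
// patch). GCC and clang need a target override to emit SSE4.2 code from a
// translation unit compiled without -msse4.2; MSVC accepts the intrinsics
// directly, so the macros expand to nothing there.
#if defined(__clang__)
#define TARGET_CODE_START_SSE4_2 \
  _Pragma("clang attribute push(__attribute__((target(\"sse4.2\"))), apply_to=function)")
#define TARGET_CODE_STOP _Pragma("clang attribute pop")
#elif defined(__GNUC__)
#define TARGET_CODE_START_SSE4_2 \
  _Pragma("GCC push_options") _Pragma("GCC target(\"sse4.2\")")
#define TARGET_CODE_STOP _Pragma("GCC pop_options")
#else
#define TARGET_CODE_START_SSE4_2
#define TARGET_CODE_STOP
#endif

TARGET_CODE_START_SSE4_2
// Functions defined here may use SSE4.2 intrinsics even when the file is
// compiled without a global -msse4.2 flag.
int CompiledWithSse42Target() { return 42; }
TARGET_CODE_STOP
```

The push/pop pairing keeps the target override scoped to the bracketed functions rather than leaking into the rest of the translation unit.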
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-650865605
Closing this one as a reference for the intrinsics approach. I will try to work out a new approach without intrinsics.
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643950633
@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark
----------------------------------------------------------------
[GitHub] [arrow] github-actions[bot] commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
github-actions[bot] commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-636673371
https://issues.apache.org/jira/browse/ARROW-8996
----------------------------------------------------------------
[GitHub] [arrow] pitrou commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
pitrou commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639445690
Certainly someone (or several people) would have to evaluate those libraries before deciding whether to integrate one. Preferably someone with an interest in SIMD optimizations (for example you and/or @cyb70289).
But saying we don't want to use a helper library because the code might be suboptimal performance-wise and it's better to rewrite it from scratch doesn't sound like good engineering practice to me. Daily Arrow maintenance is done by a small set of developers and we should refrain from creating too much burden.
----------------------------------------------------------------
[GitHub] [arrow] wesm commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
wesm commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-638972317
I agree that it becomes problematic (from a code size, testing, and maintenance perspective) to have SSE/AVX versions of every kernel. Having optimized versions of some important kernels that are known to use a lot of CPU cycles in applications is not a bad idea, though.
SIMD versions of kernels could also be moved to a "plugin" shared library so that they can be packaged separately.
----------------------------------------------------------------
[GitHub] [arrow] pitrou commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
pitrou commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-641136788
> Just found a document (https://dl.acm.org/doi/pdf/10.1145/3178433.3178435, page 7, Table 6: Comparison of various SIMD wrappers). Some SIMD helpers (simdpp/xsimd) have performance issues on at least some workloads.
Interesting comparison, thank you. But that is a single workload and we don't know what the issue was precisely.
By the way, this is what the paper says about the approach you're advocating:
> The intrinsics or assembly approaches are non-portable, low-level solutions which target specific architectures. They offer maximum control to take advantage of instruction set specificities, and to fine tune register usage. However, it is quite difficult to develop and maintain a low-level code in the long run.
----------------------------------------------------------------
[GitHub] [arrow] jedbrown commented on a change in pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
jedbrown commented on a change in pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#discussion_r437844193
##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_sse.cc
##########
@@ -0,0 +1,355 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_aggregate.h"
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+#include "arrow/compute/kernels/aggregate_internal.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/align_util.h"
+#include "arrow/util/simd.h"
+
+TARGET_CODE_START_SSE4_2
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+// ----------------------------------------------------------------------
+// Sum implementation for SSE
+
+// Each m128 stream handles 2 double/int64 accumulator values; one batch has 4 streams.
+static constexpr int kSseBatchStreams = 4;
+static constexpr int kSseStreamSize = sizeof(__m128d) / sizeof(double);
+static constexpr int kSseBatchSize = kSseBatchStreams * kSseStreamSize;
+
+// Default scalar version
+template <typename T, typename SumT>
+inline SumResult<SumT> SumDenseBatchSse(const T* values, int64_t num_batch) {
+ SumResult<SumT> sum_result;
+ SumT sum_streams[kSseBatchSize] = {0};
+
+ // Add the results by streams
+ for (int64_t batch = 0; batch < num_batch; batch++) {
+ for (int i = 0; i < kSseBatchSize; i++) {
+ sum_streams[i] += values[(batch * kSseBatchSize) + i];
+ }
+ }
+
+ // Aggregate the result streams
+ for (int i = 0; i < kSseBatchSize; i++) {
+ sum_result.sum += sum_streams[i];
+ }
+ sum_result.count = num_batch * kSseBatchSize;
+ return sum_result;
+}
+
+// Dense helper for when the accumulator type is the same as the data type
+#define SUM_DENSE_BATCH_SSE_DIRECT(Type, SumSimdType, SimdZeroFn, SimdLoadFn, SimdAddFn) \
+ template <> \
+ inline SumResult<Type> SumDenseBatchSse(const Type* values, int64_t num_batch) { \
+ SumResult<Type> sum_result; \
+ SumSimdType results_simd[kSseBatchStreams]; \
+ for (int i = 0; i < kSseBatchStreams; i++) { \
+ results_simd[i] = SimdZeroFn(); \
+ } \
+ \
+ /* Add the values to result streams */ \
+ for (int64_t batch = 0; batch < num_batch; batch++) { \
+ for (int i = 0; i < kSseBatchStreams; i++) { \
+ const auto src_simd = \
+ SimdLoadFn(&values[batch * kSseBatchSize + kSseStreamSize * i]); \
+ results_simd[i] = SimdAddFn(src_simd, results_simd[i]); \
+ } \
+ } \
Review comment:
You can use GCC pragmas to force more unrolling, but AVX512 vectorization gets you a ways despite the adder instruction latency. (My experience has been that GCC does an excellent job with intrinsics, often better than ICC, and tends to favor smaller code at comparable optimization levels. Clang tends to be more aggressive than GCC at unrolling, though I've often seen that not pay off.)
The masked case is harder and I'm not aware of an idiomatic way to get GCC to vectorize it. We get some masked vectorized code with clang, though it isn't as clean as I'd hope for.
```c
double sum1(int64_t n, const unsigned char *mask, const double *values) {
double sum = 0;
#pragma omp simd reduction(+:sum) collapse(2)
for (int64_t i = 0; i < n/8; i++) {
for (int64_t j = 0; j < 8; j++) {
sum += mask[i] & (1 << j) ? values[i*8+j] : 0;
}
}
return sum;
}
```
https://gcc.godbolt.org/z/84anmr
----------------------------------------------------------------
[GitHub] [arrow] wesm commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
wesm commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637218125
I'll try to review this in the next couple days
----------------------------------------------------------------
[GitHub] [arrow] jianxind removed a comment on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
jianxind removed a comment on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-636677517
Benchmark data:
Before:
```
SumKernelFloat/32768/0 2.96 us 2.96 us 236912 bytes_per_second=10.3227G/s null_percent=0 size=32.768k
SumKernelFloat/32768/1 4.88 us 4.88 us 143527 bytes_per_second=6.25439G/s null_percent=1 size=32.768k
SumKernelFloat/32768/10 5.13 us 5.13 us 136839 bytes_per_second=5.95117G/s null_percent=10 size=32.768k
SumKernelFloat/32768/50 7.82 us 7.81 us 87129 bytes_per_second=3.9054G/s null_percent=50 size=32.768k
SumKernelDouble/32768/0 1.97 us 1.97 us 356786 bytes_per_second=15.4906G/s null_percent=0 size=32.768k
SumKernelDouble/32768/1 2.11 us 2.11 us 331511 bytes_per_second=14.4975G/s null_percent=1 size=32.768k
SumKernelDouble/32768/10 2.39 us 2.38 us 291292 bytes_per_second=12.7966G/s null_percent=10 size=32.768k
SumKernelDouble/32768/50 2.60 us 2.60 us 268800 bytes_per_second=11.7462G/s null_percent=50 size=32.768k
SumKernelInt8/32768/0 11.7 us 11.7 us 59926 bytes_per_second=2.61569G/s null_percent=0 size=32.768k
SumKernelInt8/32768/1 11.0 us 10.9 us 63640 bytes_per_second=2.78831G/s null_percent=1 size=32.768k
SumKernelInt8/32768/10 14.8 us 14.8 us 46573 bytes_per_second=2.05848G/s null_percent=10 size=32.768k
SumKernelInt8/32768/50 14.6 us 14.6 us 47840 bytes_per_second=2.08905G/s null_percent=50 size=32.768k
SumKernelInt16/32768/0 7.06 us 7.06 us 99354 bytes_per_second=4.3245G/s null_percent=0 size=32.768k
SumKernelInt16/32768/1 4.76 us 4.75 us 147305 bytes_per_second=6.41928G/s null_percent=1 size=32.768k
SumKernelInt16/32768/10 5.64 us 5.63 us 122737 bytes_per_second=5.42002G/s null_percent=10 size=32.768k
SumKernelInt16/32768/50 6.71 us 6.70 us 104192 bytes_per_second=4.55206G/s null_percent=50 size=32.768k
SumKernelInt32/32768/0 3.92 us 3.92 us 178798 bytes_per_second=7.79042G/s null_percent=0 size=32.768k
SumKernelInt32/32768/1 3.27 us 3.27 us 214296 bytes_per_second=9.332G/s null_percent=1 size=32.768k
SumKernelInt32/32768/10 3.41 us 3.40 us 204944 bytes_per_second=8.9683G/s null_percent=10 size=32.768k
SumKernelInt32/32768/50 3.69 us 3.69 us 190248 bytes_per_second=8.27705G/s null_percent=50 size=32.768k
SumKernelInt64/32768/0 1.92 us 1.91 us 368662 bytes_per_second=15.9508G/s null_percent=0 size=32.768k
SumKernelInt64/32768/1 2.05 us 2.05 us 340168 bytes_per_second=14.8684G/s null_percent=1 size=32.768k
SumKernelInt64/32768/10 2.16 us 2.16 us 323585 bytes_per_second=14.1164G/s null_percent=10 size=32.768k
SumKernelInt64/32768/50 2.41 us 2.41 us 291073 bytes_per_second=12.6873G/s null_percent=50 size=32.768k
```
After:
```
SumKernelFloat/32768/0 2.27 us 2.27 us 307928 bytes_per_second=13.438G/s null_percent=0 size=32.768k
SumKernelFloat/32768/1 4.59 us 4.59 us 152827 bytes_per_second=6.6508G/s null_percent=1 size=32.768k
SumKernelFloat/32768/10 5.30 us 5.29 us 132106 bytes_per_second=5.76658G/s null_percent=10 size=32.768k
SumKernelFloat/32768/50 5.80 us 5.80 us 114378 bytes_per_second=5.26584G/s null_percent=50 size=32.768k
SumKernelDouble/32768/0 1.42 us 1.42 us 494426 bytes_per_second=21.5265G/s null_percent=0 size=32.768k
SumKernelDouble/32768/1 2.12 us 2.12 us 330890 bytes_per_second=14.4268G/s null_percent=1 size=32.768k
SumKernelDouble/32768/10 2.44 us 2.43 us 286310 bytes_per_second=12.5441G/s null_percent=10 size=32.768k
SumKernelDouble/32768/50 2.72 us 2.71 us 257105 bytes_per_second=11.2507G/s null_percent=50 size=32.768k
SumKernelInt8/32768/0 5.35 us 5.34 us 130751 bytes_per_second=5.71315G/s null_percent=0 size=32.768k
SumKernelInt8/32768/1 9.80 us 9.79 us 71384 bytes_per_second=3.11589G/s null_percent=1 size=32.768k
SumKernelInt8/32768/10 13.9 us 13.9 us 49729 bytes_per_second=2.19116G/s null_percent=10 size=32.768k
SumKernelInt8/32768/50 12.5 us 12.5 us 55929 bytes_per_second=2.43479G/s null_percent=50 size=32.768k
SumKernelInt16/32768/0 3.20 us 3.19 us 218923 bytes_per_second=9.55594G/s null_percent=0 size=32.768k
SumKernelInt16/32768/1 5.31 us 5.31 us 131394 bytes_per_second=5.75174G/s null_percent=1 size=32.768k
SumKernelInt16/32768/10 6.20 us 6.19 us 113037 bytes_per_second=4.92965G/s null_percent=10 size=32.768k
SumKernelInt16/32768/50 7.25 us 7.24 us 96604 bytes_per_second=4.21535G/s null_percent=50 size=32.768k
SumKernelInt32/32768/0 2.18 us 2.18 us 321572 bytes_per_second=14.0037G/s null_percent=0 size=32.768k
SumKernelInt32/32768/1 3.32 us 3.32 us 209911 bytes_per_second=9.18857G/s null_percent=1 size=32.768k
SumKernelInt32/32768/10 3.59 us 3.58 us 195106 bytes_per_second=8.51472G/s null_percent=10 size=32.768k
SumKernelInt32/32768/50 3.83 us 3.82 us 182739 bytes_per_second=7.98056G/s null_percent=50 size=32.768k
SumKernelInt64/32768/0 1.37 us 1.37 us 514237 bytes_per_second=22.3564G/s null_percent=0 size=32.768k
SumKernelInt64/32768/1 2.09 us 2.09 us 333678 bytes_per_second=14.5962G/s null_percent=1 size=32.768k
SumKernelInt64/32768/10 2.18 us 2.18 us 320094 bytes_per_second=13.9904G/s null_percent=10 size=32.768k
SumKernelInt64/32768/50 2.41 us 2.40 us 289766 bytes_per_second=12.6907G/s null_percent=50 size=32.768k
```
The dense path improves for all data types, e.g. Double jumps from 15.4906G/s to 21.5265G/s.
I will look into the sparse parts later, as they need an additional step to remove the invalid values before passing data to the SIMD add operations; some shuffle op is needed to replace the invalid values with zero.
The dense part can also be sped up further using AVX2/AVX512, which is also a later job.
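On SSE, that zeroing step could be as simple as a bitwise select per lane. A minimal sketch, assuming a per-lane all-ones/all-zeros mask has already been expanded from the validity bitmap (illustrative only, not the code in this PR):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

#include <cstdint>

// Zero out invalid lanes before the SIMD add so nulls never reach the
// accumulator. lane_mask holds all-ones bits for valid lanes and all-zeros
// bits for null lanes (expanded beforehand from the validity bitmap).
static inline __m128d MaskedAdd(__m128d acc, __m128d src, __m128d lane_mask) {
  return _mm_add_pd(acc, _mm_and_pd(src, lane_mask));
}
```

Adding zero leaves the accumulator unchanged for null lanes, so the add itself needs no branch.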
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645130206
@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark
----------------------------------------------------------------
SumKernelFloat/32768/10 5.30 us 5.29 us 132106 bytes_per_second=5.76658G/s null_percent=10 size=32.768k
SumKernelFloat/32768/50 5.80 us 5.80 us 114378 bytes_per_second=5.26584G/s null_percent=50 size=32.768k
SumKernelDouble/32768/0 1.42 us 1.42 us 494426 bytes_per_second=21.5265G/s null_percent=0 size=32.768k
SumKernelDouble/32768/1 2.12 us 2.12 us 330890 bytes_per_second=14.4268G/s null_percent=1 size=32.768k
SumKernelDouble/32768/10 2.44 us 2.43 us 286310 bytes_per_second=12.5441G/s null_percent=10 size=32.768k
SumKernelDouble/32768/50 2.72 us 2.71 us 257105 bytes_per_second=11.2507G/s null_percent=50 size=32.768k
SumKernelInt8/32768/0 5.35 us 5.34 us 130751 bytes_per_second=5.71315G/s null_percent=0 size=32.768k
SumKernelInt8/32768/1 9.80 us 9.79 us 71384 bytes_per_second=3.11589G/s null_percent=1 size=32.768k
SumKernelInt8/32768/10 13.9 us 13.9 us 49729 bytes_per_second=2.19116G/s null_percent=10 size=32.768k
SumKernelInt8/32768/50 12.5 us 12.5 us 55929 bytes_per_second=2.43479G/s null_percent=50 size=32.768k
SumKernelInt16/32768/0 3.20 us 3.19 us 218923 bytes_per_second=9.55594G/s null_percent=0 size=32.768k
SumKernelInt16/32768/1 5.31 us 5.31 us 131394 bytes_per_second=5.75174G/s null_percent=1 size=32.768k
SumKernelInt16/32768/10 6.20 us 6.19 us 113037 bytes_per_second=4.92965G/s null_percent=10 size=32.768k
SumKernelInt16/32768/50 7.25 us 7.24 us 96604 bytes_per_second=4.21535G/s null_percent=50 size=32.768k
SumKernelInt32/32768/0 2.18 us 2.18 us 321572 bytes_per_second=14.0037G/s null_percent=0 size=32.768k
SumKernelInt32/32768/1 3.32 us 3.32 us 209911 bytes_per_second=9.18857G/s null_percent=1 size=32.768k
SumKernelInt32/32768/10 3.59 us 3.58 us 195106 bytes_per_second=8.51472G/s null_percent=10 size=32.768k
SumKernelInt32/32768/50 3.83 us 3.82 us 182739 bytes_per_second=7.98056G/s null_percent=50 size=32.768k
SumKernelInt64/32768/0 1.37 us 1.37 us 514237 bytes_per_second=22.3564G/s null_percent=0 size=32.768k
SumKernelInt64/32768/1 2.09 us 2.09 us 333678 bytes_per_second=14.5962G/s null_percent=1 size=32.768k
SumKernelInt64/32768/10 2.18 us 2.18 us 320094 bytes_per_second=13.9904G/s null_percent=10 size=32.768k
SumKernelInt64/32768/50 2.41 us 2.40 us 289766 bytes_per_second=12.6907G/s null_percent=50 size=32.768k
```
The dense path for all data types shows some improvement, e.g. Double jumps from 15.4906G/s to 21.5265G/s.
I will look into the sparse part later, as it needs an additional step to remove the invalid values before they are passed to the SIMD add operations; a shuffle op is needed to replace the invalid values with zero.
The dense part can also be sped up further with AVX2/AVX512, which is a later job as well.
----------------------------------------------------------------
[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639009091
```
no such option: --cc
```
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645294042
The numbers for the sparse part now look good on the benchmark device too, even though my local device gets an even better speedup thanks to its bigger L2/L3 caches. I'm not sure whether this is a good direction, though, considering the compiler can do less on its own with the bitmap-based path.
@emkornfield And others, thoughts?
----------------------------------------------------------------
[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637216678
[AMD64 Ubuntu 18.04 C++ Benchmark (#108899)](https://ci.ursalabs.org/#builders/73/builds/70) builder has succeeded.
Revision: d789801a5e83b9717b8f36d72189198f5527b757
```diff
======================== ============== ============== ========
benchmark baseline contender change
======================== ============== ============== ========
SumKernelInt64/32768/1 10.639 GiB/sec 10.196 GiB/sec -4.168%
SumKernelInt8/32768/10 1.490 GiB/sec 1.445 GiB/sec -3.007%
SumKernelFloat/32768/0 6.407 GiB/sec 8.436 GiB/sec 31.670%
SumKernelInt32/32768/0 5.307 GiB/sec 9.227 GiB/sec 73.863%
SumKernelDouble/32768/10 8.489 GiB/sec 8.436 GiB/sec -0.616%
SumKernelInt32/32768/50 5.451 GiB/sec 5.242 GiB/sec -3.841%
SumKernelDouble/32768/1 9.690 GiB/sec 9.568 GiB/sec -1.255%
SumKernelInt8/32768/0 1.782 GiB/sec 3.726 GiB/sec 109.147%
- SumKernelFloat/32768/1 4.899 GiB/sec 4.583 GiB/sec -6.447%
- SumKernelInt8/32768/1 2.111 GiB/sec 1.917 GiB/sec -9.222%
SumKernelInt32/32768/1 7.389 GiB/sec 7.281 GiB/sec -1.470%
- SumKernelInt16/32768/1 4.350 GiB/sec 3.855 GiB/sec -11.363%
SumKernelDouble/32768/0 10.647 GiB/sec 14.048 GiB/sec 31.949%
SumKernelInt8/32768/50 1.641 GiB/sec 1.580 GiB/sec -3.726%
SumKernelFloat/32768/50 2.518 GiB/sec 2.545 GiB/sec 1.052%
SumKernelInt16/32768/0 2.953 GiB/sec 6.038 GiB/sec 104.455%
SumKernelInt64/32768/50 8.469 GiB/sec 8.439 GiB/sec -0.349%
SumKernelInt64/32768/0 11.243 GiB/sec 15.118 GiB/sec 34.462%
SumKernelInt16/32768/50 3.070 GiB/sec 2.960 GiB/sec -3.604%
SumKernelInt64/32768/10 9.242 GiB/sec 9.336 GiB/sec 1.020%
- SumKernelInt16/32768/10 3.638 GiB/sec 3.390 GiB/sec -6.834%
SumKernelInt32/32768/10 6.253 GiB/sec 6.012 GiB/sec -3.844%
SumKernelFloat/32768/10 4.111 GiB/sec 3.935 GiB/sec -4.279%
SumKernelDouble/32768/50 7.446 GiB/sec 7.455 GiB/sec 0.133%
======================== ============== ============== ========
```
----------------------------------------------------------------
[GitHub] [arrow] pitrou commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-638970620
This looks generally quite complicated. If we need 500 additional lines of code to micro-optimize the Sum kernel for a single SIMD instruction set (nevermind that we may also want versions for AVX2, Neon, SVE, and whatnot), things will quickly get out of hand.
If we want to go the way of per-kernel SIMD optimizations, it may be useful to investigate SIMD helper libraries (such as [libsimdpp](https://github.com/p12tic/libsimdpp), [xsimd](https://xsimd.readthedocs.io/en/latest/)...).
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-641109277
I just found a document (https://dl.acm.org/doi/pdf/10.1145/3178433.3178435, page 7, Table 6: Comparison of various SIMD wrappers). Some SIMD helpers (simdpp/xsimd) have performance issues, at least on some workloads.
Another thing is that most SIMD helpers have no runtime dispatch support, which means we would still have to build the same code (assuming we can find a common code path for one function) multiple times in Arrow itself to get the runtime capability.
I have also been working on the sparse part of aggregate sum recently; the data flow is totally different between AVX2 and AVX512. AVX512 has _mm512_mask_add_pd (__m512d src, __mmask8 k, __m512d a, __m512d b), so it can SIMD-add the results directly under the validity bitmap. AVX2 has to use a lookup-table mask with a SIMD AND operation to zero the invalid values before passing them to the SIMD add. The same difference will apply to other future SIMD functions as well, since all Arrow data carries a validity bitmap.
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on a change in pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on a change in pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#discussion_r437804878
##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_sse.cc
##########
@@ -0,0 +1,355 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_aggregate.h"
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+#include "arrow/compute/kernels/aggregate_internal.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/align_util.h"
+#include "arrow/util/simd.h"
+
+TARGET_CODE_START_SSE4_2
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+// ----------------------------------------------------------------------
+// Sum implementation for SSE
+
+// Each m128 stream handle 2 double/int64 accumulator type, one batch has 4 streams.
+static constexpr int kSseBatchStreams = 4;
+static constexpr int kSseStreamSize = sizeof(__m128d) / sizeof(double);
+static constexpr int kSseBatchSize = kSseBatchStreams * kSseStreamSize;
+
+// Default scalar version
+template <typename T, typename SumT>
+inline SumResult<SumT> SumDenseBatchSse(const T* values, int64_t num_batch) {
+ SumResult<SumT> sum_result;
+ SumT sum_streams[kSseBatchSize] = {0};
+
+ // Add the results by streams
+ for (int64_t batch = 0; batch < num_batch; batch++) {
+ for (int i = 0; i < kSseBatchSize; i++) {
+ sum_streams[i] += values[(batch * kSseBatchSize) + i];
+ }
+ }
+
+ // Aggregate the result streams
+ for (int i = 0; i < kSseBatchSize; i++) {
+ sum_result.sum += sum_streams[i];
+ }
+ sum_result.count = num_batch * kSseBatchSize;
+ return sum_result;
+}
+
+// Dense helper for accumulator type is same to data type
+#define SUM_DENSE_BATCH_SSE_DIRECT(Type, SumSimdType, SimdZeroFn, SimdLoadFn, SimdAddFn) \
+ template <> \
+ inline SumResult<Type> SumDenseBatchSse(const Type* values, int64_t num_batch) { \
+ SumResult<Type> sum_result; \
+ SumSimdType results_simd[kSseBatchStreams]; \
+ for (int i = 0; i < kSseBatchStreams; i++) { \
+ results_simd[i] = SimdZeroFn(); \
+ } \
+ \
+ /* Add the values to result streams */ \
+ for (int64_t batch = 0; batch < num_batch; batch++) { \
+ for (int i = 0; i < kSseBatchStreams; i++) { \
+ const auto src_simd = \
+ SimdLoadFn(&values[batch * kSseBatchSize + kSseStreamSize * i]); \
+ results_simd[i] = SimdAddFn(src_simd, results_simd[i]); \
+ } \
+ } \
Review comment:
Thanks, the vectorized code from clang is really good. But it seems the GCC output only uses zmm0, which may limit performance. This PR only targets the dense part; the final goal is to expand the support to the sparse part as well. Do you know whether a similar approach can be applied to the sparse case? Example scalar code below:
```c
inline T MaskedValue(bool valid, T value) const { return valid ? value : 0; }

for (size_t i = 0; i < 8; i++) {
  local.sum += MaskedValue(bits & (1U << i), values[i]);
}
local.count += BitUtil::kBytePopcount[bits];
```
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-643106609
@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637212463
@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639285090
> This looks generally quite complicated. If we need 500 additional lines of code to micro-optimize the Sum kernel for a single SIMD instruction set (nevermind that we may also want versions for AVX2, Neon, SVE, and whatnot), things will quickly get out of hand.
>
> If we want to go the way of per-kernel SIMD optimizations, it may be useful to investigate SIMD helper libraries (such as [libsimdpp](https://github.com/p12tic/libsimdpp), [xsimd](https://xsimd.readthedocs.io/en/latest/)...).
libsimdpp claims it supports runtime dispatch on a per-function basis. But I don't know whether the quality of these libraries is good enough and whether they have been qualified against every spec; any bug in them would make debugging much harder for me. Another concern is performance: will the wrapper introduce extra cost? SIMD code and its parameters are usually tuned carefully for each architecture to get the best performance, so I doubt there is common code that fits every target.
----------------------------------------------------------------
[GitHub] [arrow] jianxind commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jianxind commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645246111
@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark
----------------------------------------------------------------
[GitHub] [arrow] fsaintjacques commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-639009079
@ursabot benchmark --suite-filter=arrow-compute-aggregate-benchmark --cc=clang --cxx=clang++
----------------------------------------------------------------
[GitHub] [arrow] wesm edited a comment on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-637218125
I'll try to review this in the next couple days. @pitrou may be able to help also
----------------------------------------------------------------
[GitHub] [arrow] jedbrown commented on a change in pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
jedbrown commented on a change in pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#discussion_r437479652
##########
File path: cpp/src/arrow/compute/kernels/aggregate_basic_sse.cc
##########
@@ -0,0 +1,355 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/api_aggregate.h"
+#include "arrow/compute/kernels/aggregate_basic_internal.h"
+#include "arrow/compute/kernels/aggregate_internal.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/util/align_util.h"
+#include "arrow/util/simd.h"
+
+TARGET_CODE_START_SSE4_2
+namespace arrow {
+namespace compute {
+namespace aggregate {
+
+// ----------------------------------------------------------------------
+// Sum implementation for SSE
+
+// Each m128 stream handle 2 double/int64 accumulator type, one batch has 4 streams.
+static constexpr int kSseBatchStreams = 4;
+static constexpr int kSseStreamSize = sizeof(__m128d) / sizeof(double);
+static constexpr int kSseBatchSize = kSseBatchStreams * kSseStreamSize;
+
+// Default scalar version
+template <typename T, typename SumT>
+inline SumResult<SumT> SumDenseBatchSse(const T* values, int64_t num_batch) {
+ SumResult<SumT> sum_result;
+ SumT sum_streams[kSseBatchSize] = {0};
+
+ // Add the results by streams
+ for (int64_t batch = 0; batch < num_batch; batch++) {
+ for (int i = 0; i < kSseBatchSize; i++) {
+ sum_streams[i] += values[(batch * kSseBatchSize) + i];
+ }
+ }
+
+ // Aggregate the result streams
+ for (int i = 0; i < kSseBatchSize; i++) {
+ sum_result.sum += sum_streams[i];
+ }
+ sum_result.count = num_batch * kSseBatchSize;
+ return sum_result;
+}
+
+// Dense helper for accumulator type is same to data type
+#define SUM_DENSE_BATCH_SSE_DIRECT(Type, SumSimdType, SimdZeroFn, SimdLoadFn, SimdAddFn) \
+ template <> \
+ inline SumResult<Type> SumDenseBatchSse(const Type* values, int64_t num_batch) { \
+ SumResult<Type> sum_result; \
+ SumSimdType results_simd[kSseBatchStreams]; \
+ for (int i = 0; i < kSseBatchStreams; i++) { \
+ results_simd[i] = SimdZeroFn(); \
+ } \
+ \
+ /* Add the values to result streams */ \
+ for (int64_t batch = 0; batch < num_batch; batch++) { \
+ for (int i = 0; i < kSseBatchStreams; i++) { \
+ const auto src_simd = \
+ SimdLoadFn(&values[batch * kSseBatchSize + kSseStreamSize * i]); \
+ results_simd[i] = SimdAddFn(src_simd, results_simd[i]); \
+ } \
+ } \
Review comment:
Have you considered using OpenMP SIMD (i.e., `-fopenmp-simd`, which does not do anything with threads or require linking OpenMP)?
```c
double sum1(int64_t n, const double *values) {
double sum = 0;
#pragma omp simd reduction(+:sum)
for (int64_t i = 0; i < n; i++)
sum += values[i];
return sum;
}
```
produces good vectorized code (probably excessively unrolled by clang)
```
vaddpd zmm0, zmm0, zmmword ptr [rsi + 8*rdx]
vaddpd zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 64]
vaddpd zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 128]
vaddpd zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 192]
vaddpd zmm0, zmm0, zmmword ptr [rsi + 8*rdx + 256]
vaddpd zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 320]
vaddpd zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 384]
vaddpd zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 448]
vaddpd zmm0, zmm0, zmmword ptr [rsi + 8*rdx + 512]
vaddpd zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 576]
vaddpd zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 640]
vaddpd zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 704]
vaddpd zmm0, zmm0, zmmword ptr [rsi + 8*rdx + 768]
vaddpd zmm1, zmm1, zmmword ptr [rsi + 8*rdx + 832]
vaddpd zmm2, zmm2, zmmword ptr [rsi + 8*rdx + 896]
vaddpd zmm3, zmm3, zmmword ptr [rsi + 8*rdx + 960]
.loc 1 6 5 # ./example.c:6:5
sub rdx, -128
add rcx, -4
jne .LBB0_7
```
https://gcc.godbolt.org/z/6mD6oy
----------------------------------------------------------------
[GitHub] [arrow] wesm commented on pull request #7314: ARROW-8996: [C++] SSE runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-638885045
I'll look at this today
----------------------------------------------------------------
[GitHub] [arrow] ursabot commented on pull request #7314: ARROW-8996: [C++] runtime support for aggregate sum dense kernel
Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-645256638
[AMD64 Ubuntu 18.04 C++ Benchmark (#112803)](https://ci.ursalabs.org/#builders/73/builds/80) builder has succeeded.
Revision: 665c6cdb30f0ec372fa71de3e47bd5951183fe7d
```diff
=========================== ============= ============== ========
benchmark baseline contender change
=========================== ============= ============== ========
SumKernelDouble/32768/2 6.910 GiB/sec 10.005 GiB/sec 44.801%
SumKernelInt16/32768/2 2.952 GiB/sec 6.441 GiB/sec 118.182%
SumKernelInt32/32768/2 5.090 GiB/sec 7.097 GiB/sec 39.433%
SumKernelDouble/32768/0 9.551 GiB/sec 13.401 GiB/sec 40.317%
SumKernelDouble/32768/10000 9.572 GiB/sec 13.236 GiB/sec 38.276%
SumKernelInt64/32768/1 7.936 GiB/sec 10.750 GiB/sec 35.453%
SumKernelInt16/32768/10 3.493 GiB/sec 6.395 GiB/sec 83.081%
SumKernelFloat/32768/0 5.966 GiB/sec 9.990 GiB/sec 67.445%
SumKernelInt8/32768/1 1.641 GiB/sec 2.732 GiB/sec 66.423%
SumKernelInt64/32768/0 9.878 GiB/sec 14.459 GiB/sec 46.376%
SumKernelFloat/32768/100 4.616 GiB/sec 6.404 GiB/sec 38.746%
SumKernelInt64/32768/10000 9.829 GiB/sec 14.577 GiB/sec 48.310%
SumKernelInt16/32768/100 4.090 GiB/sec 6.389 GiB/sec 56.214%
SumKernelDouble/32768/10 7.794 GiB/sec 9.882 GiB/sec 26.798%
SumKernelInt16/32768/0 2.845 GiB/sec 7.555 GiB/sec 165.554%
SumKernelFloat/32768/10 3.928 GiB/sec 6.381 GiB/sec 62.441%
SumKernelFloat/32768/10000 5.946 GiB/sec 9.947 GiB/sec 67.286%
SumKernelInt32/32768/1 5.208 GiB/sec 7.282 GiB/sec 39.809%
SumKernelInt8/32768/100 2.252 GiB/sec 2.654 GiB/sec 17.841%
SumKernelInt8/32768/2 1.607 GiB/sec 2.656 GiB/sec 65.290%
SumKernelInt32/32768/10 5.778 GiB/sec 7.096 GiB/sec 22.813%
SumKernelInt32/32768/100 6.732 GiB/sec 7.098 GiB/sec 5.427%
SumKernelFloat/32768/2 2.386 GiB/sec 6.396 GiB/sec 168.068%
SumKernelInt8/32768/10000 2.586 GiB/sec 2.652 GiB/sec 2.561%
SumKernelInt64/32768/100 9.319 GiB/sec 10.627 GiB/sec 14.034%
SumKernelDouble/32768/100 8.789 GiB/sec 9.860 GiB/sec 12.180%
SumKernelInt16/32768/10000 4.519 GiB/sec 6.381 GiB/sec 41.217%
SumKernelInt8/32768/0 1.745 GiB/sec 4.701 GiB/sec 169.458%
SumKernelInt32/32768/10000 4.953 GiB/sec 10.825 GiB/sec 118.563%
SumKernelInt64/32768/10 8.571 GiB/sec 10.629 GiB/sec 24.013%
SumKernelInt8/32768/10 1.687 GiB/sec 2.656 GiB/sec 57.429%
SumKernelInt64/32768/2 7.717 GiB/sec 10.658 GiB/sec 38.109%
SumKernelInt32/32768/0 4.961 GiB/sec 10.791 GiB/sec 117.521%
SumKernelDouble/32768/1 7.669 GiB/sec 10.079 GiB/sec 31.424%
SumKernelInt16/32768/1 3.018 GiB/sec 6.589 GiB/sec 118.362%
SumKernelFloat/32768/1 3.928 GiB/sec 6.600 GiB/sec 68.035%
=========================== ============= ============== ========
```
----------------------------------------------------------------
[GitHub] [arrow] jianxind closed pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel
Posted by GitBox <gi...@apache.org>.
jianxind closed pull request #7314:
URL: https://github.com/apache/arrow/pull/7314
----------------------------------------------------------------
[GitHub] [arrow] emkornfield commented on pull request #7314: ARROW-8996: [C++] AVX2/AVX512 runtime support for aggregate sum kernel
Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #7314:
URL: https://github.com/apache/arrow/pull/7314#issuecomment-647228634
I think we also need a way of setting the maximum instruction set used for runtime dispatch (apologies if there is one and I missed it).
----------------------------------------------------------------