You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2023/09/12 15:18:47 UTC

[GitHub] [arrow] pitrou opened a new issue, #37678: [C++][Compute] Checked arithmetic functions are slow-ish

pitrou opened a new issue, #37678:
URL: https://github.com/apache/arrow/issues/37678

   ### Describe the enhancement requested
   
   Users should generally prefer using the checked versions of arithmetic functions such as "add", because unchecked arithmetic is inherently unsafe and can silently corrupt data (MySQL is commonly hated for this). However, our checked arithmetic functions are quite slower than unchecked:
   
   For example, "add" vs. "add_checked":
   ```console
   $ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add(a, a)"
   2000 loops, best of 5: 128 usec per loop
   $ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add_checked(a, a)"
   200 loops, best of 5: 1.92 msec per loop
   ```
   ... that's 7.8 Gitems/sec vs. 0.5 Gitems/sec., or a 15x difference.
   
   "sum" vs. "sum_checked" shows a similar problem in PR #37536.
   
   Implementation-wise, "add_checked" is currently simple and naïve: it adds and detects overflow on each pair of values, one by one. This adds many conditional branches and probably stifles parallelism on the CPU, as well as auto-vectorization on the compiler.
   
   A potentially better implementation would be to operate on K-sized batches (with a K such as 256), in two steps:
   1. detect potential overflow over all K pairs of values
   2. if no overflow was detected, add all K pairs or values
   
   Each step is then unconditional and a reasonable candidate for auto-vectorization.
   
   Of course, we can also consider that 0.5 Gitems/sec is good enough and unlikely to be a bottleneck for non-trivial workloads.
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "cyb70289 (via GitHub)" <gi...@apache.org>.
cyb70289 commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1717795510

   Did a quick test, looks the gap should be smaller.
   https://quick-bench.com/q/ywSl6NYEUc2KYq4h6MkP7t3yWqo
   
   The checked version uses gcc `__builtin_*_overflow`, same as [safe-math](https://github.com/nemequ/portable-snippets/tree/master/safe-math).
   From disassembly, the uncheck version is vectorized.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1715974715

   I think it would be interesting to see just how much faster working on the K-size batches would be and see if it's worthwhile. This would also be potentially a worthwhile enhancement for the Go side too which works the same way


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1719404981

   > I wrote a simple kernel exec that checks the nullity only when overflow occurs:
   
   This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1718829195

   `Add` doesn't check nulls and simply computes all pairs, null or not, while `AddChecked` only computes the valid pairs. (Code is [here](https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_arithmetic.cc#L1323), it uses MakeArithmeticFunction**NotNull** while `Add` uses MakeArithmeticFunction).  I guess it was assumed that `__builtin_*_overflow` is an expensive operation and should be avoided as much as possible. However, I ran some benchmarks and the results don't agree with this assumption.
   With null checking as it currently is, `AddChecked` is 15\~40x slower and the benefits of null checks only happen when >95% values are null, which I'd say is a rare case.
   ![image](https://github.com/apache/arrow/assets/14072670/0aeeafef-c171-4d6f-8dde-7b11f58386d9)
   Without null checking, i.e. also using `MakeArithmeticFunction` for `AddChecked`, it is only 2.5 times slower, no matter the null proportion of the input:
   ![image](https://github.com/apache/arrow/assets/14072670/f8e54d5c-446f-465a-9246-e44ba815499a)
   
   This explains the difference between @pitrou and @cyb70289's results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1717855783

   Does python introduce some overhead on this? It seems that gap on the pure C++ side (from @cyb70289) is smaller.
   
   > By the way, the benchmarks are posted don't involve any nulls, so the real-world differences might be more nuanced.
   
   Null checks can be on top of the batch arithmetic-safe functions provided the values do not overflow. This can help avoid some unnecessary conditions.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1715928956

   @westonpace @felipecrv Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1719364274

   I wrote a simple kernel exec that checks the nullity only when overflow occurs:
   ```cpp
   Status AddCheckedExec(KernelContext* ctx, const ExecSpan& span, ExecResult* result) {
     int64_t length = span.length;
     auto left_arr = span[0].array;
     auto right_arr = span[1].array;
     auto* out_span = result->array_span_mutable();
   
     auto* left_it = left_arr.GetValues<int32_t>(1);
     auto* right_it = right_arr.GetValues<int32_t>(1);
     auto* out_it = out_span->GetValues<int32_t>(1);
     for (int64_t i = 0; i < length; ++i) {
       if (ARROW_PREDICT_FALSE(AddWithOverflow(*left_it++, *right_it++, out_it++))) {
         auto left_valid = left_arr.IsValid(i);
         auto right_valid = right_arr.IsValid(i);
         if (left_valid && right_valid) {
           return Status::Invalid("overflow");
         }
       }
     }
   
     return Status::OK();
   }
   ```
   This achieves similar 2.5x time as the one above.
   
   > 1. detect potential overflow over all K pairs of values
   > 2. if no overflow was detected, add all K pairs or values
   
   I'm not sure how to detect overflow without performing the actual addition. So I implemented the following algo:
   1. Compute all additions and save the overflow status to a vector.
   2. Loop the overflow vector and check for nullity.
   
   ```cpp
   Status AddCheckedExec(KernelContext* ctx, const ExecSpan& span, ExecResult* result) {
     int64_t length = span.length;
     auto left_arr = span[0].array;
     auto right_arr = span[1].array;
     auto* out_span = result->array_span_mutable();
   
     auto* left_it = left_arr.GetValues<int32_t>(1);
     auto* right_it = right_arr.GetValues<int32_t>(1);
     auto* out_it = out_span->GetValues<int32_t>(1);
     std::vector<int> overflow(length, false);
   
     for (int64_t i = 0; i < length; ++i) {
       overflow[i] = AddWithOverflow(*left_it++, *right_it++, out_it++);
     }
   
     for (int64_t i = 0; i < length; ++i) {
       if (ARROW_PREDICT_FALSE(overflow[i])) {
         auto left_valid = left_arr.IsValid(i);
         auto right_valid = right_arr.IsValid(i);
         if (left_valid && right_valid) {
           return Status::Invalid("overflow");
         }
       }
     }
   
     return Status::OK();
   }
   ```
   This is however 5x slower. Writing to `overflow[i]` seems to be the bottleneck here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1719425896

   I definitely agree that passing batches is better than passing single values. A question is how do we handle nulls now? If we still use `VisitTwoBitBlocksVoid` like `ScalarBinaryNotNull` currently does. The number of `if`s will be a real headache.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1719442843

   > This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.
   
   IMHO This approach has the least number of checks so other algorithms will probably fall "flatter" in such cases. The result is ok if `no_overflow || left_is_null || right_is_null`, and overflows are more rare than nulls.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1717857811

   > Does python introduce some overhead on this? It seems that gap on the pure C++ side (from @cyb70289) is smaller.
   
   Python introduces overhead, but that's why I chose a large array size. 
   
   @cyb70289 's implementation is different from our current implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1719387488

   Yes, add-with-overflow is basically a single fast CPU instruction, there's probably no point in trying to avoid it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "cyb70289 (via GitHub)" <gi...@apache.org>.
cyb70289 commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1719631513

   > > I wrote a simple kernel exec that checks the nullity only when overflow occurs:
   > 
   > This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.
   
   Maybe we can mask the value with valid bit to make sure no overflow can happen if either value is invalid?
   Somthing like:
   `ok &=  AddWithOverflow((*left_it++) & (-left_arr.IsValid(i)), (*right_it++) & (-right_arr.IsValid(i)), out_it++);`
   Should have stable performance, but guess this may introduce too many instructions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1717832656

   By the way, the benchmarks are posted don't involve any nulls, so the real-world differences might be more nuanced.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1715930857

   (also @js8544 FYI)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #37678: [C++][Compute] Checked arithmetic functions are slow-ish

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #37678:
URL: https://github.com/apache/arrow/issues/37678#issuecomment-1719401998

   Since `MakeArithmeticFunctionNotNull` appears to generate very suboptimal code, perhaps we should also build a different and more performant abstraction? This would help more cases than just checked arithmetic.
   
   For example, `MakeArithmeticFunctionNotNull`  (or, actually, `ScalarBinaryNotNull`) appears to generate sub-performant kernels, perhaps it should be reworked for better performance?
   
   Basically, instead of declaring a passing `ScalarBinaryNotNull` a scalar operation, we should be able to pass it a batched operation. In other words, instead of:
   ```c++
   struct AddChecked {
     template <typename T, typename Arg0, typename Arg1>
     static Call(KernelContext*, Arg0 left, Arg1 right, Status* st) {
       T result = 0;
       if (ARROW_PREDICT_FALSE(AddWithOverflow(left, right, &result))) {
         *st = Status::Invalid("overflow");
       }
       return result;
     }
   };
   ```
   
   do something like:
   ```c++
   struct AddChecked {
     // Instructs ScalarBinaryNotNull to use batching
     static constexpr bool kBatched = true;
   
     template <typename T, typename Arg0, typename Arg1>
     static Status Call(KernelContext*, span<const Arg0> left, span<const Arg1> right, span<T> out) {
       T result = 0;
       const size_t len = out.size();
       bool ok = true;
       for (size_t i = 0; i < len; ++i) {
         ok &= AddWithOverflow(left[i], right[i], &out[i]);
       }
       return ok ? Status::OK() : Status::Invalid("overflow");
     }
   };
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org