Posted to issues@arrow.apache.org by "rohanjain101 (via GitHub)" <gi...@apache.org> on 2023/04/05 19:14:04 UTC

[GitHub] [arrow] rohanjain101 opened a new issue, #34909: Incorrect average result with int64 array

rohanjain101 opened a new issue, #34909:
URL: https://github.com/apache/arrow/issues/34909

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ```
   >>> import pyarrow as pa
   >>> import pyarrow.compute
   >>> arr = pa.array([-1303487490025821099, -8371390547526583103, -2572374159461887095], type=pa.int64())
   >>> pa.compute.mean(arr)
   <pyarrow.DoubleScalar: 2.0664972922317535e+18>
   >>>
   ```
   
   With NumPy, the same data gives the expected result:
   
   ```
   >>> arr.to_numpy().mean()
   -4.0824173990047636e+18
   >>>
   ```
   
   It seems that overflow is not handled correctly when computing the mean of an Arrow int64 array. The exact sum of these three values is below the int64 minimum.
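
The reported value is consistent with a 64-bit sum that wraps around before the division. The sketch below reproduces both numbers in pure Python, whose integers have arbitrary precision. It makes no Arrow calls, and the helper `wrap_int64` is a hypothetical name that only emulates two's-complement wrap-around for illustration.

```python
values = [-1303487490025821099, -8371390547526583103, -2572374159461887095]


def wrap_int64(n: int) -> int:
    """Emulate two's-complement wrap-around of a signed 64-bit integer."""
    return (n + 2**63) % 2**64 - 2**63


exact_sum = sum(values)              # Python ints never overflow
wrapped_sum = wrap_int64(exact_sum)  # what a 64-bit accumulator would hold

true_mean = exact_sum / len(values)       # close to the NumPy result
wrapped_mean = wrapped_sum / len(values)  # close to the value Arrow reported

print(true_mean)
print(wrapped_mean)
```

The exact sum is -12247252197014291297, which lies below the int64 minimum of -9223372036854775808, so a 64-bit accumulator ends up holding a positive value instead.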
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rohanjain101 commented on issue #34909: Incorrect average result with int64 array

Posted by "rohanjain101 (via GitHub)" <gi...@apache.org>.
rohanjain101 commented on issue #34909:
URL: https://github.com/apache/arrow/issues/34909#issuecomment-1498004791

   It seems that the mean is calculated as the sum divided by the count rather than being updated iteratively, so an overflow in the intermediate sum is not handled correctly.
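
The difference between the two approaches can be sketched as follows. This is an illustration under stated assumptions, not Arrow's actual implementation: `wrap_int64`, `mean_sum_then_divide`, and `mean_incremental` are hypothetical names, and the 64-bit accumulator is emulated with Python integers.

```python
INT64_MAX = 2**63 - 1


def wrap_int64(n: int) -> int:
    """Emulate two's-complement wrap-around of a signed 64-bit integer."""
    return (n + 2**63) % 2**64 - 2**63


def mean_sum_then_divide(values):
    """Accumulate in an emulated int64, then divide by the count."""
    total = 0
    for v in values:
        total = wrap_int64(total + v)  # may wrap around silently
    return total / len(values)


def mean_incremental(values):
    """Running mean in float64; the accumulator stays within the input range."""
    mean = 0.0
    for n, v in enumerate(values, start=1):
        mean += (v - mean) / n
    return mean


data = [INT64_MAX, 1]
print(mean_sum_then_divide(data))  # negative: the sum wrapped around
print(mean_incremental(data))      # positive, about 4.6e18
```

The incremental form avoids the overflow at the cost of one division per element, and it works in float64, so it cannot represent every int64 value exactly.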




[GitHub] [arrow] jorisvandenbossche commented on issue #34909: [C++] mean overflows if numeric sum is larger than int64 max

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34909:
URL: https://github.com/apache/arrow/issues/34909#issuecomment-1499342801

   NumPy always performs the actual computation in floating point: when calculating the mean of integers, the values are cast to float64 at the start. For float16 inputs it also uses float32 for the intermediate results; for other float inputs it keeps the input precision.
   
   In addition to computing in floats, pandas recently switched to Kahan summation (https://en.wikipedia.org/wiki/Kahan_summation_algorithm) to improve the precision of the operation. I think NumPy uses pairwise summation instead (https://en.wikipedia.org/wiki/Pairwise_summation).
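
For reference, the two summation techniques named above can be sketched in pure Python. These are textbook versions, not the pandas or NumPy code; the function names and the `block` size are assumptions chosen for illustration, and `math.fsum` from the standard library serves as the accurate reference.

```python
import math


def kahan_sum(values):
    """Compensated (Kahan) summation: carry the rounding error forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total


def pairwise_sum(values, block=8):
    """Pairwise summation: split in halves, sum small blocks naively."""
    n = len(values)
    if n <= block:
        s = 0.0
        for v in values:
            s += v
        return s
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)


data = [0.1] * 1000

naive = 0.0
for v in data:
    naive += v

reference = math.fsum(data)
print(abs(naive - reference))
print(abs(kahan_sum(data) - reference))
print(abs(pairwise_sum(data) - reference))
```

Both techniques reduce rounding error in a floating-point sum. Neither one prevents integer overflow by itself; the overflow goes away because the inputs are converted to floats first.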




[GitHub] [arrow] assignUser commented on issue #34909: [C++] mean overflows if numeric sum is larger than int64 max

Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on issue #34909:
URL: https://github.com/apache/arrow/issues/34909#issuecomment-1498214758

   This is actually an issue with the underlying C++ compute function.
   
   Here is a clear reprex:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.compute as pc

   # int64 max + 1 overflows the intermediate sum
   pc.mean(pa.array([np.iinfo(np.int64).max, 1]))
   # <pyarrow.DoubleScalar: -4.611686018427388e+18>
   ```




[GitHub] [arrow] pitrou closed issue #34909: [C++] mean overflows if numeric sum is larger than int64 max

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #34909: [C++] mean overflows if numeric sum is larger than int64 max
URL: https://github.com/apache/arrow/issues/34909

