Posted to issues@arrow.apache.org by "lukemanley (via GitHub)" <gi...@apache.org> on 2023/04/12 22:02:08 UTC

[GitHub] [arrow] lukemanley opened a new issue, #35088: [Python] pyarrow.compute.subtract_checked overflowing for some duration arrays constructed from numpy

lukemanley opened a new issue, #35088:
URL: https://github.com/apache/arrow/issues/35088

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   In the example below, `arr2` and `arr3` are duration arrays with a single null element. 
   
   `arr2` is constructed from a list
   `arr3` is constructed from a numpy array
   
   Once constructed, they compare as equal.
   
   However, they exhibit different behavior once passed to `pyarrow.compute.subtract_checked`:
   
   ```
   import pyarrow as pa
   import pyarrow.compute as pc
   import numpy as np
   
   data1 = [86400000000]
   data2 = [None]
   data3 = np.array([None], dtype="timedelta64[ns]")
   
   arr1 = pa.array(data1, type=pa.duration("ns"))
   arr2 = pa.array(data2, type=pa.duration("ns"))
   arr3 = pa.array(data3, type=pa.duration("ns"))
   
   assert arr2 == arr3
   
   pc.subtract_checked(arr1, arr2)  # ok
   pc.subtract_checked(arr1, arr3)  # ArrowInvalid: overflow 
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #35088: [Python] pyarrow.compute.subtract_checked overflowing for some duration arrays constructed from numpy

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35088:
URL: https://github.com/apache/arrow/issues/35088#issuecomment-1506526303

   @lukemanley thanks for the report. This is an interesting bug. The two arrays appear to be the same, but their underlying data buffers differ because they were created differently. The data is masked because the elements are null, so in theory the actual value "behind" a null shouldn't matter.
   "Viewing" the data buffer as an int64 array to see the values:
   
   ```
   In [20]: pa.Array.from_buffers(pa.int64(), 1, [None, arr2.buffers()[1]])
   Out[20]: 
   <pyarrow.lib.Int64Array object at 0x7f4c1af64820>
   [
     0
   ]
   
   In [21]: pa.Array.from_buffers(pa.int64(), 1, [None, arr3.buffers()[1]])
   Out[21]: 
   <pyarrow.lib.Int64Array object at 0x7f4bf5998dc0>
   [
     -9223372036854775808
   ]
   ```
   
   And so my assumption is that the overflow comes from actually subtracting the values in the second case (`86400000000 - (-9223372036854775808)` would indeed overflow).
   
   However, the way `subtract_checked` is implemented, it _should_ normally only do the actual subtraction for data values that are not masked as null, exactly to avoid situations like the above. But it seems there is a bug in this mechanism for skipping values behind nulls.
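
To make the arithmetic concrete, here is a minimal pure-Python sketch of a checked int64 subtraction applied to the two values found behind the nulls (`0` for `arr2`, min int64 for `arr3`). The helper `subtract_checked_int64` is a hypothetical name for illustration only; it is not the Arrow implementation, it just mimics the int64 range check.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def subtract_checked_int64(a, b):
    """Hypothetical helper: subtract and raise if the result leaves the int64 range."""
    result = a - b  # Python ints are unbounded, so the range check is explicit
    if result < INT64_MIN or result > INT64_MAX:
        raise OverflowError(f"overflow: {a} - ({b}) does not fit in int64")
    return result


lhs = 86400000000  # the single value in arr1

# Value behind the null in arr2 (built from a list): the subtraction fits.
print(subtract_checked_int64(lhs, 0))

# Value behind the null in arr3 (built from numpy): the subtraction overflows.
try:
    subtract_checked_int64(lhs, INT64_MIN)
except OverflowError as exc:
    print(exc)
```

If the kernel never touched slots marked as null, neither hidden value would be read and both calls in the report would behave the same.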




[GitHub] [arrow] westonpace commented on issue #35088: [Python] pyarrow.compute.subtract_checked overflowing for some duration arrays constructed from numpy

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35088:
URL: https://github.com/apache/arrow/issues/35088#issuecomment-1512621733

   It's also tied to duration.  The fix is https://github.com/westonpace/arrow/commit/ec9a5a433e27b56c85214bee77fe3b1be74c07dd although a proper PR should add tests as well as check the other checked functions (e.g. add_checked, etc.)
   
   It turns out that the "skip nulls" behavior is something that has to be specified per-kernel and it wasn't being specified for the duration kernels.  Is this something we need to fit into 12.0.0?  If so I can try and carve out some time later this week for a PR.
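
As a rough illustration of why a per-kernel "skip nulls" setting matters, the sketch below models a kernel as an operation plus a flag. The `Kernel` class, its `skip_nulls` field, and `checked_sub` are all hypothetical names invented for this example; they are assumptions and do not mirror Arrow's actual C++ kernel registration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def checked_sub(a: int, b: int) -> int:
    """Hypothetical checked op: raise if a - b leaves the int64 range."""
    result = a - b
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("overflow")
    return result


@dataclass
class Kernel:
    """Hypothetical kernel: an element-wise op plus a per-kernel null policy."""
    op: Callable[[int, int], int]
    skip_nulls: bool = False  # must be set explicitly for each kernel

    def run(self, left: List[int], right: List[int],
            valid: List[bool]) -> List[Optional[int]]:
        out: List[Optional[int]] = []
        for a, b, ok in zip(left, right, valid):
            if self.skip_nulls and not ok:
                out.append(None)  # never read the values behind a null
                continue
            value = self.op(a, b)  # runs even on masked slots if the flag is unset
            out.append(value if ok else None)
        return out


left, right, valid = [86400000000], [INT64_MIN], [False]

# Flag set: the masked slot is skipped and the output is null.
print(Kernel(checked_sub, skip_nulls=True).run(left, right, valid))

# Flag forgotten: the op runs on the hidden min-int64 value and raises.
try:
    Kernel(checked_sub).run(left, right, valid)
except OverflowError as exc:
    print("error:", exc)
```

In this model, a kernel registered without the flag still produces correct nulls for unchecked ops, which is one way such an omission could go unnoticed until a checked op meets an extreme hidden value.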




[GitHub] [arrow] lukemanley commented on issue #35088: [Python] pyarrow.compute.subtract_checked overflowing for some duration arrays constructed from numpy

Posted by "lukemanley (via GitHub)" <gi...@apache.org>.
lukemanley commented on issue #35088:
URL: https://github.com/apache/arrow/issues/35088#issuecomment-1509992225

   Thanks for the explanation. It looks like numpy uses that value (min int64) for NaT:
   
   ```
   In [1]: import numpy as np
   
   In [2]: np.datetime64("NaT").astype(int)
   Out[2]: -9223372036854775808
   
   In [3]: np.array([-9223372036854775808], dtype="m8[ns]")
   Out[3]: array(['NaT'], dtype='timedelta64[ns]')
   ```

