
Posted to issues@arrow.apache.org by "rohanjain101 (via GitHub)" <gi...@apache.org> on 2023/04/05 06:20:43 UTC

[GitHub] [arrow] rohanjain101 opened a new issue, #34901: Inconsistent cast behavior between array and scalar for int64

rohanjain101 opened a new issue, #34901:
URL: https://github.com/apache/arrow/issues/34901

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ```
   >>> scal = pa.scalar(6312878760374611856, type=pa.int64())
   >>> scal.cast(pa.float64())
   <pyarrow.DoubleScalar: 6.312878760374612e+18>
   >>> arr = pa.array([6312878760374611856], type=pa.int64())
   >>> arr.cast(pa.float64())
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow\array.pxi", line 926, in pyarrow.lib.Array.cast
     File "C:\pandas2_ps_04323\lib\site-packages\pyarrow\compute.py", line 391, in cast
       return call_function("cast", [arr], options)
     File "pyarrow\_compute.pyx", line 560, in pyarrow._compute.call_function
     File "pyarrow\_compute.pyx", line 355, in pyarrow._compute.Function.call
     File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Integer value 6312878760374611856 not in range: -9007199254740992 to 9007199254740992
   >>>
   ```
   
   Cast behavior is not consistent between array and scalar. Raising in the array case does not seem correct, since it seems an int64 should always be castable to float64.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] danepitkin commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499529735

Hi @rohanjain101,

I am able to reproduce this example on pyarrow v11 using macOS 13.3.

What you are experiencing is the difference between safe and unsafe casting, since the number you chose probably cannot be fully represented in the new type. It is not true that all int64 values can be safely converted to float64. Because of the way precision works in floating point, some numbers that int64 can represent are skipped by float64. See https://en.wikipedia.org/wiki/Double-precision_floating-point_format, which states that only `Integers from −2^53 to 2^53 (−9,007,199,254,740,992 to 9,007,199,254,740,992) can be exactly represented`. My guess is that the underlying implementation enforces this as a hard limit, even though I believe some larger int64 numbers can still be represented exactly when cast to float64 (such as 18,014,398,509,481,984, but not 18,014,398,509,481,983).
```
>>> arr = pa.array([18014398509481984], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992
```
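A minimal sketch of why 18,014,398,509,481,984 is representable while 18,014,398,509,481,983 is not, assuming only that float64 carries a 53-bit significand. The helper names `significant_bits` and `fits_float64` are made up for illustration and are not part of pyarrow:

```python
def significant_bits(n: int) -> int:
    """Number of bits from the highest to the lowest set bit of n, inclusive."""
    n = abs(n)
    if n == 0:
        return 0
    trailing_zeros = (n & -n).bit_length() - 1
    return (n >> trailing_zeros).bit_length()


def fits_float64(n: int) -> bool:
    """True if n (assumed to be within the int64 range) is exactly a float64."""
    return significant_bits(n) <= 53


# Second column: the bit-count rule. Third column: a round trip through float.
for n in (2**53, 2**53 + 1, 2**54 - 1, 2**54):
    print(n, fits_float64(n), int(float(n)) == n)
```

The two printed columns should agree for every value, since an int64 survives the round trip exactly when its set bits span at most 53 positions.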

It appears the scalar cast defaults to allow unsafe casting, while the array defaults to safe casting. You can allow unsafe casting in the array like this:
```
>>> arr.cast(pa.float64(), safe=False)
<pyarrow.lib.DoubleArray object at 0x126a40ee0>
[
6.312878760374612e+18
]
```
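To see what the unchecked cast gives up for the value in the original report, here is a small sketch that uses Python's built-in `float` as a stand-in for the unsafe cast (an assumption: both are taken to round to the nearest float64):

```python
n = 6312878760374611856      # value from the original report
as_float = float(n)          # nearest float64, standing in for the unsafe cast
round_trip = int(as_float)   # exact integer value of that float64

print(as_float)
print(round_trip == n)       # whether the original integer survived
print(round_trip - n)        # how far the float64 is from the original
```

Near 6.3e18, adjacent float64 values are 1024 apart, so the round trip can be off by up to 512.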

There are no options to choose safe vs unsafe cast in scalar APIs at the moment. The documentation does state the scalar will perform a safe cast, though, which it is not doing: https://arrow.apache.org/docs/python/generated/pyarrow.Int64Scalar.html#pyarrow.Int64Scalar

This is either a bug in scalar safe casting or the documentation is wrong. Ideally, Scalars can also allow you to choose safe vs unsafe casting with an option. Either way, some more investigation is still needed.


[GitHub] [arrow] danepitkin commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499603765

   I'd recommend filing an issue with numpy about this, too:
   ```
   # Bug: No safety error for initial int64 -> float64 conversion
   >>> np.array([18014398509481983]).astype("float64", casting="safe").astype(str)
   array(['1.8014398509481984e+16'], dtype='<U32')
   ```



[GitHub] [arrow] danepitkin commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499589776

   For pyarrow, we should probably:
   1) Allow both safe and unsafe conversion options for scalar APIs (feature)
   2) Default to safe conversion for scalars, which does not appear to be happening (bug)
   3) Look into allowing safe conversion from int <-> float for large numbers (feature)



[GitHub] [arrow] rohanjain101 commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "rohanjain101 (via GitHub)" <gi...@apache.org>.

rohanjain101 commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499591766

   But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't that then succeed in pyarrow if it can be done safely? In my example, the array case is still raising even if the cast is safe. Should it only raise for 18,014,398,509,481,983?
   
   



[GitHub] [arrow] danepitkin commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1544499956

   > So this issue is about allowing these sorts of casts when `safe=true` and the value happens to be representable? I thought this was an ask for a new kind of unsafe cast.
   
   Yes, from my understanding.
   
   The original issue reported is now fixed (https://github.com/apache/arrow/pull/35395). We can either repurpose this issue for the above feature request, or close this issue and file a new one.
   



[GitHub] [arrow] danepitkin closed issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin closed issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64
URL: https://github.com/apache/arrow/issues/34901



[GitHub] [arrow] danepitkin commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499584242

   > @danepitkin thank you for the clarification. In numpy however, the cast succeeds, it seems as if full value is preserved:
   > 
   > >>> np.array([18014398509481984]).astype("float64")
   > array([1.80143985e+16])
   > 
   > Is there an internal difference in how double values are stored between arrow and numpy that would cause the difference?
   
   In your example, 18,014,398,509,481,984 can be converted to float64 safely according to the floating point specification so it is not a good example to use. Instead let's try 18,014,398,509,481,983, which is not a multiple of 2 (required by integers between 2^53 and 2^54 for safe conversion).
   
   You will lose data in this numpy cast. (And yes, my guess is they adhere to the floating point spec slightly differently purely based on the different behavior).
   ```
   >>> np.array([18014398509481983]).astype("float64").astype("int64") == 18014398509481983
   array([False])
   
   >>> np.array([18014398509481983]).astype("float64").astype("int64")
   array([18014398509481984])
   ```
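The "multiple of 2" requirement mentioned above follows from the gap between adjacent float64 values, which doubles at each power of two. A short sketch using `math.ulp` (available from Python 3.9); `spacing_at` is a hypothetical helper name:

```python
import math


def spacing_at(exponent: int) -> float:
    """Gap between 2**exponent and the next larger float64."""
    return math.ulp(2.0 ** exponent)


for exponent in (52, 53, 54):
    print(f"2^{exponent}: gap to next float64 = {spacing_at(exponent)}")
```

From 2^53 up to 2^54 the gap is 2, so only even integers in that interval have an exact float64 counterpart.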
   
   Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.
   ```
   >>> type(np.array([18014398509481984])[0])
   <class 'numpy.int64'>
   
   
   # Bug? No safety error for int64 -> float64
   >>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64") == 18014398509481983
   array([False])
   
   
   # Good? Errors out on float64 -> int64, but the bug happened in the int64 -> float64 and was somehow propagated..
   >>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64", casting="safe") == 18014398509481983
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
   ```



[GitHub] [arrow] jorisvandenbossche commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1503400417

   > Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.
   
   Sidenote: numpy doesn't really have the same concept of "safe" casting as how we use this in pyarrow. In pyarrow this depends on the _values_, while in numpy this is just a property of a cast between two _dtypes_. So to say if a cast from one dtype to another is safe or not, numpy needs to make some generalization/assumption, and so it seems it decided that casting int to float is generally safe (indeed, except for large ints) and casting floats to ints is generally not safe (indeed, except if you have rounded floats):
   
   ```python
   >>> np.can_cast(np.int64(), np.float64(), casting="safe")
   True
   >>> np.can_cast(np.float64(), np.int64(), casting="safe")
   False
   ```
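The two notions of "safe" can be contrasted in a rough pure-Python model. This is only a sketch: the rule table and function names are hypothetical, and the value-level check for int to float uses the fixed ±2^53 range quoted earlier in the thread rather than any actual library code.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

# Hypothetical dtype-level table: one answer per (source, target) pair,
# decided without looking at any data.
DTYPE_RULE = {("int64", "float64"): True, ("float64", "int64"): False}


def dtype_level_safe(src: str, dst: str) -> bool:
    return DTYPE_RULE[(src, dst)]


def value_level_int_to_float_safe(values) -> bool:
    # Value-level rule: inspect every element against the fixed range.
    return all(-2**53 <= v <= 2**53 for v in values)


def value_level_float_to_int_safe(values) -> bool:
    # Value-level rule: every float must be a whole number within int64.
    return all(v.is_integer() and INT64_MIN <= v <= INT64_MAX for v in values)


print(dtype_level_safe("int64", "float64"))            # same answer for any data
print(value_level_int_to_float_safe([2**54 - 1]))      # depends on the data
print(value_level_float_to_int_safe([1.0, 2.0]))       # depends on the data
```

Under the dtype-level rule the answer never changes with the data, which is why a lossy int64 to float64 cast can pass a `casting="safe"` check.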



[GitHub] [arrow] danepitkin commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499593788

   > But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't that then succeed in pyarrow if it can be done safely? In my example, the array case is still raising even if the cast is safe. Should it only raise for 18,014,398,509,481,983?
   
   If pyarrow were to follow the floating point specification exactly, then yes it would. Right now, it seems to be a limitation of the implementation. You could argue that option (3) above should be a bug instead of a feature.



[GitHub] [arrow] westonpace commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1516820479

> I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?

The rounding kernel allows for conversion from one valid IEEE value to another. This rounding is about going from an infinite precision value that cannot be represented in IEEE to a valid IEEE value.

I'll walk my comment back though.

Technically, IEEE rounding is something that has to be considered in just about any operation (e.g. addition, subtraction) because the infinite-precision result isn't representable.

In practice, we'd probably be better off just saying we always use TIE_TO_EVEN and it's not configurable. This is what every other engine seems to do (TIE_TO_EVEN is the default for most / all modern CPUs).
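As a small illustration of ties-to-even, here is what CPython's own int to float conversion does just above 2^53, where adjacent float64 values are 2 apart. This shows the rounding rule in plain Python, not Arrow's cast kernel:

```python
base = 2**53  # above this, consecutive float64 values are 2 apart

# base + 1 is exactly halfway between base and base + 2
down = float(base + 1)
# base + 3 is exactly halfway between base + 2 and base + 4
up = float(base + 3)

print(int(down) - base)  # offset of the chosen neighbor from base
print(int(up) - base)
```

In both cases the tie goes to the neighbor whose significand is even, so one tie rounds down and the other rounds up.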

> It's not super clear from the name, but so we already use the existing allow_float_truncate options for the int->float cast (the name suggests to me this is mostly about float->int truncating the float if it is not a "round" float).

I didn't realize this. So this issue is about allowing these sorts of casts when `safe=true` and the value happens to be representable? I thought this was an ask for a new kind of unsafe cast.


[GitHub] [arrow] danepitkin commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1544522312

   I captured the feature request to safely cast representable int64s larger than 2^53 to float64 here: https://github.com/apache/arrow/issues/35563.
   
   Closing this issue since the original bug report is resolved in https://github.com/apache/arrow/pull/35395



[GitHub] [arrow] rohanjain101 commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "rohanjain101 (via GitHub)" <gi...@apache.org>.

rohanjain101 commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1509992674

   Regarding 3, I think at least the error should be improved if it's an internal limitation, for example:
   
   ```
   >>> pa.array([18014398509481984], type=pa.float64())
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow\array.pxi", line 320, in pyarrow.lib.array
     File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
     File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 is outside of the range exactly representable by a IEEE 754 double precision value
   ```
   
   The error message says that 18014398509481984 is not exactly representable by an IEEE 754 double, when according to IEEE 754 it can be represented exactly. Should the error be clarified to say it's an internal limitation?



[GitHub] [arrow] jorisvandenbossche commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1516109215

   > If we are discussing ideal behavior then I think something like...
   > 
   > ```
   > // Allow converting integers to floats when the integer cannot be exactly represented by
   > // an IEEE-754 float and must be rounded.
   > bool allow_float_inexact;
   > ```
   
   It's not super clear from the name, but we already use the existing `allow_float_truncate` option for the int->float cast (the name suggests to me that it is mostly about float->int, truncating the float if it is not a "round" float).
   
   ```python
   >>> pa.array([18014398509481984], type=pa.int64()).cast(pa.float64())
   ...
   ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992
   
   >>> pa.array([18014398509481984], type=pa.int64()).cast(options=pc.CastOptions(pa.float64(), allow_float_truncate=True))
   <pyarrow.lib.DoubleArray object at 0x7f4e1f7dee00>
   [
     1.8014398509481984e+16
   ]
   ```
   
   But your suggestion would be that an option like `allow_float_inexact` would use a different logic to determine when the cast is allowed (actually check if the result value is inexact, instead of checking it is outside of the generally safe range?)
   
   (personally I would say that `allow_float_inexact` is a better name for what we currently call `allow_float_truncate` for the int->float cast)
   
   
   
   > If we wanted to be even more extreme :smile: we could have:
   > 
   > ```
   > enum class IeeeRoundingMode : int8_t {
   >   TIE_TO_EVEN = 0,
   >   ...
   > ```
   
   I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast? 



[GitHub] [arrow] jorisvandenbossche commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1503439087

   > For pyarrow, we should probably:
   > 
   > 1. Allow both safe and unsafe conversion options for scalar APIs (feature)
   > 2. Default to safe conversion for scalars, which does not appear to be happening (bug)
   
   Yes, fully agreed, I opened a separate issue for specifically this aspect: https://github.com/apache/arrow/issues/35040
   
   > 3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)
   
   For checking the safety of casting int to float, we indeed use this fixed range:
   
   https://github.com/apache/arrow/blob/e488942cd552ac36a46d40477c1b0326a626ed98/cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc#L171-L250
   
   I am not fully sure this is something we should change. First, I think this is a lot simpler in implementation to just check for values within the range, compared to checking for certain integers that can still be represented as float outside of that range. But also for the user this seems easier to understand and gives more consistent behaviour? (just everything outside of that range will fail with the default `safe=True`, and not depending on the exact value)
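The trade-off between the two checks can be sketched in plain Python. Both function names are hypothetical, and this is a model of the two policies under discussion rather than the C++ kernel linked above:

```python
FLOAT64_EXACT_INT_LIMIT = 2**53


def in_fixed_range(n: int) -> bool:
    """Range policy: accept only integers within [-2^53, 2^53]."""
    return -FLOAT64_EXACT_INT_LIMIT <= n <= FLOAT64_EXACT_INT_LIMIT


def round_trips_exactly(n: int) -> bool:
    """Per-value policy: accept any int64 that survives int -> float -> int."""
    return int(float(n)) == n


# Columns: value, range policy, per-value policy.
for n in (2**53, 2**53 + 1, 2**53 + 2, 2**54 - 1, 2**54):
    print(n, in_fixed_range(n), round_trips_exactly(n))
```

The range policy gives the same answer for everything above 2^53, while the per-value policy accepts some of those integers and rejects others depending on their low bits.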



[GitHub] [arrow] westonpace commented on issue #34901: [Python] Inconsistent cast behavior between array and scalar for int64

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1514979073

   > Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)
   
   The options within the C++ lib are very fine-grained already:
   
   ```
     bool allow_int_overflow;
     bool allow_time_truncate;
     bool allow_time_overflow;
     bool allow_decimal_truncate;
     bool allow_float_truncate;
     // Indicate if conversions from Binary/FixedSizeBinary to string must
     // validate the utf8 payload.
     bool allow_invalid_utf8;
   ```
   
   If we are discussing ideal behavior then I think something like...
   
   ```
   // Allow converting integers to floats when the integer cannot be exactly represented by
   // an IEEE-754 float and must be rounded.
   bool allow_float_inexact;
   ```
   
   ...would be very reasonable.  If we wanted to be even more extreme :smile: we could have:
   
   ```
   enum class IeeeRoundingMode : int8_t {
     TIE_TO_EVEN = 0,
     TIE_AWAY_FROM_ZERO = 1,
     TOWARD_ZERO = 2,
     TOWARD_POSITIVE_INFINITY = 3,
     TOWARD_NEGATIVE_INFINITY = 4,
     ERROR = 5
   };
   ```
   
   However, all of point number 3 sounds like a separate issue from this one.
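For illustration only, here is a rough Python model of how the rounding modes in the enum above could behave for an int64 to float64 conversion. The function `int_to_float` and its string mode names are hypothetical and do not correspond to an existing Arrow option; `math.nextafter` requires Python 3.9 or later.

```python
import math

MODES = ("TIE_TO_EVEN", "TIE_AWAY_FROM_ZERO", "TOWARD_ZERO",
         "TOWARD_POSITIVE_INFINITY", "TOWARD_NEGATIVE_INFINITY", "ERROR")


def int_to_float(n: int, mode: str = "TIE_TO_EVEN") -> float:
    """Convert an int64-range integer to float64 under a hypothetical mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    nearest = float(n)  # CPython rounds ties to even here
    if nearest == n:    # int/float comparison is exact in Python
        return nearest
    if mode == "ERROR":
        raise ValueError(f"{n} is not exactly representable as float64")
    # Bracket n between its two neighboring float64 values.
    if nearest > n:
        lo, hi = math.nextafter(nearest, -math.inf), nearest
    else:
        lo, hi = nearest, math.nextafter(nearest, math.inf)
    if mode == "TIE_TO_EVEN":
        return nearest
    if mode == "TOWARD_POSITIVE_INFINITY":
        return hi
    if mode == "TOWARD_NEGATIVE_INFINITY":
        return lo
    if mode == "TOWARD_ZERO":
        return lo if n > 0 else hi
    # TIE_AWAY_FROM_ZERO: nearest neighbor, ties go to the larger magnitude.
    below, above = n - int(lo), int(hi) - n
    if below != above:
        return lo if below < above else hi
    return hi if n > 0 else lo


n = 2**53 + 1  # halfway between two adjacent float64 values
for mode in MODES[:-1]:
    print(mode, int(int_to_float(n, mode)) - 2**53)
```

The modes only differ for integers that are not exactly representable. For exact values every mode, including `ERROR`, returns the same float64.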



[GitHub] [arrow] rohanjain101 commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "rohanjain101 (via GitHub)" <gi...@apache.org>.

rohanjain101 commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499544223

   @danepitkin thank you for the clarification. In numpy however, the cast succeeds, it seems as if full value is preserved:
   
   >>> np.array([18014398509481984]).astype("float64")
   array([1.80143985e+16])
   >>>
   
   Is there an internal difference in how double values are stored between arrow and numpy that would cause the difference?



[GitHub] [arrow] danepitkin commented on issue #34901: Inconsistent cast behavior between array and scalar for int64

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34901:
URL: https://github.com/apache/arrow/issues/34901#issuecomment-1499612765

   Thanks for raising this issue by the way. I don't think I expressed that earlier. Your contributions are appreciated!

