Posted to issues@arrow.apache.org by "chaokunyang (via GitHub)" <gi...@apache.org> on 2023/06/06 03:18:47 UTC

[GitHub] [arrow] chaokunyang opened a new issue, #35925: [C++] ArrowInvalid: Failed to parse string: '2acv' as a scalar of type double

chaokunyang opened a new issue, #35925:
URL: https://github.com/apache/arrow/issues/35925

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Arrow raises an `ArrowInvalid` exception when casting a non-numeric string to float fails, even when `allow_invalid_utf8` is true:
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   str_arr = pa.array(["100", "200.22", "2acv"])
   str_arr.cast(options=pc.CastOptions(target_type=pa.float64(), allow_invalid_utf8=True))
   ```
   
   Exception:
   ```
   ---------------------------------------------------------------------------
   ArrowInvalid                              Traceback (most recent call last)
   Cell In[3], line 3
         1 str_arr = pa.array(["100", "200.22", "2acv"])
         2 print(pa.__version__)
   ----> 3 str_arr.cast(options=pc.CastOptions(target_type=pa.float64(), allow_invalid_utf8=True))
   
   File ~/anaconda3/envs/py3.8/lib/python3.8/site-packages/pyarrow/array.pxi:935, in pyarrow.lib.Array.cast()
   
   File ~/anaconda3/envs/py3.8/lib/python3.8/site-packages/pyarrow/compute.py:400, in cast(arr, target_type, safe, options, memory_pool)
       398     else:
       399         options = CastOptions.safe(target_type)
   --> 400 return call_function("cast", [arr], options, memory_pool)
   
   File ~/anaconda3/envs/py3.8/lib/python3.8/site-packages/pyarrow/_compute.pyx:572, in pyarrow._compute.call_function()
   
   File ~/anaconda3/envs/py3.8/lib/python3.8/site-packages/pyarrow/_compute.pyx:367, in pyarrow._compute.Function.call()
   
   File ~/anaconda3/envs/py3.8/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
   
   File ~/anaconda3/envs/py3.8/lib/python3.8/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
   
   ArrowInvalid: Failed to parse string: '2acv' as a scalar of type double
   ```
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #35925: [C++] ArrowInvalid: Failed to parse string: '2acv' as a scalar of type double

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35925:
URL: https://github.com/apache/arrow/issues/35925#issuecomment-1580107037

   The explanation for `allow_invalid_utf8` in the docstring (*"Whether producing invalid utf8 data is allowed when casting."*) could be more explicit that this keyword is only meant for casting _to_ string: i.e. when casting binary data to string, it controls whether the bytes are validated as UTF-8.
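
   A minimal sketch of the case this keyword does cover, assuming a small binary array made up for illustration (the values are not from the report). The default cast to string validates the bytes and rejects the array, while `allow_invalid_utf8=True` skips that validation:

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   # Illustrative binary values; the second one is not valid UTF-8.
   bin_arr = pa.array([b"abc", b"\xff\xfe"], type=pa.binary())

   # The default (safe) cast validates the bytes and raises on invalid UTF-8.
   try:
       pc.cast(bin_arr, pa.string())
       strict_cast_failed = False
   except pa.ArrowInvalid:
       strict_cast_failed = True

   # With allow_invalid_utf8=True the bytes are accepted without validation.
   options = pc.CastOptions(target_type=pa.string(), allow_invalid_utf8=True)
   unchecked = pc.cast(bin_arr, options=options)
   ```

   The resulting string array may then hold bytes that are not valid UTF-8, so converting it to Python strings can fail later.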
   
   I assume what you are looking for is an option for allowing unparsable strings to be converted to null while casting strings to numeric? 
   Such an option doesn't exist at the moment, but we have had previous feature requests for this, e.g. https://github.com/apache/arrow/issues/20486 and https://github.com/apache/arrow/issues/34976
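
   Until such an option exists, one possible workaround that stays within pyarrow is to null out the strings that do not look numeric before casting. This is only a sketch: the regular expression is an assumption chosen to be conservative (optional minus sign, digits, optional fractional part), so it also nulls out inputs such as `1e5` or `inf` that the cast itself may accept:

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   str_arr = pa.array(["100", "200.22", "2acv"])

   # Assumed pattern, deliberately narrower than what the Arrow parser accepts.
   NUMERIC_PATTERN = r"^-?[0-9]+(\.[0-9]+)?$"

   # Boolean mask: True where the whole string matches the pattern.
   is_numeric = pc.match_substring_regex(str_arr, NUMERIC_PATTERN)

   # Keep matching strings, replace everything else with a null string.
   null_string = pa.scalar(None, type=pa.string())
   cleaned = pc.if_else(is_numeric, str_arr, null_string)

   # The cast now only sees parsable strings and nulls.
   result = cleaned.cast(pa.float64())
   print(result.to_pylist())  # [100.0, 200.22, None]
   ```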
   



[GitHub] [arrow] chaokunyang commented on issue #35925: [C++] ArrowInvalid: Failed to parse string: '2acv' as a scalar of type double

Posted by "chaokunyang (via GitHub)" <gi...@apache.org>.
chaokunyang commented on issue #35925:
URL: https://github.com/apache/arrow/issues/35925#issuecomment-1580717271

   > The explanation for `allow_invalid_utf8` in the docstring (_"Whether producing invalid utf8 data is allowed when casting."_) could be more explicit about that this keyword is only meant for casting _to_ string: i.e. when casting binary data to string, should the bytes be validated to ensure it is valid UTF-8.
   > 
   > I assume what you are looking for is an option for allowing unparsable strings to be converted to null while casting strings to numeric? Such an option doesn't exist at the moment, but we have had previous feature requests about this as well, eg #20486 and #34976
   
   Yes, we need an option that converts unparsable strings to null. Currently we convert the array to polars, cast it there with `strict=False`, and convert the result back to Arrow:
   ```
   import polars as pl

   # strict=False makes polars return null for strings it cannot parse.
   double_array = (
       pl.from_arrow(utf8_array).cast(pl.Float64, strict=False).to_arrow()
   )
   # Use safe=False for integer targets to bypass the float-to-int truncation check:
   # ArrowInvalid: Float value 1.1 was truncated converting to int32
   safe = self.dtype not in _int_type
   numeric_array = double_array.cast(self.dtype, safe=safe)
   ```
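
   The effect of the `safe` flag in the last step can be seen in isolation. A small sketch, assuming a made-up float array rather than the real data:

   ```python
   import pyarrow as pa

   # Illustrative values; 1.1 has a fractional part that int32 cannot hold.
   doubles = pa.array([1.1, 2.0, None], type=pa.float64())

   # The safe cast (the default) refuses to drop the fractional part.
   try:
       doubles.cast(pa.int32())
       safe_cast_failed = False
   except pa.ArrowInvalid:
       safe_cast_failed = True

   # safe=False truncates instead of raising; nulls stay null.
   truncated = doubles.cast(pa.int32(), safe=False)
   print(truncated.to_pylist())  # [1, 2, None]
   ```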

