Posted to github@arrow.apache.org by "randolf-scholz (via GitHub)" <gi...@apache.org> on 2023/04/11 15:30:00 UTC

[GitHub] [arrow] randolf-scholz commented on issue #34976: [Python] Allow `pyarrow.compute.cast` to coerce errors to null values

randolf-scholz commented on issue #34976:
URL: https://github.com/apache/arrow/issues/34976#issuecomment-1503618134

   @jorisvandenbossche 
   
   > As you show, we already have the utf8_is_numeric kernel, but the problem is that this is too simplistic and fails if there is a "+" or "-" in the string, or a thousands separate, ..?
   
   The `utf8_is_numeric` kernel apparently checks whether each character is marked by Unicode as belonging to some numeric category:
   
   https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc#L144-L151
   
   As my example shows, this is largely unrelated to whether a string can be cast to float: it returns `False` for the string "1.0", because "." is not a character that Unicode classifies as numeric. Therefore, it would be nice to have a utility function `utf8_is_float` that returns `True` if the string can be cast to `float`. With coercion alone, it is slightly annoying to figure out for which non-null items the conversion failed, because we have to do some mask arithmetic:
   
   ```python
   conversion_failed_mask = pa.compute.and_(
       array.cast(pa.float32(), errors="coerce").is_null(),
       pa.compute.invert(array.is_null())
   )
   non_float_items = array.filter(conversion_failed_mask)
   ```
   
   vs.
   
   ```python
   non_float_items = array.filter(pa.compute.invert(pa.compute.utf8_is_float(array)))
   ```
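
To make the distinction concrete without relying on the proposed API, here is a minimal pure-Python sketch. It is only an illustration: `is_float_like` is a hypothetical stand-in for the proposed `utf8_is_float`, it uses Python's `float()` parser (whose accepted inputs may differ from Arrow's cast rules, e.g. for whitespace or underscores), and `str.isnumeric` is used as a rough analogue of the Unicode-category check, not as Arrow's actual kernel.

```python
from typing import List, Optional


def is_float_like(value: Optional[str]) -> Optional[bool]:
    """Hypothetical stand-in for the proposed utf8_is_float; nulls stay null."""
    if value is None:
        return None
    try:
        float(value)  # Python's parser, NOT Arrow's cast semantics
    except ValueError:
        return False
    return True


def non_float_items(values: List[Optional[str]]) -> List[str]:
    """Non-null items that fail the conversion, with no mask arithmetic needed."""
    return [v for v in values if is_float_like(v) is False]


# "\u00bd" is the single character VULGAR FRACTION ONE HALF.
values = ["1.0", "-3", "\u00bd", "abc", None, "42"]

# Unicode-category style check: rejects "1.0" and "-3", accepts the fraction.
numeric_flags = [None if v is None else v.isnumeric() for v in values]

# Castability check: accepts "1.0" and "-3", rejects the fraction.
float_flags = [is_float_like(v) for v in values]

failed = non_float_items(values)
```

The two flag lists disagree on "1.0", "-3", and the fraction character, which is the mismatch described above; a dedicated predicate gives the failed items directly while leaving nulls out of the result.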

