Posted to issues@arrow.apache.org by "randolf-scholz (via GitHub)" <gi...@apache.org> on 2023/04/07 18:48:48 UTC

[GitHub] [arrow] randolf-scholz opened a new issue, #34976: Allow `pyarrow.compute.cast` to coerce errors to null values.

randolf-scholz opened a new issue, #34976:
URL: https://github.com/apache/arrow/issues/34976

   ### Describe the enhancement requested
   
   I have a large array consisting of string data. Unfortunately, numerical data is mixed with categorical data, and `pyarrow` seems to offer no straightforward way to separate them.
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc  # import the compute submodule explicitly
   
   arr = pa.array(["3", "+5", "-4.2", "1,000.00", "foo", "7e-3"], type="string")
   print(pc.utf8_is_numeric(arr))  # ynnnnn: only "3" is classified as numeric
   pc.cast(arr, pa.float32())  # ArrowInvalid: Failed to parse string: 'foo' as a scalar of type float
   ```
   
   Basically, it would be great to have either (or both) of the following:
   
   - a function that returns a boolean mask indicating whether each string can be cast to float
   - an option for `pyarrow.compute.cast` that replaces errors with null values.
   
   My current workaround is to go through pandas: `pd.to_numeric(pd.Series(arr, dtype="string[pyarrow]"), errors="coerce")`.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #34976: [Python] Allow `pyarrow.compute.cast` to coerce errors to null values

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34976:
URL: https://github.com/apache/arrow/issues/34976#issuecomment-1503481165

   @randolf-scholz thanks for raising the issue! That would indeed be a useful addition. I am going to close this as a duplicate of https://github.com/apache/arrow/issues/20486, since that issue already tracks adding such an option to the cast kernel.
   
   > function that returns boolean mask whether string can be cast to float
   
   I understand that this would give a workaround, but I would say that ideally we would have that cast option, and then this function might be a bit too specific?
   As you show, we already have the `utf8_is_numeric` kernel, but the problem is that it is too simplistic and fails if there is a "+" or "-" in the string, or a thousands separator, ..?
   




[GitHub] [arrow] randolf-scholz commented on issue #34976: [Python] Allow `pyarrow.compute.cast` to coerce errors to null values

Posted by "randolf-scholz (via GitHub)" <gi...@apache.org>.
randolf-scholz commented on issue #34976:
URL: https://github.com/apache/arrow/issues/34976#issuecomment-1503618134

   @jorisvandenbossche 
   
   > As you show, we already have the utf8_is_numeric kernel, but the problem is that this is too simplistic and fails if there is a "+" or "-" in the string, or a thousands separator, ..?
   
   The `utf8_is_numeric` kernel apparently checks whether each character is marked by Unicode as belonging to some numeric category:
   
   https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc#L144-L151
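
The difference between "numeric by Unicode category" and "parseable as a float" can be illustrated with the Python standard library alone. This is a sketch, assuming the linked kernel tests the general categories Nd, Nl and No; both function names are hypothetical, and Python's `float()` accepts inputs (such as a leading "+") that Arrow's cast may not.

```python
import unicodedata

# Assumption: these general categories mirror the check in the linked kernel.
NUMERIC_CATEGORIES = {"Nd", "Nl", "No"}


def is_numeric_by_category(s: str) -> bool:
    """True if s is non-empty and every character is in a numeric category."""
    return len(s) > 0 and all(
        unicodedata.category(ch) in NUMERIC_CATEGORIES for ch in s
    )


def is_float_parseable(s: str) -> bool:
    """True if Python's float() accepts s (not identical to Arrow's parser)."""
    try:
        float(s)
    except ValueError:
        return False
    return True


# "." is punctuation (category Po), so "1.0" fails the category check
# even though it parses as a float.
dot_category = unicodedata.category(".")
```

The two predicates answer different questions: "½" passes the category check but is not a float literal, while "1.0" is a float literal but fails the category check.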
   
   As my example shows, this is pretty much unrelated to whether a string can be cast to float: it returns `False` for the string "1.0", because "." is not a character classified as numeric by Unicode. Therefore, it would be nice to have a utility function `utf8_is_float` that returns `True` if the string can be cast to `float`. With coercion alone, it is slightly annoying to figure out for which non-null items the conversion failed, because we have to do some mask arithmetic:
   
   ```python
   conversion_failed_mask = pa.compute.and_(
       array.cast(pa.float32(), errors="coerce").is_null(),
       pa.compute.invert(array.is_null())
   )
   non_float_items = array.filter(conversion_failed_mask)
   ```
   
   vs.
   
   ```python
   non_float_items = array.filter(pa.compute.invert(pa.compute.utf8_is_float(array)))
   ```

