Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/06 15:02:56 UTC

[GitHub] [spark] moskvax opened a new pull request #28743: [SPARK-31920] Fix pandas conversion using Arrow with __arrow_array__ columns

moskvax opened a new pull request #28743:
URL: https://github.com/apache/spark/pull/28743

### What changes were proposed in this pull request?

1. Cast pandas DataFrame columns to object before passing them to `pa.Schema.from_pandas`, to avoid a `numpy.dtype` type check that can fail with pyarrow < 0.17.0.
2. Check whether `__arrow_array__` is implemented before passing a mask to `pa.Array.from_pandas`, which raises an exception when a mask is passed for values that implement `__arrow_array__`.
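
The check in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `choose_mask` and `FakeExtensionArray` are hypothetical names, and the stand-in class only mimics an extension array that defines `__arrow_array__`, so the snippet runs without pandas or pyarrow installed.

```python
def choose_mask(values, null_mask):
    """Return the mask to hand to the Arrow conversion, or None.

    Hypothetical helper: values that implement __arrow_array__ perform
    their own conversion, so no separate mask is passed for them.
    """
    if hasattr(values, "__arrow_array__"):
        return None
    return null_mask


class FakeExtensionArray:
    """Hypothetical stand-in for an extension array such as IntegerArray."""

    def __arrow_array__(self, type=None):
        raise NotImplementedError("illustration only")


plain_values = [1.0, float("nan"), 3.0]
plain_mask = [False, True, False]

# Plain values keep their null mask.
print(choose_mask(plain_values, plain_mask))
# Values implementing __arrow_array__ get no mask.
print(choose_mask(FakeExtensionArray(), plain_mask))
```

In a real conversion path, the returned value would presumably be supplied as the `mask` argument of `pa.Array.from_pandas`.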

### Why are the changes needed?

These changes allow the use of pandas DataFrames containing ExtensionDtype columns backed by arrays that implement `__arrow_array__`. Such columns are produced when a DataFrame is constructed with an ExtensionDtype-extending pandas type in the `dtype` parameter, and can also be created by calling `convert_dtypes` on an existing DataFrame.
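
For illustration, the two ways of obtaining such columns might look like the sketch below. It assumes pandas 1.0 or later is installed; the column names and values are made up for the example, and the exact dtype names printed can vary between pandas versions.

```python
import pandas as pd

# Construct a DataFrame with a nullable integer extension dtype.
ints = pd.DataFrame({"a": [1, None, 3]}, dtype="Int64")

# Convert an existing object-dtype DataFrame to extension dtypes.
strs = pd.DataFrame({"s": ["x", None, "z"]}).convert_dtypes()

for name, col in [("a", ints["a"]), ("s", strs["s"])]:
    # col.array is the backing extension array for the column.
    print(name, col.dtype, hasattr(col.array, "__arrow_array__"))
```

Both columns are backed by extension arrays, which is the case the Arrow conversion path previously could not handle.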

### Does this PR introduce _any_ user-facing change?

Yes. Users will be able to convert a wider variety of pandas DataFrames into Spark DataFrames using any currently released pyarrow version > 0.15.1. Prior to this fix, the Arrow conversion path would not work with these DataFrames.

### How was this patch tested?

Tests were added to cover the cases of converting from pandas DataFrames with `IntegerArray` and `StringArray` backed columns. A typo was also fixed in a recently added test.
