Posted to reviews@spark.apache.org by "Kimahriman (via GitHub)" <gi...@apache.org> on 2023/07/15 12:26:01 UTC

[GitHub] [spark] Kimahriman commented on pull request #41569: [SPARK-39979][SQL][FOLLOW-UP] Support large variable types in pandas UDF, createDataFrame and toPandas with Arrow

Kimahriman commented on PR #41569:
URL: https://github.com/apache/spark/pull/41569#issuecomment-1636752962

   Attempted a PR for the Arrow issue: https://github.com/apache/arrow/pull/36701. Though after doing some digging, I think that was only causing one test to fail, and that test is a weird case of trying to convert a double to a string as part of the Arrow conversion. Arrow already supports converting a pandas series of strings to the large_string type (when the numpy dtype is object), but not a numpy string array (when the numpy dtype is a fixed-width unicode/utf8 type). The former goes through https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/numpy_to_arrow.cc#L324C9-L324C26 instead of the other `Visit` paths.
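   
   To illustrate the distinction, here is a minimal sketch (not taken from the PR; the exact failure mode depends on the pyarrow version):
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   
   # A pandas Series of Python strings has numpy dtype object, and pyarrow
   # can convert it directly to the large_string type.
   series = pd.Series(["spark", "arrow"])
   print(series.dtype)                              # object
   arr = pa.array(series, type=pa.large_string())
   print(arr.type)                                  # large_string
   
   # A plain numpy unicode array has a fixed-width string dtype (e.g. <U5);
   # converting that to large_string is the path that was missing and may
   # raise on pyarrow versions without the fix.
   np_strings = np.array(["spark", "arrow"])
   print(np_strings.dtype)                          # <U5
   arr2 = pa.array(np_strings, type=pa.large_string())
   ```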
   
   The other test failures were just due to Arrow not having large-type support when looking up the numpy type for an Arrow type (also added that to the above PR). That can be fixed on the Spark side by just using np.object explicitly for string and binary types (see the sketch below), but I'm hitting a weird new test issue that I'm still trying to figure out.
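   
   A rough sketch of that Spark-side workaround (a hypothetical helper, not the actual change; np.object_ is used here since the plain np.object alias is deprecated in recent numpy):
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   def arrow_to_numpy_dtype(arrow_type):
       # Map string/binary types (including the large variants) to the object
       # dtype explicitly instead of relying on Arrow's to_pandas_dtype lookup,
       # which lacked large-type support at the time.
       if pa.types.is_string(arrow_type) or pa.types.is_large_string(arrow_type):
           return np.object_
       if pa.types.is_binary(arrow_type) or pa.types.is_large_binary(arrow_type):
           return np.object_
       return arrow_type.to_pandas_dtype()
   ```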


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

