Posted to issues@arrow.apache.org by "Pavel Ganelin (Jira)" <ji...@apache.org> on 2021/02/23 18:17:00 UTC
[jira] [Created] (ARROW-11747) spark.createDataFrame does not support Pandas StringDtype extension type.
Pavel Ganelin created ARROW-11747:
-------------------------------------
Summary: spark.createDataFrame does not support Pandas StringDtype extension type.
Key: ARROW-11747
URL: https://issues.apache.org/jira/browse/ARROW-11747
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Pavel Ganelin
The following test case demonstrates the problem:
{code:python}
import pandas as pd
from pyspark.sql import SparkSession, types

spark = SparkSession.builder.appName(__file__) \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

# A plain object-dtype string column works as expected.
good = pd.DataFrame([["abc"]], columns=["col"])
schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(good, schema=schema)
df.show()

# The same data with the pandas StringDtype extension type fails.
bad = good.copy()
bad["col"] = bad["col"].astype("string")
df = spark.createDataFrame(bad, schema=schema)
df.show()
{code}
The error:
{code}
C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warnings.warn(msg)
{code}
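Until this is fixed, one possible workaround (a sketch only, not an official recommendation) is to cast extension-typed columns back to plain object dtype before calling createDataFrame, so the Arrow-optimized path is not bypassed:

```python
import pandas as pd

df = pd.DataFrame([["abc"], [None]], columns=["col"])
df["col"] = df["col"].astype("string")   # pandas StringDtype extension type

# Workaround sketch: convert extension dtypes back to object dtype
# before handing the frame to spark.createDataFrame.
safe = df.copy()
safe["col"] = safe["col"].astype(object)
print(safe["col"].dtype)  # object
```

With the cast applied, {{spark.createDataFrame(safe, schema=schema)}} should take the Arrow-optimized path instead of falling back.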
--
This message was sent by Atlassian Jira
(v8.3.4#803005)