Posted to jira@arrow.apache.org by "Pavel Ganelin (Jira)" <ji...@apache.org> on 2021/02/23 18:17:00 UTC

[jira] [Created] (ARROW-11747) spark.createDataFrame does not support Pandas StringDtype extension type.

Pavel Ganelin created ARROW-11747:
-------------------------------------

             Summary: spark.createDataFrame does not support Pandas StringDtype extension type.
                 Key: ARROW-11747
                 URL: https://issues.apache.org/jira/browse/ARROW-11747
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Pavel Ganelin


The following test case demonstrates the problem:
{code:python}
import pandas as pd
from pyspark.sql import SparkSession, types

spark = SparkSession.builder.appName(__file__)\
    .config("spark.sql.execution.arrow.pyspark.enabled","true") \
    .getOrCreate()

good = pd.DataFrame([["abc"]], columns=["col"])

schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(good, schema=schema)

df.show()

bad = good.copy()
bad["col"] = bad["col"].astype("string")  # Pandas StringDtype extension type

schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(bad, schema=schema)

df.show()
{code}
The error:
{code:none}
C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)