Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/09/05 21:10:00 UTC

[jira] [Created] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

Bryan Cutler created SPARK-25351:
------------------------------------

             Summary: Handle Pandas category type when converting from Python with Arrow
                 Key: SPARK-25351
                 URL: https://issues.apache.org/jira/browse/SPARK-25351
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 2.3.1
            Reporter: Bryan Cutler


Category types need to be handled when calling {{createDataFrame}} with Arrow enabled, and when converting the return value of a {{pandas_udf}}. Without Arrow, Spark casts each element of a category column to the category's value type (here, string). For example:

{noformat}
In [1]: import pandas as pd

In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})

In [3]: pdf["B"] = pdf["A"].astype('category')

In [4]: pdf
Out[4]: 
   A  B
0  a  a
1  b  b
2  c  c
3  a  a

In [5]: pdf.dtypes
Out[5]: 
A      object
B    category
dtype: object

In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)

In [8]: df = spark.createDataFrame(pdf)

In [9]: df.show()
+---+---+
|  A|  B|
+---+---+
|  a|  a|
|  b|  b|
|  c|  c|
|  a|  a|
+---+---+


In [10]: df.printSchema()
root
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)

In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)

In [19]: df = spark.createDataFrame(pdf)   

   1667         spark_type = ArrayType(from_arrow_type(at.value_type))
   1668     else:
-> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
   1670     return spark_type
   1671 

TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
{noformat}
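
Until this is handled, one user-side workaround is to cast any category columns back to their underlying value type before passing the DataFrame to Spark. A minimal sketch, reusing the repro above (the loop and column names are illustrative, not part of any Spark API):

{noformat}
import pandas as pd

pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype('category')

# Cast each category column to the dtype of its categories so that
# Arrow sees a plain array rather than a dictionary-encoded one.
for col in pdf.columns:
    if pd.api.types.is_categorical_dtype(pdf[col]):
        pdf[col] = pdf[col].astype(pdf[col].cat.categories.dtype)

# Now succeeds with spark.sql.execution.arrow.enabled = true
df = spark.createDataFrame(pdf)
{noformat}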


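The non-Arrow behavior suggests the Arrow path should do the same thing: unwrap the dictionary type to its value type. A sketch of that type mapping, assuming a recent pyarrow (this only illustrates the desired conversion, it is not the actual patch):

{noformat}
import pyarrow as pa
import pyarrow.types as types

def unwrap_dictionary_type(at):
    # A pandas category arrives as an Arrow dictionary type; treating
    # it as its value type mirrors the non-Arrow cast shown above.
    if types.is_dictionary(at):
        return at.value_type
    return at

# dictionary<values=string, indices=int8> -> string
at = pa.dictionary(pa.int8(), pa.string())
print(unwrap_dictionary_type(at))  # string
{noformat}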
