Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/09/05 21:10:00 UTC
[jira] [Created] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
Bryan Cutler created SPARK-25351:
------------------------------------
Summary: Handle Pandas category type when converting from Python with Arrow
Key: SPARK-25351
URL: https://issues.apache.org/jira/browse/SPARK-25351
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 2.3.1
Reporter: Bryan Cutler
Pandas category types need to be handled when calling {{createDataFrame}} with Arrow enabled, or when they appear in the return value of a {{pandas_udf}}. Without Arrow, Spark casts each element of the category to its underlying value. For example:
{noformat}
In [1]: import pandas as pd
In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
In [3]: pdf["B"] = pdf["A"].astype('category')
In [4]: pdf
Out[4]:
   A  B
0  a  a
1  b  b
2  c  c
3  a  a
In [5]: pdf.dtypes
Out[5]:
A      object
B    category
dtype: object
In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
In [8]: df = spark.createDataFrame(pdf)
In [9]: df.show()
+---+---+
| A| B|
+---+---+
| a| a|
| b| b|
| c| c|
| a| a|
+---+---+
In [10]: df.printSchema()
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
In [19]: df = spark.createDataFrame(pdf)
1667 spark_type = ArrayType(from_arrow_type(at.value_type))
1668 else:
-> 1669 raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
1670 return spark_type
1671
TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
{noformat}
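Until category types are handled during Arrow conversion, one possible workaround (a sketch, not part of Spark; {{decategorize}} is a hypothetical helper) is to cast any category columns back to their underlying value dtype in pandas before calling {{createDataFrame}}, so Arrow never sees a dictionary type:

```python
import pandas as pd


def decategorize(pdf):
    """Return a copy of pdf with any 'category' columns cast back to
    the dtype of their categories, so Arrow conversion does not hit
    an unsupported dictionary type."""
    out = pdf.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            out[col] = out[col].astype(out[col].cat.categories.dtype)
    return out


pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype("category")

converted = decategorize(pdf)
print(converted.dtypes)
```

The converted frame keeps the original element values (matching the non-Arrow casting behavior shown above), it just drops the categorical encoding.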
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)