Posted to issues@spark.apache.org by "Jalpan Randeri (Jira)" <ji...@apache.org> on 2019/11/13 05:19:00 UTC
[jira] [Commented] (SPARK-25351) Handle Pandas category type when
converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973023#comment-16973023 ]
Jalpan Randeri commented on SPARK-25351:
----------------------------------------
Yes, I will take up these changes as part of my PR.
> Handle Pandas category type when converting from Python with Arrow
> ------------------------------------------------------------------
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 2.3.1
> Reporter: Bryan Cutler
> Priority: Major
> Labels: bulk-closed
>
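> Category types need to be handled when calling {{createDataFrame}} with Arrow enabled, and for the return value of a {{pandas_udf}}. Without Arrow, Spark casts each element to its category value, so the column arrives as a plain string column. With Arrow enabled, the conversion fails with a {{TypeError}}. For example: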
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]:
>    A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]:
> A      object
> B    category
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)
>    1667         spark_type = ArrayType(from_arrow_type(at.value_type))
>    1668     else:
> -> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
>    1670     return spark_type
>    1671
> TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
> {noformat}
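Until the Arrow path handles category columns, one possible workaround is to cast them back to the dtype of their categories on the pandas side before calling createDataFrame. The sketch below is only an illustration of that idea, not the change proposed for Spark itself. The helper name decategorize is hypothetical, and it assumes the DataFrame has unique column names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def decategorize(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of pdf with each categorical column cast to its categories' dtype.

    Hypothetical helper for illustration; assumes unique column names.
    """
    out = pdf.copy()
    for name in out.columns:
        dtype = out[name].dtype
        if isinstance(dtype, CategoricalDtype):
            # Replace the dictionary-encoded column with its plain values.
            out[name] = out[name].astype(dtype.categories.dtype)
    return out


# Same data as in the issue description.
pdf = pd.DataFrame({"A": ["a", "b", "c", "a"]})
pdf["B"] = pdf["A"].astype("category")

plain = decategorize(pdf)
print(plain.dtypes)
```

With an active SparkSession named spark, as in the session above, spark.createDataFrame(decategorize(pdf)) would then receive no category columns. The cost is that the dictionary encoding is dropped before the data reaches Arrow.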