Posted to issues@spark.apache.org by "Jalpan Randeri (Jira)" <ji...@apache.org> on 2019/11/13 05:19:00 UTC

[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

    [ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973023#comment-16973023 ] 

Jalpan Randeri commented on SPARK-25351:
----------------------------------------

Yes, I will take up these changes as part of my PR.
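
For reference, the rough direction I have in mind (just a sketch of the idea, not the final patch) is that a dictionary-encoded Arrow type should map to the Spark type of its values, which is what the {{from_arrow_type}} helper in the traceback currently refuses to do. The {{arrow_to_spark_sketch}} name below is a stand-in for illustration, and it assumes the pyarrow >= 0.14 API ({{pyarrow.types.is_dictionary}}, {{DictionaryType.value_type}}):

{noformat}
import pyarrow as pa
import pyarrow.types as patypes
from pyspark.sql.types import StringType, LongType

def arrow_to_spark_sketch(at):
    """Tiny stand-in for from_arrow_type, covering only what this example needs."""
    if patypes.is_dictionary(at):
        # Unwrap the dictionary encoding and convert the value type instead,
        # e.g. dictionary<values=string, indices=int8> -> StringType.
        return arrow_to_spark_sketch(at.value_type)
    elif patypes.is_string(at):
        return StringType()
    elif patypes.is_int64(at):
        return LongType()
    raise TypeError("Unsupported type in conversion from Arrow: " + str(at))

# Mirrors the failing case from the issue description.
print(arrow_to_spark_sketch(pa.dictionary(pa.int8(), pa.string())))  # StringType
{noformat}

Whether the dictionary's index type or ordering needs to be preserved on the Spark side is still an open question for the PR.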

> Handle Pandas category type when converting from Python with Arrow
> ------------------------------------------------------------------
>
>                 Key: SPARK-25351
>                 URL: https://issues.apache.org/jira/browse/SPARK-25351
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.3.1
>            Reporter: Bryan Cutler
>            Priority: Major
>              Labels: bulk-closed
>
> There needs to be some handling of category types when calling {{createDataFrame}} with Arrow enabled, or for the return value of {{pandas_udf}}. Without Arrow, Spark casts each element to its category value (a plain string in the example below), so the conversion succeeds; with Arrow enabled it currently raises a TypeError. For example (a possible interim workaround is sketched after the session log):
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>    A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A      object
> B    category
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>    1667         spark_type = ArrayType(from_arrow_type(at.value_type))
>    1668     else:
> -> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
>    1670     return spark_type
>    1671 
> TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
> {noformat}
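>
> A possible user-side workaround in the meantime (just a sketch, assuming pandas >= 0.23; the {{decategorize}} helper is made up for illustration) is to cast each categorical column back to the dtype of its categories before the conversion, so Arrow never sees a dictionary-encoded column:
> {noformat}
> import pandas as pd
>
> # Uses the `pdf` DataFrame and `spark` session from the example above.
> def decategorize(pdf):
>     """Return a copy with every categorical column cast back to its categories' dtype."""
>     out = pdf.copy()
>     for col in out.columns:
>         if pd.api.types.is_categorical_dtype(out[col]):
>             # Here: B goes from category back to object (plain strings).
>             out[col] = out[col].astype(out[col].cat.categories.dtype)
>     return out
>
> spark.conf.set("spark.sql.execution.arrow.enabled", True)
> df = spark.createDataFrame(decategorize(pdf))  # B is inferred as string, no TypeError
> {noformat}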



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org