Posted to issues@spark.apache.org by "Li Jin (JIRA)" <ji...@apache.org> on 2018/04/09 20:40:00 UTC

[jira] [Comment Edited] (SPARK-23929) pandas_udf schema mapped by position and not by name

    [ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431188#comment-16431188 ] 

Li Jin edited comment on SPARK-23929 at 4/9/18 8:39 PM:
--------------------------------------------------------

I think there are pros and cons to both matching by position and matching by name.

Matching by position gives the user the flexibility of not having to spell out column names in the UDF, e.g.
{code:java}
@pandas_udf("id long, v double, v1 double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    id = pdf.id
    vs = ...  # compute the v and v1 values
    return pd.DataFrame([[id] + vs])
{code}
Matching by name gives the user the flexibility of reordering columns. Admittedly, the choice is somewhat arbitrary now, but I am also not sure that one is strictly better. [~omri374] in what case would you have an out-of-order return value in your UDF? I am trying to see whether that is more common.
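Regardless of which rule wins, a UDF author can stay safe under either mapping by explicitly ordering the returned columns to match the declared schema. A minimal sketch in plain pandas (no Spark needed; the column names and the computation here are illustrative, not taken from the issue):

```python
import pandas as pd

# Assumed schema order, mirroring a declaration like
# "id long, v double, v1 double" (illustrative).
SCHEMA_ORDER = ["id", "v", "v1"]

def normalize(pdf):
    # Placeholder computation: derive v1 from v.
    out = pdf.assign(v1=pdf.v * 2)
    # Select columns in schema order, so the result is correct even
    # under match-by-position semantics.
    return out[SCHEMA_ORDER]

pdf = pd.DataFrame({"v": [1.0, 2.0], "id": [1, 1]})
print(list(normalize(pdf).columns))  # ['id', 'v', 'v1']
```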



> pandas_udf schema mapped by position and not by name
> ----------------------------------------------------
>
>                 Key: SPARK-23929
>                 URL: https://issues.apache.org/jira/browse/SPARK-23929
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: PySpark
> Spark 2.3.0
>  
>            Reporter: Omri
>            Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by name. Currently, that is not the case.
> Consider these two examples, where the only change is the order of the fields in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}
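The first result can be reproduced by emulating positional mapping in plain pandas (an illustrative sketch, not Spark's actual code path): the returned frame's first column (id) lands in the "v" slot of the schema, and the normalized v values are cast to long in the "id" slot, truncating them toward zero:

```python
import pandas as pd

# Sketch of positional schema mapping (illustrative only): column i of
# the returned frame is bound to field i of the declared schema and
# cast to that field's type.
def positional_map(pdf, fields):
    return pd.DataFrame({
        name: pdf.iloc[:, i].astype(dtype)
        for i, (name, dtype) in enumerate(fields)
    })

# The UDF returns columns in (id, v) order, but the schema declares
# "v double, id long"; values approximate the normalized output above.
returned = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                         "v": [-0.7071, 0.7071, -0.8321, -0.2774, 1.1094]})
mapped = positional_map(returned, [("v", "float64"), ("id", "int64")])
print(mapped["v"].tolist())   # [1.0, 1.0, 2.0, 2.0, 2.0] -- ids in the v slot
print(mapped["id"].tolist())  # [0, 0, 0, 0, 1] -- v values truncated to long
```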



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org