Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/06/12 04:51:00 UTC

[jira] [Commented] (SPARK-31930) Pandas_udf does not properly return ArrayType

    [ https://issues.apache.org/jira/browse/SPARK-31930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133921#comment-17133921 ] 

Hyukjin Kwon commented on SPARK-31930:
--------------------------------------

Seems like it depends on which version you use. I can't reproduce this in the latest master:

{code}
+-----+----------------------------------------------------------------+
|group|list_col                                                        |
+-----+----------------------------------------------------------------+
|B    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
|C    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
|A    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
+-----+----------------------------------------------------------------+
{code}

It would be good to identify which JIRA fixed this and see whether we can backport it. Alternatively, it might have been fixed by a newer version of pyarrow or pandas.
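Since the behaviour seems to depend on versions, it may help to record the installed versions on both the affected cluster and an environment where the output is correct, and compare them. A minimal sketch, assuming Python 3.8+ (for importlib.metadata); installed_versions is a hypothetical helper name used only for illustration, not part of Spark:

```python
from importlib import metadata


def installed_versions(packages=("pyspark", "pyarrow", "pandas")):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this Python environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package so the output can be pasted into the ticket.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this on the driver of each environment gives a small table of versions to attach to the ticket, which should narrow down whether the difference is in Spark itself or in pyarrow/pandas.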

> Pandas_udf does not properly return ArrayType
> ---------------------------------------------
>
>                 Key: SPARK-31930
>                 URL: https://issues.apache.org/jira/browse/SPARK-31930
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: Azure Databricks
>            Reporter: Julia Maddalena
>            Priority: Major
>
> Returning an ArrayType() from a pandas_udf consistently drops specific list elements from the result.
> We were able to create a reproducible example, shown below.
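> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import ArrayType, LongType
>
> df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 10)], ['group', 'val'])
>
> # Grouped-aggregate pandas UDF that ignores its input and returns a fixed nested list
> @pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
> def get_list(x):
>     return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]
>
> df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False)
> {code}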
> {code}
> +-----+-----------------------------+
> |group|list_col                     |
> +-----+-----------------------------+
> |B    |[[1, 1],,,,,, [7, 7], [8, 8]]|
> |C    |[[1, 1],,,,,, [7, 7], [8, 8]]|
> |A    |[[1, 1],,,,,, [7, 7], [8, 8]]|
> +-----+-----------------------------+
> {code}
>
> In every example we've come up with, elements 2-6 are consistently replaced with None (along with some later elements).
