Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2019/04/30 22:50:00 UTC

[jira] [Comment Edited] (SPARK-27519) Pandas udf corrupting data

    [ https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830743#comment-16830743 ] 

Bryan Cutler edited comment on SPARK-27519 at 4/30/19 10:49 PM:
----------------------------------------------------------------

Problem does not happen when running the latest master. Marking resolved.


was (Author: bryanc):
Problem does not happen when running the latest master.

> Pandas udf corrupting data
> --------------------------
>
>                 Key: SPARK-27519
>                 URL: https://issues.apache.org/jira/browse/SPARK-27519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Jeff gold
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: Pandas UDF Bug.py
>
>
> While trying to use a pandas UDF, I passed it two columns: a string and a list of lists of strings. The second argument has a structure like this: [['1'],['2'],['3']]
> But when reading the same value inside the UDF, I receive something like this: [['1','2'],['3'],[]]
> I checked, and the same row in the table holds the list with the correct structure; it changes only inside the UDF.
>  
> I don't know why this happens, but I do know it is related to that row being the 10,001st row and the last row in its partition. The pandas batch size is 10,000, so that row was sent alone as a second batch, and that is the only thing that seems to trigger it: having 1 or 2 rows in the second batch of the partition. I was also able to reproduce this with a second batch of 2 rows; in that case the list wasn't changed, except that an empty list was appended to the end.
> Hope you can help me understand what is going on, thanks!
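
To illustrate the batching arithmetic in the description above, here is a minimal pure-Python sketch of how a partition's rows would fall into fixed-size batches. It only models the row counts; it does not call Spark or Arrow and does not reproduce the corruption itself. The helper name `batch_sizes` is hypothetical, and the default of 10,000 is taken from the batch size stated in the report (in Spark this limit is assumed to correspond to the `spark.sql.execution.arrow.maxRecordsPerBatch` setting).

```python
def batch_sizes(num_rows, max_records_per_batch=10_000):
    """Return the row count of each batch when a partition of `num_rows`
    rows is cut into consecutive chunks of at most `max_records_per_batch`.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    if max_records_per_batch <= 0:
        raise ValueError("max_records_per_batch must be positive")
    full_batches, remainder = divmod(num_rows, max_records_per_batch)
    sizes = [max_records_per_batch] * full_batches
    if remainder:
        # The leftover rows form a final, smaller batch.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    # Partition sizes around the boundary described in the report.
    for rows in (10_000, 10_001, 10_002):
        print(rows, batch_sizes(rows))
```

With 10,001 rows in a partition, the last row ends up alone in a second batch, and with 10,002 rows the second batch holds two rows; these are the two cases the reporter associates with the altered list structure.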


