
Posted to issues@spark.apache.org by "Jeff gold (JIRA)" <ji...@apache.org> on 2019/04/19 07:25:00 UTC

[jira] [Created] (SPARK-27519) Pandas udf corrupting data

Jeff gold created SPARK-27519:
---------------------------------

             Summary: Pandas udf corrupting data
                 Key: SPARK-27519
                 URL: https://issues.apache.org/jira/browse/SPARK-27519
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.0
            Reporter: Jeff gold


While using a pandas UDF, I passed it two columns: a string and a list of lists of strings. For example, the second argument had this structure: [['1'],['2'],['3']]

But inside the UDF, I received something like this for the same value: [['1','2'],['3'],[]]

I checked, and the same row in the table holds the list with the correct structure; it changed only inside the UDF.
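
To make the symptom concrete, here is a minimal plain-Python sketch (no Spark involved) comparing the two values quoted above. The helper names `flatten` and `inner_lengths` are made up for this illustration. It only shows that the strings themselves are the same and in the same order while the inner-list boundaries differ; it does not claim to explain why.

```python
def flatten(nested):
    """Concatenate the inner lists into one flat list, preserving order."""
    return [item for inner in nested for item in inner]


def inner_lengths(nested):
    """Return the length of each inner list."""
    return [len(inner) for inner in nested]


# Values taken from the report: what the table holds vs. what the UDF saw.
expected = [['1'], ['2'], ['3']]
received = [['1', '2'], ['3'], []]

# The flattened contents match, so no string was lost or altered.
same_elements = flatten(expected) == flatten(received)
print(same_elements)            # True

# The inner-list boundaries are what changed.
print(inner_lengths(expected))  # [1, 1, 1]
print(inner_lengths(received))  # [2, 1, 0]
```

A check like this inside the UDF, run against a known-good copy of the row, could help confirm whether only the nesting is affected.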

 

I don't know why this happens, but it appears to be related to the fact that the row was the 10,001st row, and the last row, in its partition. The pandas batch size is 10,000, so that row was sent alone as a second batch. That is the only thing that seems to trigger it: having 1 or 2 rows in the second batch of a partition. I was also able to reproduce this with a second batch of 2 rows; in that case the list was unchanged except that an empty list was appended to the end.
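
The batch arithmetic described above can be sketched in plain Python. This assumes the rows of a partition are simply cut into consecutive batches of at most 10,000 rows; the function name `batch_sizes` and its parameter names are hypothetical and not part of any Spark API. As far as I know, the relevant Spark setting is `spark.sql.execution.arrow.maxRecordsPerBatch`, whose default is 10,000.

```python
def batch_sizes(partition_rows, max_records_per_batch=10000):
    """Return the size of each batch when a partition is cut into
    consecutive batches of at most max_records_per_batch rows.

    Assumption for illustration: batches are filled in order and only
    the last one may be smaller.
    """
    full_batches, remainder = divmod(partition_rows, max_records_per_batch)
    sizes = [max_records_per_batch] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes


# A partition of 10,001 rows leaves a trailing batch with a single row.
print(batch_sizes(10001))  # [10000, 1]

# A partition of 10,002 rows leaves a trailing batch with two rows.
print(batch_sizes(10002))  # [10000, 2]
```

Under this assumption, forcing a partition to hold 10,001 or 10,002 rows should be enough to place the affected row in a small trailing batch when trying to reproduce the problem.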

I hope you can help me understand what is going on. Thanks!



