Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/08/16 17:05:00 UTC

[jira] [Commented] (SPARK-32612) int columns produce inconsistent results on pandas UDFs

    [ https://issues.apache.org/jira/browse/SPARK-32612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178543#comment-17178543 ] 

Sean R. Owen commented on SPARK-32612:
--------------------------------------

I don't think it's correct to upgrade it to float in all cases. int overflow is simply what you have to deal with if you assert that you are using a 32-bit data type. Use long instead if you need more bytes.
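
For example (a sketch of that workaround on the caller's side; the casts are illustrative and not part of the reporter's repro):

{code}
>>> from pyspark.sql.functions import col
>>> # Widen the inputs to 64-bit before the arithmetic so the sum cannot wrap
>>> df.select(my_udf(col("a").cast("long") - 3, col("b").cast("long"))).show()
{code}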

> int columns produce inconsistent results on pandas UDFs
> -------------------------------------------------------
>
>                 Key: SPARK-32612
>                 URL: https://issues.apache.org/jira/browse/SPARK-32612
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> This is similar to SPARK-30187, but I personally consider this to be data corruption.
> If I have a simple pandas UDF
> {code}
> >>> from pyspark.sql.functions import pandas_udf, col
> >>> from pyspark.sql.types import LongType, IntegerType, StructType, StructField
> >>> def add(a, b):
> ...     return a + b
> >>> my_udf = pandas_udf(add, returnType=LongType())
> {code}
> And I want to process some data with it, say 32-bit ints
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848),(3,4)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> I get an integer overflow for the data, as I would expect. But as soon as I add a {{None}} to the data, even on a different row, the result I get back is totally different.
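> (For reference, a quick check of the 32-bit wraparound in plain Python; this is illustrative and not part of the repro above:)
> {code}
> >>> total = (1037694399 - 3) + 1204615848
> >>> total
> 2242310244
> >>> total - 2**32  # what a signed 32-bit int wraps to
> -2052657052
> {code}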
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848),(3,None)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> +----------+----------+---------------+
> {code}
> The integer overflow disappears. This is because Arrow and/or pandas changes the data type to a float in order to store the null value, so the processing is done in floating point and there is no overflow. This in and of itself is annoying but understandable, because it is dealing with a limitation in pandas.
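> (The pandas side of this is easy to reproduce; this sketch assumes the default conversion, without Arrow's nullable extension types:)
> {code}
> >>> import pandas as pd
> >>> pd.Series([1204615848, 4], dtype="int32").dtype
> dtype('int32')
> >>> pd.Series([1204615848, None]).dtype
> dtype('float64')
> {code}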
> Where it becomes a bug is that this happens on a per-batch basis. This means that I can have the same two rows in different parts of my data set and get different results depending on their proximity to a null value.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848),(3,None),(1037694399, 1204615848),(3,4)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|     2242310244|
> |         3|         4|              4|
> +----------+----------+---------------+
> >>> spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '2')
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> Personally, I would prefer to have all nullable integer columns upgraded to float all the time; that way it is at least consistent.


