Posted to issues@spark.apache.org by "Baohe Zhang (Jira)" <ji...@apache.org> on 2021/02/25 17:31:00 UTC

[jira] [Updated] (SPARK-34545) PySpark Python UDF returns inconsistent results when applying UDFs to 2 columns together

     [ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Baohe Zhang updated SPARK-34545:
--------------------------------
    Description: 
A Python UDF returns inconsistent results depending on whether the 2 UDF columns are evaluated together or one by one.

The issue started occurring after we upgraded to Spark 3, so it seems it does not exist in Spark 2.

How to reproduce it?

{code:python}
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType, IntegerType

# Each cell holds a list of (value, time) structs.
df = spark.createDataFrame(
    [
        ([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
        ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
        ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")]),
    ],
    ["c1", "c2"],
)

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        """list_elm should be a list of (val, time) structs; val can be a
        single element or a list."""
        y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}
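
A possibly relevant detail (an observation only, not a confirmed root cause): createDataFrame infers the value field of the c2 elements as bigint, since Python ints map to LongType, while the UDF applied to c2 declares IntegerType as its return type. The inferred schema can be checked with:

{code:python}
# Not part of the original report: print the inferred schema.
# c1 elements are <double, string> structs, c2 elements are <bigint, string>
# structs, because createDataFrame infers LongType for Python ints.
df.printSchema()
{code}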

Results:
{noformat}
>>> df.select("c3").show()
+---+                                                                           
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
|2.0|null|
|3.1|   3|
+---+----+
{noformat}
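
One way to dig further (a suggested check, not something from the original report) is to compare the physical plans of the three queries. In Spark 3, adjacent Python UDF evaluations are typically collapsed into a single BatchEvalPython node, so selecting c3 and c4 together evaluates both UDFs in one Python worker batch, which is the case that differs from the single-column queries:

{code:python}
# Hypothetical follow-up check: inspect the physical plans. Selecting both UDF
# columns usually shows one BatchEvalPython node evaluating the two UDFs
# together, while the single-column queries each get their own node.
df.select("c3").explain()
df.select("c4").explain()
df.select("c3", "c4").explain()
{code}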

The test was done on branch-3.1 in local mode.
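
For anyone reproducing this outside the pyspark shell (the shell defines spark automatically), a minimal local session can be created first; the builder settings below are only a sketch and the app name is a placeholder:

{code:python}
# Minimal sketch, assuming the snippet runs as a standalone script rather than
# in the pyspark shell, which already provides `spark`.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")             # local mode, matching how the test was run
    .appName("SPARK-34545-repro")   # placeholder app name
    .getOrCreate()
)
{code}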


> PySpark Python UDF returns inconsistent results when applying UDFs to 2 columns together
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-34545
>                 URL: https://issues.apache.org/jira/browse/SPARK-34545
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Baohe Zhang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org