Posted to issues@spark.apache.org by "Chongyuan Xiang (JIRA)" <ji...@apache.org> on 2018/09/19 04:50:00 UTC

[jira] [Created] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

Chongyuan Xiang created SPARK-25461:
---------------------------------------

             Summary: PySpark Pandas UDF outputs incorrect results when input columns contain None
                 Key: SPARK-25461
                 URL: https://issues.apache.org/jira/browse/SPARK-25461
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.1
         Environment: I reproduced this issue by running pyspark locally on mac:

Spark version: 2.3.1 pre-built with Hadoop 2.7

Python library versions: pyarrow==0.10.0, pandas==0.20.2
            Reporter: Chongyuan Xiang


The following PySpark script uses a simple pandas UDF to compute a boolean column from column 'A'. When column 'A' contains None, the UDF output is incorrect: rows where A is 2.0 come back as false, although A >= 2 should evaluate to true.

Script: 

 
{code:java}
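import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

# `spark` is the SparkSession that the pyspark shell provides.
# Build 6,200,000 rows: 30,000 None, 170,000 of 1.0 and 6,000,000 of 2.0, shuffled.
values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

# Despite its name, gt_2 tests column >= 2; rows where A is null become missing.
@pandas_udf(returnType=pyspark.sql.types.BooleanType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
    .withColumn('potential_bad_col', gt_2('A'))
)

# The same comparison in native Spark SQL, for reference (true when A is null).
calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))

calculated_df.show()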
{code}
 

Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|1.0|            false|      false|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
+---+-----------------+-----------+
only showing top 20 rows
{code}
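To help separate pandas behaviour from the Arrow/Spark conversion, the UDF body can be run on a small pandas Series outside Spark. This is only a diagnostic sketch: the three-element input is made up for illustration, and the resulting dtype may differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

# Hypothetical miniature of column 'A': one missing value, one 1.0, one 2.0.
column = pd.Series([None, 1.0, 2.0])

# Same expression as the body of gt_2 in the script above.
result = (column >= 2).where(column.notnull())

# The missing row stays missing; the other rows keep the comparison result.
print(result.dtype)
print(result.tolist())
```

If the printed dtype is not a plain boolean dtype, that may be worth noting alongside this report, since the UDF declares BooleanType as its return type.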
This problem disappears when the number of rows is small or when the input column does not contain None.
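To compare these conditions, the input list can be parameterised by the count of each value. The helper below is a hypothetical convenience, not part of the original script; the "small" and "no_none" sizes are arbitrary examples, and only the "reported" counts come from the script above.

```python
import random

def make_values(n_none, n_one, n_two, seed=None):
    """Return a shuffled list with the given counts of None, 1.0 and 2.0."""
    values = [None] * n_none + [1.0] * n_one + [2.0] * n_two
    random.Random(seed).shuffle(values)
    return values

# Counts of (None, 1.0, 2.0); the first entry matches the report.
CASES = {
    "reported": (30000, 170000, 6000000),
    "small": (30, 170, 6000),          # hypothetical smaller variant
    "no_none": (0, 170000, 6000000),   # hypothetical variant without None
}

# Build only the small case here; the larger ones take longer to shuffle.
small = make_values(*CASES["small"], seed=0)
print(len(small), small.count(None))
```

Each list can then be passed to pd.DataFrame({'A': values}) and spark.createDataFrame as in the script, to check which cases produce a wrong potential_bad_col.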


