Posted to issues@spark.apache.org by "Chongyuan Xiang (JIRA)" <ji...@apache.org> on 2018/09/19 04:50:00 UTC
[jira] [Created] (SPARK-25461) PySpark Pandas UDF outputs incorrect
results when input columns contain None
Chongyuan Xiang created SPARK-25461:
---------------------------------------
Summary: PySpark Pandas UDF outputs incorrect results when input columns contain None
Key: SPARK-25461
URL: https://issues.apache.org/jira/browse/SPARK-25461
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.3.1
Environment: I reproduced this issue by running pyspark locally on mac:
Spark version: 2.3.1 pre-built with Hadoop 2.7
Python library versions: pyarrow==0.10.0, pandas==0.20.2
Reporter: Chongyuan Xiang
The following PySpark script uses a simple pandas UDF to compute a boolean column from column 'A'. When column 'A' contains None, the UDF returns incorrect results: in the output below, it returns false for rows where 'A' is 2.0.
Script:
{code:python}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf
values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)
@pandas_udf(returnType=pyspark.sql.types.BooleanType())
def gt_2(column):
    # Despite the name, this tests column >= 2; null inputs stay null.
    return (column >= 2).where(column.notnull())
calculated_df = (df.select(['A'])
    .withColumn('potential_bad_col', gt_2('A'))
)
calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
calculated_df.show()
{code}
Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|1.0|            false|      false|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
+---+-----------------+-----------+
only showing top 20 rows
{code}
This problem disappears when the number of rows is small or when the input column does not contain None.
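As a sanity check, the body of the UDF can be run on a plain pandas Series, outside Spark. The sketch below is not part of the original report: the helper name gt_2_pandas and the four-element sample are hypothetical, chosen only for illustration. It suggests that the pandas expression itself handles nulls as intended, but it does not identify where the incorrect values come from.

```python
import pandas as pd


def gt_2_pandas(column):
    # Same expression as the body of the gt_2 UDF, without the Spark decorator.
    return (column >= 2).where(column.notnull())


# Hypothetical sample: one null followed by three non-null values.
sample = pd.Series([None, 1.0, 2.0, 3.0], dtype="float64")
result = gt_2_pandas(sample)

# The null input stays null; non-null inputs are compared against 2.
print(result)
```

On this sample the null row remains null and the rows with 2.0 and 3.0 evaluate to True, which is the behavior expected of potential_bad_col.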