Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2018/09/26 08:43:00 UTC
[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

    [ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628428#comment-16628428 ] 

Liang-Chi Hsieh edited comment on SPARK-25461 at 9/26/18 8:42 AM:
------------------------------------------------------------------

When your data contains no None values, do you still apply {{.where(column.notnull())}} to the pandas Series?

{{.where(column.notnull())}} returns a pandas Series of float64 type, not bool. It looks like float values are interpreted as false when the pandas UDF's return type is BooleanType:
{code:java}
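import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
values = [1.0] * 5 + [2.0] * 5
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

# Return the float64 input unchanged while declaring a boolean return type.
@pandas_udf(returnType=BooleanType())
def to_boolean(column):
    return column

df.select(['A']).withColumn('to_boolean', to_boolean('A')).show()

+---+----------+
|  A|to_boolean|
+---+----------+
|1.0|     false|
|1.0|     false|
|1.0|     false|
|1.0|     false|
|1.0|     false|
|2.0|     false|
|2.0|     false|
|2.0|     false|
|2.0|     false|
|2.0|     false|
+---+----------+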
{code}
That would explain why the {{gt_2}} pandas UDF in your script returns false for both 1.0 and 2.0 in your input data.
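
As a rough sketch of how to check this on the pandas side (plain pandas only, not verified against Spark 2.3.1), you can inspect the dtype that the UDF body produces and return a Series that is already bool. The helper name {{gt_2_as_bool}} is made up for illustration, and it maps missing inputs to False instead of null, which may or may not be what you want:

```python
import numpy as np
import pandas as pd


def gt_2_as_bool(column: pd.Series) -> pd.Series:
    """Hypothetical variant of gt_2 that always returns a bool-dtype Series.

    In pandas, NaN >= 2 evaluates to False, so missing inputs become False
    here rather than null. That is a simplification, not the original logic.
    """
    result = column >= 2
    # Fail loudly if the comparison did not produce a boolean Series.
    if result.dtype != bool:
        raise TypeError("expected bool dtype, got {}".format(result.dtype))
    return result


column = pd.Series([1.0, 2.0, np.nan, 3.0])

# The original UDF body: masking a null into a bool Series forces pandas
# to pick a different dtype, because a NumPy bool array cannot hold NaN.
masked = (column >= 2).where(column.notnull())
print("masked dtype:", masked.dtype)

# The sketch above keeps the dtype boolean.
as_bool = gt_2_as_bool(column)
print("bool dtype:  ", as_bool.dtype)
print(as_bool.tolist())
```

If the body of {{gt_2}} returned a Series like {{as_bool}}, the declared BooleanType would at least match the dtype of the data handed back to Spark. Whether that is enough to get the expected column is something to confirm on your setup.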


> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>     .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input column does not contain None.



