Posted to issues@spark.apache.org by "dronzer (Jira)" <ji...@apache.org> on 2023/03/23 05:22:00 UTC

[jira] [Updated] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.

     [ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dronzer updated SPARK-42905:
----------------------------
    Attachment: image-2023-03-23-10-51-28-420.png

> pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-42905
>                 URL: https://issues.apache.org/jira/browse/SPARK-42905
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.3.0
>            Reporter: dronzer
>            Priority: Blocker
>         Attachments: image-2023-03-23-10-51-28-420.png
>
>
> pyspark.ml.stat.Correlation
> The following is a scenario in which the Correlation function returns incorrect Spearman coefficient results.
> Tested example: a Spark DataFrame with 2 columns, A and B, each 108 million rows long.
> Column A has 3 distinct values; Column B has 4 distinct values.
> If I calculate the correlation for this DataFrame with pandas DF.corr, it gives the correct answer, and running the same code multiple times always produces that same answer.
> !image-2023-03-23-10-38-49-071.png|width=526,height=258!
>  
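> For reference, a minimal sketch of how data with this shape can be built and checked in pandas (the column names A/B and the random value mix are illustrative, not the actual data, and the row count is scaled down from 108M for a quick local run):
> {code:python}
> import numpy as np
> import pandas as pd
>
> rng = np.random.default_rng(42)
> n = 1_000_000  # scaled down from the reported 108M rows
> pdf = pd.DataFrame({
>     "A": rng.integers(0, 3, size=n),   # 3 distinct values -> many ties
>     "B": rng.integers(0, 4, size=n),   # 4 distinct values -> many ties
> })
> # pandas assigns average ranks to ties, so repeated runs give the same matrix
> print(pdf.corr(method="spearman"))
> {code}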
> Running the same computation in Spark with the Spearman correlation method produces *different results* for the *same DataFrame* across multiple runs. (see below)
> !image-2023-03-23-10-41-38-696.png|width=527,height=329!
>  
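> A rough sketch of the failing Spark path, assuming the two columns are assembled into a vector column before calling Correlation.corr (the exact code in the screenshots is not reproduced here, and the synthetic data below is a scaled-down stand-in for the real 108M-row DataFrame):
> {code:python}
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.stat import Correlation
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
>
> spark = SparkSession.builder.getOrCreate()
>
> # Synthetic stand-in with heavy ties: 3 and 4 distinct values.
> sdf = (spark.range(1_000_000)
>             .select((F.rand(seed=1) * 3).cast("int").alias("A"),
>                     (F.rand(seed=2) * 4).cast("int").alias("B")))
>
> features = (VectorAssembler(inputCols=["A", "B"], outputCol="features")
>             .transform(sdf).select("features"))
>
> # Re-running this line on the same data is where the report observes
> # a different Spearman matrix from run to run.
> print(Correlation.corr(features, "features", method="spearman").head()[0])
> {code}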
> In short: pandas df.corr gives the same result for the same DataFrame on every run, which is the expected behaviour. Spark, on the same data, not only gives a different result, it also gives a different result each time the same cell is re-run, so the output is inconsistent.
> Looking at the data, the only notable property is the huge number of ties (only 3-4 distinct values over 108M rows). Spark's Correlation method does not appear to handle this scenario, whereas the same data in python with df.corr produces consistent results.
> The only workaround we could find that gives consistent output, matching the python result, is a Pandas UDF, as shown below:
> !image-2023-03-23-10-48-01-045.png|width=554,height=94!
> !image-2023-03-23-10-48-55-922.png|width=568,height=301!
>  
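> Since the workaround itself is only visible in the attached screenshots, here is a sketch of one way such a Pandas-UDF-style workaround can be written (the actual UDF in the screenshots may differ); it pushes the computation into pandas via applyInPandas so that pandas' tie handling is used:
> {code:python}
> import pandas as pd
> import pyspark.sql.functions as F
>
> # sdf is the two-column Spark DataFrame from the previous sketch.
> def spearman_corr(pdf: pd.DataFrame) -> pd.DataFrame:
>     # pandas' Spearman implementation uses average ranks for ties,
>     # so repeated runs give the same value.
>     return pd.DataFrame({"spearman": [pdf["A"].corr(pdf["B"], method="spearman")]})
>
> result = (sdf.withColumn("g", F.lit(1))   # single group -> single pandas frame
>              .groupBy("g")
>              .applyInPandas(spearman_corr, schema="spearman double"))
> result.show()
> {code}
> Note that this collects the whole group into one pandas DataFrame on a single executor, so it only works while the data fits in that executor's memory.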
> We also tried the pyspark.pandas.DataFrame.corr method, roughly as sketched below, and it produces incorrect and inconsistent results for this case too.
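> {code:python}
> # sdf is the two-column Spark DataFrame from the earlier sketch.
> psdf = sdf.pandas_api()            # pandas-on-Spark DataFrame
> print(psdf.corr(method="spearman"))
> {code}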
> Only the Pandas UDF approach seems to provide consistent results.
>  



