Posted to issues@spark.apache.org by "Jiayi Liu (Jira)" <ji...@apache.org> on 2023/11/08 05:39:00 UTC

[jira] [Created] (SPARK-45834) Fix Pearson correlation calculation more stable

Jiayi Liu created SPARK-45834:
---------------------------------

             Summary: Fix Pearson correlation calculation more stable
                 Key: SPARK-45834
                 URL: https://issues.apache.org/jira/browse/SPARK-45834
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Jiayi Liu


Spark uses the formula {{ck / sqrt(xMk * yMk)}} to calculate the Pearson correlation coefficient. If {{xMk}} and {{yMk}} are very small, their product can underflow to 0 in double precision, which makes the denominator 0. The division then yields Infinity (or NaN when {{ck}} is also 0) instead of a valid coefficient.

For example, when calculating the correlation between the two identical columns a and b in the table below, the result is Infinity, although the correlation of identical columns should be 1.0.
||a||b||
|1e-200|1e-200|
|1e-200|1e-200|
|1e-100|1e-100|

Changing the formula to {{ck / sqrt(xMk) / sqrt(yMk)}} resolves this issue and makes the calculation more stable. Taking the square roots of {{xMk}} and {{yMk}} separately, instead of the square root of their product, means that no intermediate product is formed that could overflow or underflow to zero.
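
The effect can be reproduced outside Spark with a small sketch. This is not Spark's implementation: the helper names ({{co_moments}}, {{ieee_div}}, {{corr_product_form}}, {{corr_split_form}}) are hypothetical, the co-moments are computed with a plain two-pass sum rather than Spark's incremental update, and {{ieee_div}} only exists because Python raises an exception on division by 0.0 where the JVM would return Infinity or NaN.

```python
import math


def co_moments(xs, ys):
    """Two-pass computation of ck, xMk, yMk (assumed definitions:
    sum of cross-deviations and sums of squared deviations)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    ck = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_mk = sum((x - x_mean) * (x - x_mean) for x in xs)
    y_mk = sum((y - y_mean) * (y - y_mean) for y in ys)
    return ck, x_mk, y_mk


def ieee_div(a, b):
    """Mimic IEEE 754 double division; Python raises on division by 0.0."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan
    return math.copysign(math.inf, a) * math.copysign(1.0, b)


def corr_product_form(xs, ys):
    """ck / sqrt(xMk * yMk): the product can underflow to 0."""
    ck, x_mk, y_mk = co_moments(xs, ys)
    return ieee_div(ck, math.sqrt(x_mk * y_mk))


def corr_split_form(xs, ys):
    """ck / sqrt(xMk) / sqrt(yMk): no product of the two moments is formed."""
    ck, x_mk, y_mk = co_moments(xs, ys)
    return ieee_div(ieee_div(ck, math.sqrt(x_mk)), math.sqrt(y_mk))


# The columns from the table above.
a = [1e-200, 1e-200, 1e-100]
b = [1e-200, 1e-200, 1e-100]

_, x_mk, y_mk = co_moments(a, b)
print(x_mk * y_mk)              # 0.0: the product underflows
print(corr_product_form(a, b))  # inf
print(corr_split_form(a, b))    # approximately 1.0
```

With this data each moment is on the order of 1e-201, which is representable as a double, but the product of two such values is far below the smallest positive double, so only the product form breaks down.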
 
 


