You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "liujiayi771 (via GitHub)" <gi...@apache.org> on 2023/11/08 05:53:27 UTC

[PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

liujiayi771 opened a new pull request, #43711:
URL: https://github.com/apache/spark/pull/43711

### What changes were proposed in this pull request?
Modify the calculation formula of Pearson correlation.

### Why are the changes needed?
Spark uses the formula `ck / sqrt(xMk * yMk)` to calculate the Pearson Correlation Coefficient. If `xMk` and `yMk` are very small, it can lead to double multiplication overflow, resulting in a denominator of 0. This leads to an Infinity result in the calculation.

For example, when calculating the correlation for the same columns a and b in a table, the result will be Infinity, but the correlation for identical columns should be 1.0 instead.

a | b
-- | --
1e-200 | 1e-200
1e-200 | 1e-200
1e-100 | 1e-100

Modifying the formula to `ck / sqrt(xMk) / sqrt(yMk)` can indeed solve this issue and improve the stability of the calculation. The benefit of this modification is that it splits the square root of the denominator into two parts: `sqrt(xMk)` and `sqrt(yMk)`. This helps avoid multiplication overflow or cases where the product of extremely small values becomes zero.

### Does this PR introduce _any_ user-facing change?
Previously, the corr result for two columns that used to be equal to 1.0 may now become 0.9999999999999999.

### How was this patch tested?
Add a unit test case in DataFrameStatSuite.

### Was this patch authored or co-authored using generative AI tooling?
No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1398705219


##########
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##########
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
 -- !query output
-1.0	1.0	3
+0.9999999999999999	0.9999999999999999	3

Review Comment:
   @beliefer Different mainstream databases have both of these calculation formulas. But I now believe that there is no need to modify Spark's code for this extreme case because Spark's formula can easily solve for finite decimals.
   Thanks, I will close this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387436911


##########
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##########
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
 -- !query output
-1.0	1.0	3
+0.9999999999999999	0.9999999999999999	3

Review Comment:
   You can reference the output of other mainstream databases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1802177820

   After my modification, the original result of 1.0 has now become 0.9999999999999999. Although they are numerically equivalent, it looks a bit strange.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1801777959

   cc @beliefer FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1801148109

   CC: @FelixYBW


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1801761277

   Seems the test failed related to this pr?
   
   https://github.com/liujiayi771/spark/actions/runs/6794362848/job/18471276466
   
   ![image](https://github.com/apache/spark/assets/1475305/1ff021c0-57ed-4986-9ce5-833a574a1bae)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387434023


##########
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##########
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
 -- !query output
-1.0	1.0	3
+0.9999999999999999	0.9999999999999999	3

Review Comment:
   The result is not incorrect, it is just a precision issue with double. For example,
   ```java
   2 / Math.sqrt(2 * 2) = 1.0
   2 / Math.sqrt(2) * Math.sqrt(2) = 0.9999999999999999
   ```
    From the user's perspective, 1.0 is more user-friendly. 
   I am currently unsure about whether to sacrifice user-friendliness in order to support an extreme case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387434023


##########
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##########
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
 -- !query output
-1.0	1.0	3
+0.9999999999999999	0.9999999999999999	3

Review Comment:
   The result is not incorrect, it is just a precision issue with double. For example,
   ```java
   2 / Math.sqrt(2 * 2) = 1.0
   2 / Math.sqrt(2) / Math.sqrt(2) = 0.9999999999999999
   ```
    From the user's perspective, 1.0 is more user-friendly. 
   I am currently unsure about whether to sacrifice user-friendliness in order to support an extreme case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 closed pull request #43711: [SPARK-45834][SQL] Fix Pearson correlation calculation more stable
URL: https://github.com/apache/spark/pull/43711


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1801786561

   @LuciferYang The modifications have caused a change in the precision of the calculation results. I will fix them all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1801849084

   @liujiayi771 Thank you for the fix. I will see later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "liujiayi771 (via GitHub)" <gi...@apache.org>.

liujiayi771 commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1803054998

   The potential side effect of this modification is that it is easier to obtain a finite number for `sqrt(xMk * yMk)`, while `sqrt(xMk) * sqrt(yMk)` can easily result in an infinite number, for example,
   ```java
   Math.sqrt(2 * 2) = 2.0
   Math.sqrt(2) * Math.sqrt(2) = 2.0000000000000004
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387423207


##########
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##########
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct<corr(DISTINCT x, y):double,corr(DISTINCT y, x):double,count(1):bigint>
 -- !query output
-1.0	1.0	3
+0.9999999999999999	0.9999999999999999	3

Review Comment:
   I guess `0.9999999999999999` is incorrect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org