Posted to reviews@spark.apache.org by "xiaoa6435 (via GitHub)" <gi...@apache.org> on 2023/08/10 14:38:55 UTC

[GitHub] [spark] xiaoa6435 opened a new pull request, #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

xiaoa6435 opened a new pull request, #42431:
URL: https://github.com/apache/spark/pull/42431

   ### What changes were proposed in this pull request?
   
   Don't flush when the ties size exceeds 10_000_000; the old behavior produces wrong average ranks for tied values.
   
   ### Why are the changes needed?
   
   When the data has a huge number of ties (> 10_000_000), the old implementation gives incorrect and inconsistent results. For example:
   
   #### scala
   ```scala
   // Run in spark-shell, where `sc`, `spark`, and its implicits are in scope.
   import org.apache.spark.ml.linalg.{Matrix, Vectors}
   import org.apache.spark.ml.stat.Correlation
   import org.apache.spark.sql.Row
   import spark.implicits._
   
   val N = 10000002
   val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0)
   val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0)
   val df = x.zip(y)
     .map { case (x, y) => Vectors.dense(x, y) }
     .map(Tuple1.apply)
     .repartition(1)
     .toDF("features")
   
   val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head
   val r = coeff1(0, 1)
   println(s"spearman correlation in spark: $r")
   // spearman correlation in spark: -9.999990476024495E-8
   ```
   
   The current implementation returns -9.999990476024495E-8 (and the value is unstable across runs), while the correct result is -1.0, as given by R, Python, and the manual average-rank calculation sketched below.
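   
   For reference, a minimal sketch of that manual calculation (plain Scala, not Spark; the `pearson` helper below is ad hoc for this illustration). With average ranks, the N - 1 ties at 1.0 in x all share rank (1 + 2 + ... + (N - 1)) / (N - 1) = N / 2, the lone 2.0 gets rank N, and y is the mirror image, so the Pearson correlation of the two rank vectors is exactly -1.0 for any N of this shape; a small N suffices to check it:
   
   ```scala
   // Pearson correlation of two sequences (ad-hoc helper for this sketch only).
   def pearson(a: Seq[Double], b: Seq[Double]): Double = {
     val n = a.length
     val (ma, mb) = (a.sum / n, b.sum / n)
     val cov = a.zip(b).map { case (x, y) => (x - ma) * (y - mb) }.sum
     val va = a.map(x => (x - ma) * (x - ma)).sum
     val vb = b.map(y => (y - mb) * (y - mb)).sum
     cov / math.sqrt(va * vb)
   }
   
   val n = 12 // any N with the same tie structure gives the same answer
   val rankX = Seq.fill(n - 1)(n / 2.0) :+ n.toDouble // the ties share rank N / 2
   val rankY = Seq.fill(n - 1)((n + 2) / 2.0) :+ 1.0  // mirror image of x
   println(pearson(rankX, rankY)) // -1.0 (Spearman = Pearson on ranks)
   ```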
   
   #### r
   ```r
   N = 10000002
   x = ifelse(0:(N - 1) < N - 1, 1.0, 2.0)
   y = ifelse(0:(N - 1) < N - 1, 2.0, 1.0)
   r = cor(x, y, method = 'spearman') 
   sprintf("spearman correlation in r: %0.6f", r)
   # spearman correlation in r: -1.000000
   ```
   
   #### python
   ```python
   import numpy as np
   from scipy import stats
   
   N = 10000002
   x = np.array([1.0 if i < N - 1 else 2.0 for i in range(N)])
   y = np.array([2.0 if i < N - 1 else 1.0 for i in range(N)])
   r = stats.spearmanr(x, y).correlation
   print(f"spearman correlation in python scipy.stats: {r}")
   # spearman correlation in python scipy.stats: -1.0
   ```
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Added a new test.




[GitHub] [spark] WeichenXu123 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1290829879


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   What happens if we don't flush when `cachedUids.size >= 10000000`? Does an OOM occur?





[GitHub] [spark] xiaoa6435 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "xiaoa6435 (via GitHub)" <gi...@apache.org>.
xiaoa6435 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1290839384


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   Capping `cachedUids.size` at 10000000 can't prevent OOM anyway: `sortByKey` will pull all of the tied values into a single partition.
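   
   A minimal sketch of that skew (spark-shell Scala; the key shape mirrors the `((col, value), uid)` keys, and the counts are illustrative):
   
   ```scala
   // One dominant key defeats range partitioning: the sampler sees (0, 1.0)
   // almost everywhere, so nearly every record lands in the same post-sort
   // partition -- no cap on cachedUids.size can help with that.
   // Assumes a running spark-shell with `sc` in scope.
   val skewed = sc.parallelize(Seq.fill(100000)(((0, 1.0), 0L)) ++ Seq(((0, 2.0), 1L)), 8)
   val sizes = skewed.sortByKey().mapPartitions(it => Iterator(it.size)).collect()
   println(sizes.mkString(", ")) // almost all 100001 records sit in one partition
   ```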





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1291076000


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   > flush of cachedUids.size >= N can't generate approximate results. 
   
   emm, then the logic does not make sense.. 





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1291046584


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   With the flush at `cachedUids.size >= N`, does it generate an approximate result? If so, I suggest we provide a config for the `cachedUids` size.





[GitHub] [spark] xiaoa6435 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "xiaoa6435 (via GitHub)" <gi...@apache.org>.
xiaoa6435 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1292926346


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   > I think this limitation is on purpose.
   > 
   > I think we could either 1, document this limitation; 2, or add a new parameter for it.
   > 
   > also cc @srowen @WeichenXu123
   
   The intent may be to prevent OOM, but it causes other problems:
   
   - Flushing when the ties size exceeds N does not give an approximate result; in many cases the output is far from correct, because tied values end up with distinct average ranks (see the toy sketch below).
   - Ties sizes > 10_000_000 are not uncommon: real-world data tends to be skewed, with a substantial number of zero values.
   - Alternative workarounds may need multiple traversal iterations.
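   
   A toy illustration of the first point (plain Scala, with a flush threshold of 3 standing in for the real 10_000_000):
   
   ```scala
   // Five tied values: the correct average rank for every one of them is
   // (1 + 2 + 3 + 4 + 5) / 5 = 3.0.
   val ranks = (1 to 5).map(_.toDouble)
   val correct = ranks.map(_ => ranks.sum / ranks.size)
   // Vector(3.0, 3.0, 3.0, 3.0, 3.0)
   
   // Flushing after every 3 cached ids splits the one tie group into two
   // groups with different "average" ranks -- not an approximation, a
   // structurally different answer.
   val flushed = ranks.grouped(3).flatMap(g => g.map(_ => g.sum / g.size)).toVector
   // Vector(2.0, 2.0, 2.0, 4.5, 4.5)
   ```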
   





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "WeichenXu123 (via GitHub)" <gi...@apache.org>.
WeichenXu123 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1293508065


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   Got it. I agree to remove the size > 10_000_000 condition here.
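   
   For the record, a self-contained sketch of the loop without the size condition (simplified to a single column with a precomputed sorted order; names mirror the real code, but this is an illustration rather than the patch itself):
   
   ```scala
   import scala.collection.mutable.ArrayBuffer
   
   // Assign average ranks to a sorted sequence, flushing only when the value
   // changes -- never on a size threshold -- so a tie group is never split.
   def averageRanks(sortedVals: Seq[Double]): Seq[Double] = {
     val out = ArrayBuffer.empty[Double]
     var preVal = Double.NaN
     var startRank = 0.0
     val cached = ArrayBuffer.empty[Int] // stand-in for cachedUids
     def flush(): Unit = {
       val avg = startRank + (cached.size - 1) / 2.0
       cached.foreach(_ => out += avg)
       cached.clear()
     }
     sortedVals.zipWithIndex.foreach { case (v, rank) =>
       if (v != preVal) { // no `|| cached.size >= 10000000` branch any more
         flush()
         preVal = v
         startRank = rank + 1.0
       }
       cached += rank
     }
     flush()
     out.toSeq
   }
   
   println(averageRanks(Seq(1.0, 1.0, 1.0, 2.0)).mkString(", ")) // 2.0, 2.0, 2.0, 4.0
   ```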







[GitHub] [spark] xiaoa6435 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "xiaoa6435 (via GitHub)" <gi...@apache.org>.
xiaoa6435 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1291065703


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   Flushing at `cachedUids.size >= N` doesn't generate an approximate result (the output can be far from the correct one). It seems better to emit a warning without flushing.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1290822619


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   I think this limitation is on purpose.
   
   I think we could either
   1. document this limitation; or
   2. add a new parameter for it.
   
   also cc @srowen @WeichenXu123 





[GitHub] [spark] xiaoa6435 commented on a diff in pull request #42431: [WIP][SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties

Posted by "xiaoa6435 (via GitHub)" <gi...@apache.org>.
xiaoa6435 commented on code in PR #42431:
URL: https://github.com/apache/spark/pull/42431#discussion_r1290839233


##########
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala:
##########
@@ -65,8 +65,8 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         output
       }
       iter.flatMap { case (((j, v), uid), rank) =>
-        // If we see a new value or cachedUids is too big, we flush ids with their average rank.
-        if (j != preCol || v != preVal || cachedUids.size >= 10000000) {

Review Comment:
   A `cachedUids` of 10_000_000 elements is about 38 MB. Without a size limit on `cachedUids`, it could lead to OOM.









Re: [PR] [SPARK-42905][MLLIB] fix spearman correlation incorrect and inconsistent results when data has huge amount of ties [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #42431:
URL: https://github.com/apache/spark/pull/42431#issuecomment-1825015443

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

