Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/07/05 03:04:00 UTC
[jira] [Commented] (SPARK-39664) RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
[ https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562339#comment-17562339 ]
Hyukjin Kwon commented on SPARK-39664:
--------------------------------------
[~igaloly] would you mind sharing a simplified version of your code?
> RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
> ----------------------------------------------------------------
>
> Key: SPARK-39664
> URL: https://issues.apache.org/jira/browse/SPARK-39664
> Project: Spark
> Issue Type: Bug
> Components: Pandas API on Spark, PySpark
> Affects Versions: 3.2.1
> Reporter: igal l
> Priority: Major
>
> I have a PySpark DataFrame with one column. The column type is Vector, and the values are DenseVectors of size 768. The DataFrame has 1 million rows.
> I want to calculate the covariance matrix of this set of vectors.
> When I calculate it with `RowMatrix(df.rdd.map(list)).computeCovariance()`, it takes 1.57 minutes.
> When I calculate the correlation matrix with `Correlation.corr(df, '_1')`, it takes 33 seconds.
> The formulas for covariance and correlation are nearly the same, so I don't understand the gap between the two timings.
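
For reference, the relationship between the two matrices that the description relies on can be sketched in plain Python. This is only an illustration on a tiny made-up dataset, not Spark's implementation: the helper names `covariance_matrix` and `correlation_from_covariance` are hypothetical, and the sample (n - 1) divisor is an assumption. It shows that a Pearson correlation matrix is the covariance matrix rescaled by the per-column standard deviations.

```python
import math


def covariance_matrix(rows):
    """Sample covariance (divides by n - 1) of equal-length row vectors."""
    n = len(rows)
    d = len(rows[0])
    # Column means.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        centered = [r[j] - means[j] for j in range(d)]
        # Accumulate the outer product of the centered row.
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j]
    return [[c / (n - 1) for c in row] for row in cov]


def correlation_from_covariance(cov):
    """Rescale a covariance matrix: corr[i][j] = cov[i][j] / (std_i * std_j)."""
    d = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(d)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(d)] for i in range(d)]


# Tiny made-up dataset: 3 rows, 2 columns.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0]]
cov = covariance_matrix(rows)
corr = correlation_from_covariance(cov)
```

Since the two results differ only by this rescaling, a large timing gap would more plausibly come from how the data reaches each API than from the arithmetic itself, which is why a simplified reproduction would help.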