Posted to issues@spark.apache.org by "Kuku1 (JIRA)" <ji...@apache.org> on 2016/11/23 13:13:58 UTC

[jira] [Updated] (SPARK-18562) Correlation causes Error “Cannot determine the number of cols”

     [ https://issues.apache.org/jira/browse/SPARK-18562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuku1 updated SPARK-18562:
--------------------------
    Component/s: MLlib

> Correlation causes Error “Cannot determine the number of cols”
> --------------------------------------------------------------
>
>                 Key: SPARK-18562
>                 URL: https://issues.apache.org/jira/browse/SPARK-18562
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 14.04LTS
>            Reporter: Kuku1
>
> I followed the MLlib docs on how to calculate a correlation. I'm using Spark 1.6.1.
> First my application filters out elements that do not have all the values I'm looking for. Afterwards, I map each of the remaining elements to a dense Vector, as shown in the docs, and pass the resulting RDD[Vector] to the MLlib correlation function.
> My code is the following:
> {code}
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.rdd.RDD
>
> // Keep only documents that carry all three values.
> val filteredRdd = rdd.filter(document => document.containsKey("SomeValue1")
>   && document.containsKey("SomeValue2") && document.containsKey("SomeValue3"))
> // Map each remaining document to a dense vector of its three values.
> val vectorRdd: RDD[Vector] = filteredRdd.map(document => {
>   Vectors.dense(document.getDouble("SomeValue1"), document.getDouble("SomeValue2"), document.getDouble("SomeValue3"))
> })
> val correlation_matrix = Statistics.corr(vectorRdd, method = "spearman")
> println("Spearman: " + correlation_matrix.toString())
> val correlation_matrix_pearson = Statistics.corr(vectorRdd, method = "pearson")
> println("Pearson: " + correlation_matrix_pearson.toString())
> {code}
> This is the error that gets thrown:
> {code}
> 16/11/23 13:19:51 ERROR ApplicationMaster: User class threw exception: 
> java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
>     at scala.sys.package$.error(package.scala:27)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:328)
>     at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
>     at org.apache.spark.mllib.stat.correlation.SpearmanCorrelation$.computeCorrelationMatrix(SpearmanCorrelation.scala:91)
>     at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
>     at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
> {code}
> Because I filter out the elements that would lead to an incomplete vector, I don't see how this error is related to my code, so I have created this issue.
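> For what it's worth, here is a minimal, self-contained sketch (the object name and local master setting are mine, not part of the original application) that simulates the condition named in the message, an RDD[Vector] with zero rows, and one way to guard against it before calling Statistics.corr:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.rdd.RDD
>
> object EmptyCorrRepro {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("corr-repro").setMaster("local[*]"))
>
>     // Simulates the reported situation: the filter has removed every element,
>     // so the RDD[Vector] has no rows left.
>     val vectorRdd: RDD[Vector] = sc.parallelize(Seq.empty[Vector])
>
>     // Calling corr directly on this empty RDD throws the exact error from the
>     // report: RowMatrix.numCols has no rows from which to infer the column count.
>     // Statistics.corr(vectorRdd, method = "spearman")
>
>     // Guard: check for emptiness before computing the correlation.
>     if (vectorRdd.isEmpty()) {
>       println("RDD is empty after filtering; skipping correlation.")
>     } else {
>       println("Spearman: " + Statistics.corr(vectorRdd, method = "spearman"))
>     }
>
>     sc.stop()
>   }
> }
> {code}
> If isEmpty() returns true on the real filteredRdd, then the filter (or the data it runs against in this deployment) is eliminating every document, which would explain the exception even though the correlation code itself is correct.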



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org