Posted to issues@spark.apache.org by "Kuku1 (JIRA)" <ji...@apache.org> on 2016/11/23 13:12:58 UTC

[jira] [Created] (SPARK-18562) Correlation causes Error “Cannot determine the number of cols”

Kuku1 created SPARK-18562:
-----------------------------

             Summary: Correlation causes Error “Cannot determine the number of cols”
                 Key: SPARK-18562
                 URL: https://issues.apache.org/jira/browse/SPARK-18562
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.6.1
         Environment: Ubuntu 14.04LTS
            Reporter: Kuku1


I followed the MLlib docs on how to calculate a correlation. I'm using Spark 1.6.1.

My application first filters out elements that do not contain all the values I'm looking for. It then maps each remaining element to a dense Vector, as shown in the docs, and passes the resulting RDD[Vector] to the MLlib correlation function.

My code is the following:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Keep only the documents that contain all three values.
val filteredRdd = rdd.filter(document => document.containsKey("SomeValue1")
  && document.containsKey("SomeValue2") && document.containsKey("SomeValue3"))

// Map each remaining document to a dense vector of its three values.
val vectorRdd: RDD[Vector] = filteredRdd.map(document =>
  Vectors.dense(
    document.getDouble("SomeValue1"),
    document.getDouble("SomeValue2"),
    document.getDouble("SomeValue3")))

val correlation_matrix = Statistics.corr(vectorRdd, method = "spearman")
println("Spearman: " + correlation_matrix.toString())

val correlation_matrix_pearson = Statistics.corr(vectorRdd, method = "pearson")
println("Pearson: " + correlation_matrix_pearson.toString())
{code}

This is the error that gets thrown:
{code}
16/11/23 13:19:51 ERROR ApplicationMaster: User class threw exception: java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:328)
    at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
    at org.apache.spark.mllib.stat.correlation.SpearmanCorrelation$.computeCorrelationMatrix(SpearmanCorrelation.scala:91)
    at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
    at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
{code}
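
Judging from the stack trace, the message comes from RowMatrix.numCols, which throws when the rows RDD it wraps contains no rows at all. For reference, a minimal sketch that reproduces the same failure with a deliberately empty RDD (assuming an available SparkContext named sc):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics

// An empty RDD[Vector]: corr cannot determine the number of columns
// from it and fails with the same RuntimeException as above.
val emptyVectors = sc.emptyRDD[Vector]
Statistics.corr(emptyVectors, method = "pearson")
{code}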

Because I filter out the elements that would otherwise produce an empty vector, I don't see how this error can be caused by my code, so I created this issue.
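
To rule out an empty input on my side, one could count the vectors right before the corr call. A minimal diagnostic sketch (count() forces evaluation of the lazy filter/map pipeline, so it reports what corr would actually see):

{code}
// Force evaluation of the lazy pipeline and verify the input is non-empty.
val n = vectorRdd.count()
println(s"vectorRdd contains $n vectors")
require(n > 0, "vectorRdd is empty, so Statistics.corr would fail")
{code}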



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
