Posted to issues@spark.apache.org by "Kuku1 (JIRA)" <ji...@apache.org> on 2016/11/23 13:12:58 UTC
[jira] [Created] (SPARK-18562) Correlation causes Error “Cannot determine the number of cols”
Kuku1 created SPARK-18562:
-----------------------------
Summary: Correlation causes Error “Cannot determine the number of cols”
Key: SPARK-18562
URL: https://issues.apache.org/jira/browse/SPARK-18562
Project: Spark
Issue Type: Bug
Affects Versions: 1.6.1
Environment: Ubuntu 14.04LTS
Reporter: Kuku1
I followed the MLlib docs on how to calculate a correlation. I'm using Spark 1.6.1.
First, my application filters out documents that do not contain all of the values I'm looking for. Then I map each remaining document to a dense Vector, as shown in the docs, and pass the resulting RDD[Vector] to the MLlib correlation function.
My code is the following:
{code}
val filteredRdd = rdd.filter(document =>
  document.containsKey("SomeValue1") &&
  document.containsKey("SomeValue2") &&
  document.containsKey("SomeValue3"))

val vectorRdd: RDD[Vector] = filteredRdd.map(document =>
  Vectors.dense(
    document.getDouble("SomeValue1"),
    document.getDouble("SomeValue2"),
    document.getDouble("SomeValue3")))

val correlation_matrix = Statistics.corr(vectorRdd, method = "spearman")
println("Spearman: " + correlation_matrix.toString())

val correlation_matrix_pearson = Statistics.corr(vectorRdd, method = "pearson")
println("Pearson: " + correlation_matrix_pearson.toString())
{code}
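For reference, the exception message states that the rows RDD is empty when the correlation is computed. A minimal defensive sketch (my own, not from the report: the object name, the local master, and the stand-in data are all hypothetical) would check the RDD before calling Statistics.corr:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

object CorrGuard {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("corr-guard").setMaster("local[2]"))
    // Hypothetical stand-in rows; in the report these come from filteredRdd.
    val vectorRdd = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(2.0, 4.0, 6.0),
      Vectors.dense(3.0, 5.0, 9.0)))
    // Statistics.corr fails with "Cannot determine the number of cols"
    // when the rows RDD is empty, so check first.
    if (vectorRdd.isEmpty()) {
      println("vectorRdd is empty: nothing to correlate")
    } else {
      val m = Statistics.corr(vectorRdd, method = "spearman")
      println("Spearman: " + m.toString())
    }
    sc.stop()
  }
}
```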
This is the error that gets thrown:
{code}
16/11/23 13:19:51 ERROR ApplicationMaster: User class threw exception:
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:328)
at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at org.apache.spark.mllib.stat.correlation.SpearmanCorrelation$.computeCorrelationMatrix(SpearmanCorrelation.scala:91)
at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
{code}
Because my filter removes every document that is missing one of the values, I don't see how this error can be caused by my code. That is why I created this issue.
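One way the rows RDD can still end up empty is if no document survives the containsKey filter at all. A small plain-Scala sketch (hypothetical: plain Maps stand in for the reporter's document type) shows the pattern:

```scala
object FilterCheck {
  def main(args: Array[String]): Unit = {
    // Stand-in documents: each one is missing at least one required key.
    val docs = Seq(
      Map("SomeValue1" -> 1.0, "SomeValue2" -> 2.0),
      Map("SomeValue1" -> 1.0, "SomeValue3" -> 3.0))
    val required = Seq("SomeValue1", "SomeValue2", "SomeValue3")
    // Same predicate shape as the report's filter.
    val filtered = docs.filter(d => required.forall(d.contains))
    // If no document carries all three keys, the filtered collection is
    // empty -- the same situation that makes RowMatrix.numCols fail.
    println(s"rows after filter: ${filtered.size}")
  }
}
```

Counting (or isEmpty-checking) the filtered RDD before the corr call would distinguish a bug in Spark from a filter that simply matched nothing.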
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)