You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/12/11 23:23:46 UTC

[jira] [Resolved] (SPARK-11497) PySpark RowMatrix Constructor Has Type Erasure Issue

     [ https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-11497.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.1
                   1.5.3
                   2.0.0

Issue resolved by pull request 9458
[https://github.com/apache/spark/pull/9458]

> PySpark RowMatrix Constructor Has Type Erasure Issue
> ----------------------------------------------------
>
>                 Key: SPARK-11497
>                 URL: https://issues.apache.org/jira/browse/SPARK-11497
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 1.5.0, 1.5.1, 1.6.0
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>            Priority: Minor
>             Fix For: 2.0.0, 1.5.3, 1.6.1
>
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark RowMatrix constructor. As discussed on the dev list [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html], there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a RowMatrix from an RDD[Vector] in [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115], the Vector type is erased, resulting in an RDD[Object]. Thus, when calling Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which an Object cannot be cast to a Spark Vector. As noted in the aforementioned dev list thread, this issue was also encountered with DecisionTrees, and the fix involved an explicit retag of the RDD with a Vector type. Thus, this PR will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely due to their related helper functions in PythonMLlibAPI creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org