Posted to user@spark.apache.org by "Md. Rezaul Karim" <re...@insight-centre.org> on 2017/04/09 14:01:13 UTC

How to convert Spark MLlib vector to ML Vector?

I have already posted this question on StackOverflow
<http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>,
but I have not received any response yet. I'm trying to use the
RandomForest algorithm for classification after applying PCA, since the
dataset is pretty high-dimensional. Here's my source code:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession

import org.apache.spark.ml.feature.PCA
import org.apache.spark.rdd.RDD

object PCAExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "E:/Exp/")
      .appName(s"OneVsRestExample")
      .getOrCreate()

    val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")

    val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
    val (trainingData, testData) = (splits(0), splits(1))

    val sqlContext = new SQLContext(spark.sparkContext)
    import sqlContext.implicits._
    val trainingDF = trainingData.toDF("label", "features")

    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(100)
      .fit(trainingDF)

    val pcaTrainingData = pca.transform(trainingDF)
    //pcaTrainingData.show()

    val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
      row.getAs[Double]("label"),
      row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))

    //val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(row.getAs[Double]("label"),
    //  Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))))

    val numClasses = 10
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 10 // Use more in practice.
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val impurity = "gini"
    val maxDepth = 20
    val maxBins = 32

    val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  }
}

However, I'm getting the following error:

*Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: Column features must be of type
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.*

What am I doing wrong in my code? More precisely, the exception is thrown
at the following line:

val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(100)
      .fit(trainingDF) /// GETTING EXCEPTION HERE
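
From the error message (and the question title), my understanding is that the
"features" column holds the old org.apache.spark.mllib.linalg vectors (since
MLUtils.loadLibSVMFile returns mllib LabeledPoints), while the new
org.apache.spark.ml.feature.PCA expects org.apache.spark.ml.linalg vectors.
Below is a rough, untested sketch of the conversion I have in mind, assuming
MLUtils.convertVectorColumnsToML and mllib's Vectors.fromML do what I think
they do; I would appreciate confirmation that this is the right approach:

import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// Convert the old mllib vector column to the new ml vector type,
// so that the ml PCA will accept it.
val trainingML = MLUtils.convertVectorColumnsToML(trainingDF, "features")

val pcaModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingML)

// Convert back to mllib vectors to build LabeledPoints for the
// RDD-based RandomForest.
val labeled = pcaModel.transform(trainingML).rdd.map { row =>
  LabeledPoint(
    row.getAs[Double]("label"),
    OldVectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector]("pcaFeatures")))
}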

Could someone please help me solve this problem?





Kind regards,
*Md. Rezaul Karim*

Re: How to convert Spark MLlib vector to ML Vector?

Posted by "Md. Rezaul Karim" <re...@insight-centre.org>.
Hi Yan, Ryan, and Nick,

Actually, for a special use case, I had to use the RDD-based Spark MLlib API,
which did not work out in the end. Therefore, I switched to Spark ML later on.

Thanks for your support, guys.




Regards,
_________________________________
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 10 April 2017 at 06:45, 颜发才(Yan Facai) <fa...@gmail.com> wrote:

> how about using
>
> val dataset = spark.read.format("libsvm")
>   .option("numFeatures", "780")
>   .load("data/mllib/sample_libsvm_data.txt")
>
> instead of
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>
>
>
>
>
> On Mon, Apr 10, 2017 at 11:19 AM, Ryan <ry...@gmail.com> wrote:
>
>> you could write a udf using the asML method along with some type casting,
>> then apply the udf to data after pca.
>>
>> when using pipeline, that udf need to be wrapped in a customized
>> transformer, I think.
>>
>> On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <nick.pentreath@gmail.com
>> > wrote:
>>
>>> Why not use the RandomForest from Spark ML?
>>>

Re: How to convert Spark MLlib vector to ML Vector?

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
By the way, always try to use `ml` instead of `mllib`.

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.classification.RandomForestClassifier
or
import org.apache.spark.ml.regression.RandomForestRegressor


For more details, see
http://spark.apache.org/docs/latest/ml-classification-regression.html.
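
For example, here is a rough sketch of what a pure `ml` version of the
original pipeline could look like (untested, written from memory of the ml
API, assuming the SparkSession `spark` and reusing the paths and parameters
from the original post):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.classification.RandomForestClassifier

// The libsvm data source already produces ml vectors in the "features" column.
val dataset = spark.read.format("libsvm").load("data/mnist.bz2")
val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("pcaFeatures")
  .setNumTrees(10)
  .setImpurity("gini")
  .setMaxDepth(20)
  .setMaxBins(32)

// No manual vector conversion is needed anywhere in this version.
val model = new Pipeline().setStages(Array(pca, rf)).fit(trainingData)
val predictions = model.transform(testData)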



On Mon, Apr 10, 2017 at 1:45 PM, 颜发才(Yan Facai) <fa...@gmail.com> wrote:

> how about using
>
> val dataset = spark.read.format("libsvm")
>   .option("numFeatures", "780")
>   .load("data/mllib/sample_libsvm_data.txt")
>
> instead of
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>
>
>
>
>
> On Mon, Apr 10, 2017 at 11:19 AM, Ryan <ry...@gmail.com> wrote:
>
>> you could write a udf using the asML method along with some type casting,
>> then apply the udf to data after pca.
>>
>> when using pipeline, that udf need to be wrapped in a customized
>> transformer, I think.
>>
>> On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <nick.pentreath@gmail.com
>> > wrote:
>>
>>> Why not use the RandomForest from Spark ML?
>>>

Re: How to convert Spark MLlib vector to ML Vector?

Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
How about using

val dataset = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

instead of
val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
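
The libsvm data source produces a "features" column of the new
org.apache.spark.ml.linalg vector type, which is what the ml PCA checks for,
so the fit should then succeed without any conversion. A quick, untested way
to verify the column type:

val dataset = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mnist.bz2")

// Should print org.apache.spark.ml.linalg.VectorUDT,
// matching the type named in the exception.
println(dataset.schema("features").dataType.getClass.getName)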





On Mon, Apr 10, 2017 at 11:19 AM, Ryan <ry...@gmail.com> wrote:

> you could write a udf using the asML method along with some type casting,
> then apply the udf to data after pca.
>
> when using pipeline, that udf need to be wrapped in a customized
> transformer, I think.
>
> On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <ni...@gmail.com>
> wrote:
>
>> Why not use the RandomForest from Spark ML?
>>

Re: How to convert Spark MLlib vector to ML Vector?

Posted by Ryan <ry...@gmail.com>.
You could write a UDF using the asML method along with some type casting,
then apply the UDF to the data after PCA.

When using a pipeline, that UDF needs to be wrapped in a customized
transformer, I think.
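
A rough sketch of what I mean (untested; assuming both vector classes are
registered as UDTs so they can pass through a UDF, and reusing the trainingDF
and pcaFeatures names from your code):

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vector => NewVector}

// mllib -> ml, so the ml PCA accepts the "features" column
val toML = udf { v: OldVector => v.asML }
// ml -> mllib, e.g. to go back to mllib types after PCA
val toMLlib = udf { v: NewVector => OldVectors.fromML(v) }

val trainingML = trainingDF.withColumn("features", toML(col("features")))
// ... fit and transform the ml PCA on trainingML, then if needed:
// val pcaOut = pcaModel.transform(trainingML)
//   .withColumn("pcaFeatures", toMLlib(col("pcaFeatures")))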

On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <ni...@gmail.com>
wrote:

> Why not use the RandomForest from Spark ML?
>

Re: How to convert Spark MLlib vector to ML Vector?

Posted by Nick Pentreath <ni...@gmail.com>.
Why not use the RandomForest from Spark ML?
