Posted to user@spark.apache.org by "Md. Rezaul Karim" <re...@insight-centre.org> on 2017/04/09 14:01:13 UTC
How to convert Spark MLlib vector to ML Vector?
I have already posted this question to StackOverflow
<http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>,
but haven't received any response yet. I'm trying to use the RandomForest
algorithm for classification after applying PCA, since the dataset is
pretty high-dimensional. Here's my source code:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.PCA
import org.apache.spark.rdd.RDD
object PCAExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "E:/Exp/")
      .appName("OneVsRestExample")
      .getOrCreate()

    val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")

    val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
    val (trainingData, testData) = (splits(0), splits(1))

    val sqlContext = new SQLContext(spark.sparkContext)
    import sqlContext.implicits._
    val trainingDF = trainingData.toDF("label", "features")

    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(100)
      .fit(trainingDF)

    val pcaTrainingData = pca.transform(trainingDF)
    //pcaTrainingData.show()

    val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
      row.getAs[Double]("label"),
      row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))

    //val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(row.getAs[Double]("label"),
    //  Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))))

    val numClasses = 10
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 10 // Use more in practice.
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val impurity = "gini"
    val maxDepth = 20
    val maxBins = 32

    val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  }
}
However, I'm getting the following error:
*Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: Column features must be of type
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.*
What am I doing wrong in my code? The exception is thrown at this line:
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingDF) // GETTING EXCEPTION HERE
Could someone please help me solve this problem?
Kind regards,
*Md. Rezaul Karim*
Re: How to convert Spark MLlib vector to ML Vector?
Posted by "Md. Rezaul Karim" <re...@insight-centre.org>.
Hi Yan, Ryan, and Nick,
For a special use case I had to use the RDD-based Spark MLlib API, which
did not work out in the end, so I switched to Spark ML later on.
Thanks for your support, guys.
Regards,
_________________________________
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
Re: How to convert Spark MLlib vector to ML Vector?
Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
By the way, always try to use `ml` instead of `mllib`:
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.classification.RandomForestClassifier
or
import org.apache.spark.ml.regression.RandomForestRegressor
For more details, see
http://spark.apache.org/docs/latest/ml-classification-regression.html.
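For example, the `ml` LabeledPoint already carries the new vector type, so
a DataFrame built from it needs no conversion at all. A tiny sketch
(assuming spark.implicits._ is in scope):

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

// toDF() yields "label" and "features" columns, where "features"
// is already org.apache.spark.ml.linalg.VectorUDT
val df = Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
).toDF()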
Re: How to convert Spark MLlib vector to ML Vector?
Posted by "颜发才 (Yan Facai)" <fa...@gmail.com>.
How about using

val dataset = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

instead of

val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
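This matters because the libsvm data source (Spark 2.x) returns a DataFrame
whose "features" column already uses the new
org.apache.spark.ml.linalg.VectorUDT, so the ml PCA accepts it directly and
no mllib-to-ml conversion is needed. A quick sanity check:

dataset.printSchema()
// prints something like:
// root
//  |-- label: double ...
//  |-- features: vector ...   <- the ml vector type, not the mllib one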
Re: How to convert Spark MLlib vector to ML Vector?
Posted by Ryan <ry...@gmail.com>.
You could write a UDF using the asML method, along with some type casting,
and then apply the UDF to the data after PCA.
When using a pipeline, that UDF would need to be wrapped in a custom
transformer, I think.
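For instance, a rough, untested sketch of such a UDF (assuming Spark 2.x,
where the old mllib vectors have an asML method; the column names match
the code earlier in the thread):

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.mllib.linalg.{Vector => OldVector}

// converts an mllib vector column into the equivalent ml vector
val asML = udf((v: OldVector) => v.asML)
val convertedDF = trainingDF.withColumn("features", asML(col("features")))
// convertedDF can now go into the ml PCA stage

There is also MLUtils.convertVectorColumnsToML(df, "features"), which does
the same column conversion without a hand-written UDF, if memory serves.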
Re: How to convert Spark MLlib vector to ML Vector?
Posted by Nick Pentreath <ni...@gmail.com>.
Why not use the RandomForest from Spark ML?
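That would sidestep the vector-type mismatch entirely, since the ml
RandomForestClassifier consumes the PCA output DataFrame as-is. A rough
sketch mirroring the parameters from the original code (note the label
column may additionally need indexing/metadata, e.g. via a StringIndexer,
so the classifier knows the number of classes):

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("pcaFeatures") // the PCA output column
  .setNumTrees(10)
  .setImpurity("gini")
  .setMaxDepth(20)
  .setMaxBins(32)

// fits directly on the DataFrame; no RDD[LabeledPoint] needed
val model = rf.fit(pcaTrainingData)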