Posted to dev@spark.apache.org by satyajit vegesna <sa...@gmail.com> on 2016/12/09 02:42:35 UTC
Issue in using DenseVector in RowMatrix, error could be due to ml and
mllib package changes
Hi All,
Please find the code below.
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by satyajit on 12/7/16.
 */
object DIMSUMusingtf extends App {

  val conf = new SparkConf()
    .setMaster("local[1]")
    .setAppName("testColsim")
  val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder
    .appName("testColSim").getOrCreate()

  import org.apache.spark.ml.feature.Tokenizer

  val sentenceData = spark.createDataFrame(Seq(
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
  )).toDF("label", "sentence")

  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
  val wordsData = tokenizer.transform(sentenceData)

  val hashingTF = new HashingTF()
    .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
  val featurizedData = hashingTF.transform(wordsData)

  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedData)
  val rescaledData = idfModel.transform(featurizedData)
  rescaledData.show()
  rescaledData.select("features", "label").take(3).foreach(println)
  val check = rescaledData.select("features")

  val row = check.rdd.map(row => row.getAs[SparseVector]("features"))

  // I am basically trying to use a DenseVector as a direct input to RowMatrix,
  // but I get an error that RowMatrix "Cannot resolve constructor".
  val mat = new RowMatrix(row)

  row.foreach(println)
}
Any help would be appreciated.
Regards,
Satyajit.
Re: Issue in using DenseVector in RowMatrix, error could be due to ml
and mllib package changes
Posted by Nick Pentreath <ni...@gmail.com>.
Yes, this is most likely because HashingTF returns ml vectors, while
RowMatrix needs mllib vectors.
I'd recommend using the vector conversion utils (I think they are in
mllib.linalg.Vectors, but I'm on mobile right now so can't recall exactly).
There are util methods for converting single vectors as well as whole
vector columns of a DataFrame. Check the MLlib user guide for 2.0 for details.
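The DataFrame-level conversion mentioned above could look roughly like the sketch below. It assumes Spark 2.x and uses MLUtils.convertVectorColumnsFromML from org.apache.spark.mllib.util. The object name ConvertColumnSketch, the helper rowMatrixFromColumn, and the two sample vectors are made up for illustration and stand in for the IDF output from the original post.

```scala
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{DataFrame, SparkSession}

object ConvertColumnSketch {

  // Hypothetical helper: converts the named ml vector column to mllib vectors
  // and wraps the resulting rows in a RowMatrix.
  def rowMatrixFromColumn(df: DataFrame, col: String): RowMatrix = {
    val converted = MLUtils.convertVectorColumnsFromML(df, col)
    new RowMatrix(converted.rdd.map(_.getAs[OldVector](col)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("convertColumnSketch")
      .getOrCreate()

    // Illustrative data only: one ml vector column named "features".
    val df = spark.createDataFrame(Seq(
      Tuple1(NewVectors.sparse(3, Array(0, 2), Array(1.0, 3.0))),
      Tuple1(NewVectors.dense(0.0, 2.0, 0.0))
    )).toDF("features")

    val mat = rowMatrixFromColumn(df, "features")
    println(s"rows = ${mat.numRows()}, cols = ${mat.numCols()}")

    spark.stop()
  }
}
```

Converting the column once keeps the later RDD code free of ml types, which may be convenient when several mllib APIs consume the same DataFrame.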
Re: Issue in using DenseVector in RowMatrix, error could be due to
ml and mllib package changes
Posted by viirya <vi...@gmail.com>.
Hi Satyajit,
You can just import mllib's Vectors (org.apache.spark.mllib.linalg.Vectors)
and call its fromML method to convert ml's Vector to mllib's Vector.
For example:
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

val row = check.rdd.map(row =>
  OldVectors.fromML(row.getAs[SparseVector]("features")))
val mat = new RowMatrix(row)
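For reference, here is a self-contained sketch of the same fromML fix applied to the pipeline from the original post, assuming Spark 2.x. The object name RowMatrixFromTfIdf and the helper tfIdfRowMatrix are hypothetical. Reading the column as the general ml Vector type instead of SparseVector is a choice made here, so the code does not depend on which concrete vector class the pipeline emits.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector => NewVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object RowMatrixFromTfIdf {

  // Hypothetical helper: runs the TF-IDF pipeline from the original post and
  // returns a RowMatrix built from the converted feature vectors.
  def tfIdfRowMatrix(spark: SparkSession): RowMatrix = {
    val sentenceData = spark.createDataFrame(Seq(
      (0, "Hi I heard about Spark"),
      (0, "I wish Java could use case classes"),
      (1, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    val wordsData = new Tokenizer()
      .setInputCol("sentence").setOutputCol("words")
      .transform(sentenceData)
    val featurizedData = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
      .transform(wordsData)
    val rescaledData = new IDF()
      .setInputCol("rawFeatures").setOutputCol("features")
      .fit(featurizedData)
      .transform(featurizedData)

    // Convert each ml vector to an mllib vector before building the matrix.
    val rows = rescaledData.select("features").rdd
      .map(r => OldVectors.fromML(r.getAs[NewVector]("features")))
    new RowMatrix(rows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("testColSim")
      .getOrCreate()
    val mat = tfIdfRowMatrix(spark)
    println(s"rows = ${mat.numRows()}, cols = ${mat.numCols()}")
    spark.stop()
  }
}
```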
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center