Posted to dev@spark.apache.org by satyajit vegesna <sa...@gmail.com> on 2016/12/09 02:42:35 UTC

Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

Hi All,

Please find the code below.


import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by satyajit on 12/7/16.
  */
object DIMSUMusingtf extends App {

  val conf = new SparkConf()
    .setMaster("local[1]")
    .setAppName("testColsim")
  val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder
    .appName("testColSim").getOrCreate()

  import org.apache.spark.ml.feature.Tokenizer

  val sentenceData = spark.createDataFrame(Seq(
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
  )).toDF("label", "sentence")

  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

  val wordsData = tokenizer.transform(sentenceData)


  val hashingTF = new HashingTF()
    .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

  val featurizedData = hashingTF.transform(wordsData)


  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedData)
  val rescaledData = idfModel.transform(featurizedData)
  rescaledData.show()
  rescaledData.select("features", "label").take(3).foreach(println)
  val check = rescaledData.select("features")

  val row = check.rdd.map(row => row.getAs[SparseVector]("features"))

  // I am basically trying to use DenseVector as a direct input to RowMatrix,
  // but I get an error on RowMatrix: "Cannot resolve constructor".
  val mat = new RowMatrix(row)

  row.foreach(println)
}

Any help would be appreciated.

Regards,
Satyajit.

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

Posted by Nick Pentreath <ni...@gmail.com>.

Yes, most likely because HashingTF returns ml vectors while RowMatrix needs
mllib vectors.

I'd recommend using the vector conversion utils (I think in
mllib.linalg.Vectors, but I'm on mobile right now so can't recall exactly).
There are util methods for converting single vectors as well as vector
columns of a DataFrame. Check the MLlib user guide for 2.0 for details.
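
For the DataFrame-column case mentioned above, here is a minimal sketch,
assuming Spark 2.x and that the helper meant is
MLUtils.convertVectorColumnsFromML in org.apache.spark.mllib.util. The
object name and the small hand-built DataFrame are illustrative stand-ins
for the "features" column from the original post, not code from the thread.

```scala
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object ConvertColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("convertColumnSketch")
      .getOrCreate()

    // Stand-in for the ml-vector "features" column produced by HashingTF/IDF.
    val df = spark.createDataFrame(Seq(
      (0, NewVectors.sparse(4, Array(0, 2), Array(1.0, 3.0))),
      (1, NewVectors.dense(0.0, 2.0, 0.0, 4.0))
    )).toDF("label", "features")

    // Convert the whole column from ml vectors to mllib vectors in one call.
    val converted = MLUtils.convertVectorColumnsFromML(df, "features")

    // The column now holds mllib vectors, so RowMatrix accepts the RDD.
    val rows = converted.select("features").rdd
      .map(r => r.getAs[OldVector]("features"))
    val mat = new RowMatrix(rows)

    println(s"rows=${mat.numRows()}, cols=${mat.numCols()}")
    spark.stop()
  }
}
```

Converting the column up front keeps the RDD mapping free of per-row
conversion calls; converting single vectors with Vectors.fromML should
work equally well for a case this small.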

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

Posted by viirya <vi...@gmail.com>.

Hi Satyajit,

You can just import mllib's Vectors (org.apache.spark.mllib.linalg.Vectors)
and call its fromML method to convert ml's Vector to mllib's Vector.

For example:

import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

val row = check.rdd.map(row =>
  OldVectors.fromML(row.getAs[SparseVector]("features")))
val mat = new RowMatrix(row)
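
Putting the suggestion together with the original program, a
self-contained sketch might look like the following. It assumes Spark 2.x.
The object name is made up for illustration, reading the column as the
general ml Vector type (rather than SparseVector) is my choice, and the
final columnSimilarities() call is only a guess at what the poster wanted
to compute, based on the "DIMSUM" and "testColSim" names in the post.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector => NewVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object RowMatrixFromTfIdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("testColSim")
      .getOrCreate()

    val sentenceData = spark.createDataFrame(Seq(
      (0, "Hi I heard about Spark"),
      (0, "I wish Java could use case classes"),
      (1, "Logistic regression models are neat")
    )).toDF("label", "sentence")

    // Same tokenizer / HashingTF / IDF pipeline as in the original post.
    val wordsData = new Tokenizer()
      .setInputCol("sentence").setOutputCol("words")
      .transform(sentenceData)
    val featurizedData = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
      .transform(wordsData)
    val rescaledData = new IDF()
      .setInputCol("rawFeatures").setOutputCol("features")
      .fit(featurizedData)
      .transform(featurizedData)

    // Read each ml vector and convert it to an mllib vector with fromML.
    val rows = rescaledData.select("features").rdd
      .map(r => OldVectors.fromML(r.getAs[NewVector]("features")))

    // RowMatrix now resolves, because it receives an RDD of mllib vectors.
    val mat = new RowMatrix(rows)

    // Assumption: column similarities are the end goal of the original code.
    mat.columnSimilarities().entries.take(5).foreach(println)

    spark.stop()
  }
}
```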

 



-----
Liang-Chi Hsieh | @viirya
Spark Technology Center