Posted to user@spark.apache.org by "Huang,Jin" <hu...@gmail.com> on 2014/12/09 23:12:35 UTC

Implementing a query-to-sparse-vector representation in Spark

I know quite a lot about machine learning, but I'm new to Scala and Spark. I got stuck on the Spark API, so please advise.

I have a txt file where each line has the following format:

#label \t #query (a string of words, delimited by spaces)
1  wireless amazon kindle
2  apple iPhone 5
1  kindle fire 8G
2  apple iPad
The first field is the label and the second is the query string. My plan is to split each line into a label and a feature string, transform the string into a sparse vector using the built-in Word2Vec (I assume it first builds a dictionary using bag of words), and then train a classifier with SVMWithSGD.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.regression.LabeledPoint

object QueryClassification {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Query Classification").setMaster("local")
    val sc = new SparkContext(conf)
    val input = sc.textFile("spark_data.txt")

    val word2vec = new Word2Vec()

    val parsedData = input.map { line =>
      val parts = line.split("\t")

      // How do I write the code here? I need to parse parts(1) into a
      // feature vector properly and then apply Word2Vec after the map.
      LabeledPoint(parts(0).toDouble, ????)
    }

    // * is the item I got from parsing parts(1) above
    word2vec.fit(*)

    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)
  }
}
Thanks a lot
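
For reference, a minimal sketch of one way to fill in the gaps, assuming Spark 1.x MLlib; the object name, the (label, words) pairing, and the label remapping are illustrative assumptions, not from the original post. Two API details matter: Word2Vec.fit expects an RDD[Seq[String]], and the fitted Word2VecModel maps individual words rather than whole queries to dense vectors, so a per-query feature has to be assembled, for example by averaging the query's word vectors. SVMWithSGD also expects binary 0/1 labels, so the {1, 2} labels above are remapped.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object QueryClassificationSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Query Classification").setMaster("local")
    val sc = new SparkContext(conf)
    val input = sc.textFile("spark_data.txt")

    // Split each line once into (label, words) and reuse it below.
    val labeled = input.map { line =>
      val parts = line.split("\t")
      (parts(0).toDouble, parts(1).split(" ").toSeq)
    }.cache()

    // fit takes an RDD[Seq[String]]. Words below the model's minimum count
    // are dropped and transform throws for unseen words, so this assumes a
    // corpus large enough that every query word reaches the vocabulary.
    val w2vModel = new Word2Vec().fit(labeled.map(_._2))

    val parsedData = labeled.map { case (label, words) =>
      // Average the per-word dense vectors into one feature vector per query.
      val vecs = words.map(w => w2vModel.transform(w).toArray)
      val sum = vecs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      // Remap labels {1, 2} to the {0, 1} that SVMWithSGD expects.
      LabeledPoint(label - 1.0, Vectors.dense(sum.map(_ / vecs.size)))
    }

    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)
  }
}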
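
Note that Word2Vec embeddings are dense, so the sketch above does not actually yield a sparse representation. If a sparse bag-of-words vector is the real goal (as the subject suggests), an alternative is MLlib's HashingTF, which hashes word counts into a fixed-size sparse vector with no separate fitting pass. The 10,000-feature size below is an arbitrary illustration (the library default is 2^20), and input is the same RDD as in the sketch above.

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Hash each query's words into a sparse term-frequency vector.
val hashingTF = new HashingTF(numFeatures = 10000)

val sparseData = input.map { line =>
  val parts = line.split("\t")
  val words = parts(1).split(" ").toSeq
  // Same {1, 2} -> {0, 1} label remapping as before.
  LabeledPoint(parts(0).toDouble - 1.0, hashingTF.transform(words))
}

These sparse LabeledPoints feed into SVMWithSGD.train exactly as before.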