Posted to user@spark.apache.org by Pasquinell Urbani <pa...@exalitica.com> on 2016/05/20 16:45:54 UTC
Problems finding the original objects after HashingTF()
Hi all,
I'm following a TF-IDF example, but I'm having some issues that I'm not
sure how to fix.
The input is the following:
val test = sc.textFile("s3n://.../test_tfidf_products.txt")
test.collect.mkString("\n")
which prints
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[370] at textFile at <console>:121
res241: String =
a a b c d e
b c d d
After that, I compute the TF-IDF ratings with:
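import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenize each line, then hash the terms into term-frequency vectors.
val test2 = test.map(_.split(" ").toSeq)
val hashingTF2 = new HashingTF()
val tf2: RDD[Vector] = hashingTF2.transform(test2)
tf2.cache()
// Fit the IDF model on the term frequencies and rescale them.
val idf2 = new IDF().fit(tf2)
val tfidf2: RDD[Vector] = idf2.transform(tf2)
tfidf2.collect.mkString("\n")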
which prints
(1048576,[97,98,99,100,101],[0.8109302162163288,0.0,0.0,0.0,0.4054651081081644])
(1048576,[98,99,100],[0.0,0.0,0.0])
The numbers [97,98,99,100,101] are the term indices of the first sparse
vector in tfidf2, and the second array holds the rating at each index.
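As a sanity check on the values, here is a small sketch in plain Scala (no Spark needed). It assumes the input file holds the two lines "a a b c d e" and "b c d d", and that the IDF is the smoothed form ln((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The object and method names are my own, not part of MLlib.

```scala
// Plain Scala sketch, not the MLlib implementation.
// Assumptions: two input lines as below, idf = ln((m + 1) / (df + 1)).
object TfIdfCheck {
  val docs: Seq[Seq[String]] = Seq(
    "a a b c d e".split(" ").toSeq,
    "b c d d".split(" ").toSeq
  )

  // Number of documents that contain the term at least once.
  def docFreq(term: String): Int = docs.count(_.contains(term))

  // Smoothed inverse document frequency.
  def idf(term: String): Double =
    math.log((docs.size + 1.0) / (docFreq(term) + 1.0))

  // Raw count of the term in one document, scaled by its IDF.
  def tfidf(term: String, doc: Seq[String]): Double =
    doc.count(_ == term) * idf(term)

  def main(args: Array[String]): Unit = {
    for (term <- Seq("a", "b", "c", "d", "e"))
      println(s"$term -> ${tfidf(term, docs.head)}")
  }
}
```

Under these assumptions, terms that appear in both lines (b, c, d) get a rating of 0, and "a" gets twice the rating of "e" because it occurs twice in the first line, which agrees with the printed vectors.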
I need to access the rating for a given item, for example “a”, but the only
way I have been able to do this is with the indexOf() method of the
HashingTF object.
hashingTF2.indexOf("a")
res236: Int = 97
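The only workaround I can think of is to build an index-to-term table up front from the known vocabulary. Below is a rough sketch in plain Scala. It assumes that indexOf() is the term's hashCode taken modulo the vector size, which is consistent with "a" mapping to 97 but which I have not confirmed. The names TermIndexLookup and indexToTerms are hypothetical, and with the real objects the local indexOf would be replaced by hashingTF2.indexOf.

```scala
// Plain Scala sketch of a reverse lookup; names here are hypothetical.
object TermIndexLookup {
  // 1048576, the vector size printed in the output above.
  val numFeatures: Int = 1 << 20

  // Assumed hashing scheme: non-negative (hashCode mod numFeatures).
  def indexOf(term: String): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Group the distinct terms by index. Two terms can collide on the
  // same index, so each index maps to a sequence of terms.
  def indexToTerms(vocabulary: Iterable[String]): Map[Int, Seq[String]] =
    vocabulary.toSeq.distinct.groupBy(indexOf)

  def main(args: Array[String]): Unit = {
    val lookup = indexToTerms(Seq("a", "a", "b", "c", "d", "e"))
    // Print the terms hashed to the index reported for "a".
    println(lookup(indexOf("a")))
  }
}
```

Because hashing is one-way, the table can only recover terms that were in the vocabulary used to build it, and a collision makes the rating at that index ambiguous.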
Is there a better way to perform this?
Thank you all.