Posted to user@spark.apache.org by Pasquinell Urbani <pa...@exalitica.com> on 2016/05/20 16:45:54 UTC

Problems finding the original objects after HashingTF()

Hi all,

I'm following a TF-IDF example, but I'm having some issues that I'm not
sure how to fix.

The input is the following

val test = sc.textFile("s3n://.../test_tfidf_products.txt")
test.collect.mkString("\n")

which prints

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[370] at textFile at <console>:121
res241: String =
a a b c d e
b c d d

After that, I compute the TF-IDF weights (the ratings) by doing

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val test2 = test.map(_.split(" ").toSeq)
val hashingTF2 = new HashingTF()
val tf2: RDD[Vector] = hashingTF2.transform(test2)
tf2.cache()
val idf2 = new IDF().fit(tf2)
val tfidf2: RDD[Vector] = idf2.transform(tf2)
tfidf2.collect.mkString("\n")

which prints

(1048576,[97,98,99,100,101],[0.8109302162163288,0.0,0.0,0.0,0.4054651081081644])
(1048576,[98,99,100],[0.0,0.0,0.0])

The numbers [97,98,99,100,101] are the term indices stored in the first
sparse vector of tfidf2.
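
The weights in that output can be checked by hand. The sketch below is plain Scala (no Spark needed) and rests on two assumptions: that the input file holds the two lines "a a b c d e" and "b c d d", which is what the indices in the output suggest, and that MLlib's IDF uses its documented formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The helper names idf and tfidf are made up for illustration.

```scala
// Assumed corpus: two documents, already split into terms.
val docs: Seq[Seq[String]] =
  Seq("a a b c d e", "b c d d").map(_.split(" ").toSeq)

// m = total number of documents.
val m = docs.size

// Inverse document frequency: log((m + 1) / (df + 1)).
def idf(term: String): Double = {
  val df = docs.count(_.contains(term)) // documents that contain the term
  math.log((m + 1.0) / (df + 1.0))
}

// TF-IDF weight: raw term count in the document times the IDF.
def tfidf(term: String, doc: Seq[String]): Double =
  doc.count(_ == term) * idf(term)

// Weights of every distinct term in the first document.
docs(0).distinct.foreach(t => println(s"$t -> ${tfidf(t, docs(0))}"))
```

For "a" this gives 2 * ln(1.5), roughly 0.81, and for "e" ln(1.5), roughly 0.41. Terms such as "b" that occur in every document get log(3/3) = 0, which would explain the 0.0 weights.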

I need to access the rating of a given item, for example “a”, but the only way I
have been able to do this is by using the indexOf() method of the HashingTF
object.

hashingTF2.indexOf("a")

res236: Int = 97
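
One possible way to avoid calling indexOf() term by term is to build the reverse mapping (index to term) once and reuse it. This is only a sketch, not an official recipe: it assumes test2, hashingTF2 and tfidf2 from the code above are still in scope in the same spark-shell session, and that the set of distinct terms is small enough to collect on the driver. The names vocab, indexToTerms, weightOf and labelled are hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vector

// Distinct terms of the corpus, collected on the driver (assumed to be small).
val vocab: Array[String] = test2.flatMap(terms => terms).distinct().collect()

// index -> terms. A bucket may hold several terms, because hashing can collide.
val indexToTerms: Map[Int, Seq[String]] =
  vocab.toSeq.groupBy(term => hashingTF2.indexOf(term))

// Hypothetical helper: the weight of one term in one TF-IDF vector.
def weightOf(term: String, v: Vector): Double = v(hashingTF2.indexOf(term))

// Pair every stored index of every vector with its term(s) and weight.
val labelled = tfidf2.map { v =>
  val sv = v.toSparse
  sv.indices.zip(sv.values).toSeq.map { case (i, w) =>
    (indexToTerms.getOrElse(i, Seq.empty[String]), w)
  }
}

labelled.collect().foreach(println)
```

Because HashingTF does not store a vocabulary, the mapping has to be rebuilt from the data like this, and two terms that land in the same bucket cannot be told apart afterwards. The index 97 for "a" is consistent with the index being derived from the term's hash code ("a".hashCode is 97), though that detail may differ between Spark versions.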


Is there a better way to perform this?


Thank you all.