Posted to user@spark.apache.org by Karl Higley <km...@gmail.com> on 2016/02/03 20:28:28 UTC

Re: Product similarity with TF/IDF and Cosine similarity (DIMSUM)

Hi Alan,

I'm slow responding, so you may have already figured this out. Just in
case, though:

val approx = mat.columnSimilarities(0.1)
approxEntries.first()
    res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)

The above is returning the cosine similarity between columns 1638 and
966248. Since you're providing documents as rows, this is conceptually
something like the similarity between terms based on which documents they
occur in.
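
If you want to see which terms sit behind a column index such as 1638, one option is to hash the vocabulary again and filter. This is only a sketch: `termsForColumn` is a name I made up, and it assumes you pass in the same `HashingTF` instance (same number of features) that built your vectors. Because of hash collisions, several terms can map to one column.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.rdd.RDD

// Hypothetical helper: list the distinct terms that HashingTF maps to a given column.
def termsForColumn(documents: RDD[Seq[String]],
                   hashingTF: HashingTF,
                   column: Int): Array[String] =
  documents
    .flatMap(doc => doc)                               // all terms in the corpus
    .distinct()
    .filter(term => hashingTF.indexOf(term) == column) // same hash that transform() uses
    .collect()
```

With your code that would be something like `termsForColumn(documents, hashingTF, 1638)`.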

In order to get the similarity between documents based on the terms they
contain, you'd need to build a RowMatrix where each row represents one term
and each column represents one document. One way to do that would be to
construct a CoordinateMatrix from your vectors, call transpose() on it,
then convert it to a RowMatrix via toRowMatrix().
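
A rough sketch of that approach, in case it helps. The function name `documentSimilarities` is mine, and using `zipWithIndex()` to assign document ids is an assumption about how you want to identify rows; I haven't run this against your data.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

// Hypothetical helper: cosine similarity between documents (rows of tfidf).
def documentSimilarities(tfidf: RDD[Vector],
                         threshold: Double): RDD[((Long, Long), Double)] = {
  // One MatrixEntry(docId, termId, weight) per non-zero TF-IDF value.
  val entries: RDD[MatrixEntry] = tfidf.zipWithIndex().flatMap { case (vec, docId) =>
    val sv = vec.toSparse
    sv.indices.zip(sv.values).map { case (termId, weight) =>
      MatrixEntry(docId, termId.toLong, weight)
    }
  }

  // Transpose so that rows are terms and columns are documents.
  val termsByDocs: RowMatrix = new CoordinateMatrix(entries).transpose().toRowMatrix()

  // Column similarities of the transposed matrix are document similarities.
  termsByDocs.columnSimilarities(threshold).entries.map {
    case MatrixEntry(i, j, sim) => ((i, j), sim)
  }
}
```

The keys in the result are pairs of document ids as assigned by `zipWithIndex()`. The similarity matrix is upper triangular, so each pair appears once with i < j; to get a per-product list you would emit each pair in both directions and group by the first id.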

Hope that helps!

Best,
Karl

On Sat, Jan 30, 2016 at 4:30 PM Alan Prando <al...@gmail.com> wrote:

> Hi Folks!
>
> I am trying to implement a spark job to calculate the similarity of my
> database products, using only name and descriptions.
> I would like to use TF-IDF to represent my text data and cosine similarity
> to calculate all similarities.
>
> My goal is to get all similarities as a list once the job completes.
> For example:
> Prod1 = ((Prod2, 0.98), (Prod3, 0.88))
> Prod2 = ((Prod1, 0.98), (Prod4, 0.53))
> Prod3 = ((Prod1, 0.88))
> Prod4 = ((Prod1, 0.53))
>
> However, I am new to Spark and I am having trouble understanding
> what cosine similarity returns!
>
> My code:
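>     import org.apache.spark.mllib.feature.{HashingTF, IDF}
>     import org.apache.spark.mllib.linalg.Vector
>     import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
>     import org.apache.spark.rdd.RDD
>
>     val documents: RDD[Seq[String]] = sc.textFile(filename).map(_.split(" ").toSeq)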
>
>     val hashingTF = new HashingTF()
>     val tf: RDD[Vector] = hashingTF.transform(documents)
>     tf.cache()
>
>     val idf = new IDF(minDocFreq = 2).fit(tf)
>     val tfidf: RDD[Vector] = idf.transform(tf)
>
>     val mat = new RowMatrix(tfidf)
>
>     // Compute similar columns perfectly, with brute force.
>     val exact = mat.columnSimilarities()
>
>     // Compute similar columns with estimation using DIMSUM
>     val approx = mat.columnSimilarities(0.1)
>
>     val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
>     val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
>
> The file just contains one product name and description per row.
>
> The result I got:
>     approxEntries.first()
>     res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)
>
> How can I figure out which row this result refers to?
>
> Thanks in advance! =]