Posted to by Soheil Pourbafrani <> on 2018/11/02 19:47:55 UTC

Multiplying a matrix by its transpose gives undesired output


I want to compute the cosine similarities of vectors using Apache Spark. In
a simple example, I created a vector from each document using the built-in
TF-IDF. Here is the code:

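from pyspark.ml.feature import HashingTF, IDF, Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Term frequencies from the tokenized documents
hashingTF = HashingTF(inputCol="tokenized", outputCol="tf")
tf = hashingTF.transform(df)

# Re-weight the term frequencies by inverse document frequency
idf = IDF(inputCol="tf", outputCol="feature").fit(tf)
tfidf = idf.transform(tf)

# Scale every vector to unit L2 norm
normalizer = Normalizer(inputCol="feature", outputCol="norm")
data = normalizer.transform(tfidf)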
mat = IndexedRowMatrix("id", "norm")\ row: IndexedRow(,
dot = mat.multiply(mat.transpose())

I expect the output to be a matrix whose diagonal is all ones (because each
vector's similarity to itself is one), and that is indeed what I get:
[[1. 0.]
 [0. 1.]]
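
The ones on the diagonal follow from the Normalizer step: once every row has
unit L2 norm, each entry of the matrix times its transpose is the cosine of
the angle between two rows. Below is a minimal plain-Python sketch of that
reasoning. The rows are made up for illustration (they are not the actual
TF-IDF vectors), and the helper names l2_normalize and gram are hypothetical.

```python
import math

def l2_normalize(row):
    """Scale a row to unit Euclidean length (the L2 case of Normalizer)."""
    norm = math.sqrt(sum(x * x for x in row))
    # Leave an all-zero row unchanged to avoid dividing by zero.
    return [x / norm for x in row] if norm else list(row)

def gram(rows):
    """Return rows * rows^T; entry [i][j] is the dot product of rows i and j."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

# Hypothetical, mutually orthogonal rows chosen only for illustration.
rows = [[3.0, 0.0], [0.0, 4.0]]
unit_rows = [l2_normalize(r) for r in rows]
product = gram(unit_rows)
print(product)  # [[1.0, 0.0], [0.0, 1.0]]
```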

The problem arises when I weight the words with something other than
standard TF-IDF. I compute the vector space myself and create a vector for
each document in which the indices of the document's words carry the new
weights and every other entry is zero.
For example, the following vector is for document id 0:
(0, [9.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
3.3010299956639813, 3.3010299956639813, 0, 3.3010299956639813, 0,

When I compute the cosine similarity for this matrix, the result is not
correct, because the similarity of a document to itself is not 1:

mat = IndexedRowMatrix( row: IndexedRow(row[0],
dot = mat.multiply(mat.transpose())

The output for the same dataset is:
[[124.58719613  81.        ]
 [ 81.         407.90397097]]

while with the Spark TF-IDF approach it was:
[[1. 0.]
 [0. 1.]]

What is wrong with my approach?
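
One possible explanation, offered as a sketch and not as a confirmed
diagnosis: the first pipeline passes the vectors through Normalizer before
multiplying, whereas the custom-weighted vectors appear to be multiplied
as-is. Without unit-length rows, the diagonal of the product holds each row's
squared norm instead of 1. The reported entry 124.58719613 equals
9.0^2 + 4 * 3.3010299956639813^2, which would be consistent with document 0
having four entries of that weight, although the vector shown above is
truncated. The rows below are hypothetical stand-ins, and cosine_matrix is an
illustrative helper, not a Spark API; it assumes no row is all zeros.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def cosine_matrix(rows):
    """Divide every pairwise dot product by the product of the two row norms."""
    norms = [math.sqrt(dot(r, r)) for r in rows]
    return [[dot(r, s) / (nr * ns) for s, ns in zip(rows, norms)]
            for r, nr in zip(rows, norms)]

# Hypothetical un-normalized rows that share one weighted term (index 0).
rows = [[9.0, 2.0, 0.0], [9.0, 0.0, 5.0]]

# Plain product with the transpose: the diagonal is each row's squared norm.
raw = [[dot(r, s) for s in rows] for r in rows]
print(raw)  # [[85.0, 81.0], [81.0, 106.0]]

# Dividing by the norms brings the diagonal back to 1 (up to rounding).
cos = cosine_matrix(rows)
print(cos)
```

In the Spark code, the corresponding step would presumably be to run the
custom-weighted vectors through the same Normalizer before building the
matrix, so that the product again yields cosine similarities.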