Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/05/31 04:22:19 UTC

RowSimilarityJob

What value does RowSimilarityJob produce to describe similarity? The 
paper which describes how the algorithm is implemented doesn't describe 
the various similarity values returned by Mahout. It seems to focus on 
cooccurrences.

For SIMILARITY_COSINE is the value = cosine or 1 - cosine?

Is the value calculated separately, after cooccurrences have determined 
which docs are similar?

The code is very difficult to read so a little help would be appreciated.

Re: RowSimilarityJob

Posted by Suneel Marthi <su...@yahoo.com>.
To answer your question, Pat: for SIMILARITY_COSINE the value returned is the cosine itself, not 1 - cosine.
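
For reference, here is a minimal sketch of what "cosine" means for two rows. It is plain Python, not Mahout's implementation; the function name cosine_similarity and the dict-based sparse rows are assumptions made only for illustration.

```python
import math


def cosine_similarity(row_a, row_b):
    """Cosine of the angle between two sparse rows given as {column: weight} dicts.

    Illustrative only: this is not Mahout code, just the textbook formula
    dot(a, b) / (|a| * |b|).
    """
    # Only columns present in both rows contribute to the dot product.
    dot = sum(w * row_b[c] for c, w in row_a.items() if c in row_b)
    norm_a = math.sqrt(sum(w * w for w in row_a.values()))
    norm_b = math.sqrt(sum(w * w for w in row_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero row has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = {1: 2.0, 5: 1.0}
    doc_b = {1: 4.0, 5: 2.0}  # same direction as doc_a, twice the length
    doc_c = {7: 3.0}          # shares no columns with doc_a
    print(cosine_similarity(doc_a, doc_b))  # very close to 1.0
    print(cosine_similarity(doc_a, doc_c))  # 0.0
```

Rows that point in the same direction score close to 1 regardless of their length, and rows with no columns in common score 0, so a larger value means more similar.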




Re: RowSimilarityJob

Posted by Suneel Marthi <su...@yahoo.com>.
Pat,

Here is an example from the output of the rowsimilarity job for a corpus I am working with (using Cosine Similarity).

Key: 25: Value: {27433:0.9999999999999994}


What this means is that Document# 25 is similar to Document# 27433 with a cosine similarity of 0.9999999999999994 (approximately 1.0).

Since Distance = (1 - Similarity), this means that the distance between documents 25 and 27433 above is effectively 0 (= 1 - 0.9999999999999994), or in other words they are nearly identical under this measure.
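
As a rough sketch of that arithmetic, the snippet below parses one dumped line like the one above and converts each similarity to a distance. The helper names parse_similarity_line and to_distances are hypothetical, and the assumption that multiple entries inside the braces are comma-separated goes beyond the single-entry example shown here.

```python
import re


def parse_similarity_line(line):
    """Parse 'Key: <row>: Value: {<col>:<sim>,...}' into (row_id, {col: sim}).

    Assumption: entries inside the braces are comma-separated col:sim pairs.
    """
    match = re.match(r"Key:\s*(\d+):\s*Value:\s*\{(.*)\}\s*$", line)
    if match is None:
        raise ValueError("unrecognized line: %r" % line)
    row_id = int(match.group(1))
    similarities = {}
    body = match.group(2).strip()
    if body:
        for pair in body.split(","):
            col, value = pair.split(":")
            similarities[int(col)] = float(value)
    return row_id, similarities


def to_distances(similarities):
    """Apply Distance = 1 - Similarity to every entry."""
    return {col: 1.0 - sim for col, sim in similarities.items()}


if __name__ == "__main__":
    row_id, sims = parse_similarity_line(
        "Key: 25: Value: {27433:0.9999999999999994}"
    )
    print(row_id, sims)
    print(to_distances(sims))  # distance for 27433 is tiny but not exactly 0
```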

Hope that clarifies.

Suneel


