Posted to user@mahout.apache.org by Jozua Sijsling <jo...@gmail.com> on 2014/08/26 12:51:13 UTC

ItemSimilarity based on k-means results

Dear Mahout users,


I am using a GenericBooleanPrefItemBasedRecommender where item similarity
is predetermined and based on k-means clustering results. I'm finding it
hard to rate the similarity of items even with the cluster data.

Given two items in the same cluster, I attempt to rate their similarity
based on the following calculation:

double distanceSquared = vector1.getDistanceSquared(vector2);
double radiusSquared = radius.getLengthSquared();
double diameterSquared = radiusSquared * 4;
double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
double similarity = distanceRate * 2 - 1;
double result = Math.max(-1, Math.min(similarity, 1));

Note that the clustered points are tf-idf vectors where each vector will
have thousands of dimensions.

I find that with this simple calculation I am making the false assumption
that radius.getLengthSquared() * 4 is a valid upper bound on the squared
distance between any two points in a cluster. This might have worked if the
clusters were n-spheres having the same radius on every axis, but that is
not the case. Often distanceSquared will be much larger than
diameterSquared, so my assumption fails and so does my calculation.
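For what it's worth, even a toy 2-D example shows the bound breaking. Plain
arrays, no Mahout; I'm assuming here that the cluster radius is the
per-dimension standard deviation of the members, which is how I understand
Mahout reports it:

```java
// Toy 2-D cluster showing why 4 * radius.getLengthSquared() is not an
// upper bound on the squared distance between two cluster members.
// Assumption: "radius" is the per-dimension standard deviation of the
// members around the centroid (here the centroid is the origin).
public class RadiusBoundDemo {

    static double distanceSquared(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = { {-3, 0}, {3, 0}, {0, 0}, {0, 0} };

        // Per-dimension standard deviation around the centroid (0, 0).
        double[] radius = new double[2];
        for (int dim = 0; dim < 2; dim++) {
            double var = 0;
            for (double[] p : points) {
                var += p[dim] * p[dim]; // deviation from centroid 0
            }
            radius[dim] = Math.sqrt(var / points.length);
        }

        double radiusSquared = radius[0] * radius[0] + radius[1] * radius[1];
        double diameterSquared = radiusSquared * 4;           // 18.0
        double dist = distanceSquared(points[0], points[1]);  // 36.0

        System.out.println(diameterSquared); // 18.0
        System.out.println(dist);            // 36.0, twice the "diameter"
    }
}
```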

Given that my approach does not work, how should I determine a 'similarity
score' based on in-cluster distance? Working with ~40k dimensions I fear
that any optimal solution is still going to be very slow, so I am perfectly
okay with a heuristic. But I don't know where to get started.

I am not a mathematician or even skilled in algorithms. Any advice on how
to approach this is very much appreciated.


Kind regards,
Jozua Sijsling

Re: ItemSimilarity based on k-means results

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Can you tell us what you are trying to do?

You have TF-IDF weighted vectors of terms, so you may be trying to calculate textual content based recommendations? In other words, you are trying to show similar documents when a user looks at a document? With 1000s of dimensions this means 1000s of unique terms, so these must be pretty big docs.

You can do this by encoding your vectors into a Mahout DRM (distributed row matrix) and running RowSimilarityJob on it using SIMILARITY_COSINE. You will get a list of similar docs for each doc. No need for GenericBooleanPrefItemBasedRecommender or clustering.
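To be clear about what SIMILARITY_COSINE computes per document pair: it is just the dot product divided by the product of the vector lengths. A minimal plain-Java sketch (arrays instead of Mahout Vectors):

```java
// Plain-Java sketch of cosine similarity, the measure RowSimilarityJob
// applies per document pair under SIMILARITY_COSINE. For non-negative
// TF-IDF weights the result lies in [0, 1], where 1 means the two
// documents point in exactly the same direction in term space.
public class CosineSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0; // treat an all-zero vector as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] doc1 = {1.0, 2.0, 0.0};
        double[] doc2 = {2.0, 4.0, 0.0}; // same direction, longer vector
        System.out.println(cosine(doc1, doc2)); // 1.0: length cancels out
    }
}
```

Note that cosine ignores vector length, which is exactly what you want for documents of different sizes.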
