Posted to user@mahout.apache.org by Jozua Sijsling <jo...@gmail.com> on 2014/08/26 12:51:13 UTC
ItemSimilarity based on k-means results
Dear Mahout users,
I am using a GenericBooleanPrefItemBasedRecommender where item similarity
is predetermined and based on k-means clustering results. I'm finding it
hard to rate the similarity of items even with the cluster data.
Given two items in the same cluster, I attempt to rate their similarity
based on the following calculation:
double distanceSquared = vector1.getDistanceSquared(vector2);
double radiusSquared = radius.getLengthSquared();
double diameterSquared = radiusSquared * 4;
double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
double similarity = distanceRate * 2 - 1;
double result = Math.max(-1, Math.min(similarity, 1));
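(For anyone following along, the intent of the mapping is that a distance of zero scores +1 and a distance equal to the assumed diameter scores -1. A self-contained sketch of just that arithmetic, with plain doubles instead of Mahout vectors and names of my own choosing:)

```java
// Sanity check of the similarity mapping above, stripped of Mahout types.
// distanceSquared = 0 should map to +1; distanceSquared = diameterSquared
// should map to -1; anything beyond is clamped.
public class SimilarityMappingDemo {
    static double similarity(double distanceSquared, double radiusSquared) {
        double diameterSquared = radiusSquared * 4;
        double distanceRate = (diameterSquared - distanceSquared) / diameterSquared;
        double s = distanceRate * 2 - 1;
        return Math.max(-1, Math.min(s, 1));
    }

    public static void main(String[] args) {
        System.out.println(similarity(0.0, 1.0)); // coincident points -> 1.0
        System.out.println(similarity(4.0, 1.0)); // at assumed diameter -> -1.0
    }
}
```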
Note that the clustered points are tf-idf vectors where each vector will
have thousands of dimensions.
I find that with this simple calculation I am making the false assumption
that radius.getLengthSquared() * 4 is a valid upper bound on the squared
distance between any two points in a cluster. This might have worked if the
clusters were n-spheres with the same radius on every axis, but that is not
the case. Often distanceSquared is much larger than diameterSquared, so my
assumption fails and so does my calculation.
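(To make the failure concrete: assuming the cluster "radius" is the per-dimension standard deviation of the member points, which is what Mahout reports, even a tiny one-dimensional cluster with an outlier breaks the bound. A sketch, with a helper name of my own:)

```java
// 1-D counterexample to the "maximum distance <= 2 * radius" assumption,
// where radius is taken to be the standard deviation of the members.
public class DiameterBoundDemo {
    // 4 * variance of the points, i.e. the assumed squared diameter.
    static double diameterSquaredBound(double[] points) {
        double mean = 0;
        for (double p : points) mean += p;
        mean /= points.length;
        double var = 0;
        for (double p : points) var += (p - mean) * (p - mean);
        var /= points.length;
        return 4 * var;
    }

    public static void main(String[] args) {
        double[] points = {0, 0, 0, 10};             // skewed cluster, one outlier
        double bound = diameterSquaredBound(points); // 4 * 18.75 = 75
        double actualMaxDistSq = 100;                // (10 - 0)^2 between extremes
        System.out.println(actualMaxDistSq > bound); // true: the bound fails
    }
}
```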
Given that my approach does not work, how should I determine a 'similarity
score' based on in-cluster distance? Working with ~40k dimensions I fear
that any optimal solution is still going to be very slow, so I am perfectly
okay with a heuristic. But I don't know where to get started.
I am not a mathematician or even skilled in algorithms. Any advice on how
to approach this is very much appreciated.
Kind regards,
Jozua Sijsling
Re: ItemSimilarity based on k-means results
Posted by Pat Ferrel <pa...@occamsmachete.com>.
Can you tell us what you are trying to do?
You have TF-IDF weighted vectors of terms, so you may be trying to calculate textual content based recommendations? In other words, you are trying to show similar documents when a user looks at a document? With 1000s of dimensions this means 1000s of unique terms, so these must be pretty big docs.
You can do this by encoding your vectors into a Mahout DRM (distributed row matrix) and running RowSimilarityJob on it using SIMILARITY_COSINE. You will get a list of similar docs for each doc. No need for GenericBooleanPrefItemBasedRecommender or clustering.
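For intuition, SIMILARITY_COSINE scores each pair of rows by the cosine of the angle between them, dot(a, b) / (|a| * |b|). A plain-Java sketch of that formula, using arrays rather than Mahout's Vector type:

```java
// Sketch of the per-pair score that SIMILARITY_COSINE produces:
// the cosine of the angle between two row vectors.
public class CosineDemo {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] doc1 = {1, 2, 0};   // toy tf-idf rows
        double[] doc2 = {2, 4, 0};   // same direction, twice the length
        System.out.println(cosine(doc1, doc2)); // parallel vectors -> 1.0
    }
}
```

Because the norms cancel out document length, cosine works well on tf-idf vectors of differently sized documents, which is one reason it is the usual choice here.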
On Aug 26, 2014, at 3:51 AM, Jozua Sijsling <jo...@gmail.com> wrote: