Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2009/12/26 19:53:20 UTC

Cosine and Tanimoto Similarity

I ran the Cosine and Tanimoto distance measures (d = 1 - similarity) on
the following vector pairs:

(-1, -1) and (3, 3)  Cosine: 2.0                 Tanimoto: 1.2307692307692308
(1, 1)   and (3, 3)  Cosine: 0.0                 Tanimoto: 0.5714285714285714
(1, 8)   and (8, 1)  Cosine: 0.7538461538461538  Tanimoto: 0.8596491228070176
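
For reference, a minimal standalone sketch (plain Java arithmetic, not
Mahout's DistanceMeasure classes) that reproduces those numbers: cosine
distance here is 1 - a.b/(|a||b|), and Tanimoto distance is
1 - a.b/(|a|^2 + |b|^2 - a.b).

public class DistanceCheck {
  static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  // Cosine distance: 1 - a.b / (|a| |b|)
  static double cosineDistance(double[] a, double[] b) {
    return 1.0 - dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
  }

  // Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b)
  static double tanimotoDistance(double[] a, double[] b) {
    double ab = dot(a, b);
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
  }

  public static void main(String[] args) {
    double[][][] pairs = {
        {{-1, -1}, {3, 3}},
        {{1, 1}, {3, 3}},
        {{1, 8}, {8, 1}},
    };
    for (double[][] p : pairs) {
      System.out.println("Cosine: " + cosineDistance(p[0], p[1])
          + "  Tanimoto: " + tanimotoDistance(p[0], p[1]));
    }
  }
}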

How should anti-parallel vectors be treated in the Mahout clustering
packages? Is it acceptable to return 2.0 for anti-parallel vectors and 1.0
for perpendicular ones? In the case of text data the vectors are all
positive, but if scientific data is being clustered, what should the
default behaviour be, given that clustering always tries to find a
configuration where the distances are at a minimum? Since I have dealt
mostly with text data, I would always take the absolute value of the cosine
similarity before subtracting it from 1.0. Has any of you encountered such
a situation with some particular dataset?
Robin

Re: Cosine and Tanimoto Similarity

Posted by Ted Dunning <te...@gmail.com>.
Floating point precision is not an issue with any of these metrics, since
the counts you are dealing with are never large enough for the numerical
accuracy (roughly 10^-7 for float, 10^-17 for double) to matter next to the
statistical uncertainty (roughly sqrt(number of observations)).

A much larger problem is actually with small counts, where coincidence can
give you a cosine similarity of 1 or very close to it.  Log-likelihood
ratio testing is a fine way to mask out the measures that are unlikely to
give good results.  Even better is to not train those numbers on actual
cooccurrence at all, but to use corpus frequency to weight the events that
have an interesting LLR.  Another way to deal with statistical noise is to
use a regularized learning system.
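
As a concrete sketch of the LLR test mentioned above (the class name and
the counts in main() are invented for illustration; Mahout has its own LLR
implementation, which should be preferred over this):

public class Llr {
  // x * log(x), with 0 log 0 taken as 0 so empty cells contribute nothing.
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy of a list of counts: (sum) log(sum) - sum x log x.
  static double entropy(long... counts) {
    long sum = 0;
    double xlx = 0.0;
    for (long x : counts) {
      sum += x;
      xlx += xLogX(x);
    }
    return xLogX(sum) - xlx;
  }

  // G^2 for the 2x2 cooccurrence table [k11 k12; k21 k22]:
  // 2 * (rowEntropy + columnEntropy - matrixEntropy), always >= 0.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double row = entropy(k11 + k12, k21 + k22);
    double col = entropy(k11 + k21, k12 + k22);
    double mat = entropy(k11, k12, k21, k22);
    return 2.0 * Math.max(0.0, row + col - mat);
  }

  public static void main(String[] args) {
    // Invented counts: A and B cooccur in 10 docs, appear alone in 20 each,
    // and 9950 docs contain neither.
    System.out.println(logLikelihoodRatio(10, 20, 20, 9950));
  }
}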

You need to deal with the problem one way or another.

On Sun, Dec 27, 2009 at 1:51 AM, Robin Anil <ro...@gmail.com> wrote:

> One thing I found very irritating when using cosine, or any measure in
> the range [0,1], is that sometimes two distinct items have a very small
> distance between them when you inspect them. I am always worried that the
> precision of a float is not enough to capture the small detail that makes
> the difference between accept and reject.  On the other hand,
> log-likelihood similarity seems to have values in the range 100+,
> sometimes even 1000+, for strong likelihoods; very unlikely events have
> small values < 1.0.
>
> In practice this kind of holds: as the number of documents increases, I
> usually have to scale cosine to a larger range or switch to some hybrid
> similarity metric to get good clustering.  What about you guys? I mean,
> both of you have worked on huge data sets; what kind of insights can you
> share about what works and what doesn't?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Cosine and Tanimoto Similarity

Posted by Robin Anil <ro...@gmail.com>.
One thing I found very irritating when using cosine, or any measure in the
range [0,1], is that sometimes two distinct items have a very small
distance between them when you inspect them. I am always worried that the
precision of a float is not enough to capture the small detail that makes
the difference between accept and reject.  On the other hand,
log-likelihood similarity seems to have values in the range 100+, sometimes
even 1000+, for strong likelihoods; very unlikely events have small
values < 1.0.

In practice this kind of holds: as the number of documents increases, I
usually have to scale cosine to a larger range or switch to some hybrid
similarity metric to get good clustering.  What about you guys? I mean,
both of you have worked on huge data sets; what kind of insights can you
share about what works and what doesn't?

Re: Cosine and Tanimoto Similarity

Posted by Ted Dunning <te...@gmail.com>.
As distances go, I prefer either the angle, in the 0 to pi range, or the
Euclidean distance, in the range 0 to 2.  You are correct that it is weird
that most things sit at distance pi/2 or 1, but that is the price of living
on an n-sphere.
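
As a minimal sketch of those two choices, assuming unit-normalized vectors
so that cos is just the dot product: the angle acos(cos) lives in [0, pi]
and the Euclidean chord length sqrt(2 - 2*cos) lives in [0, 2].

public class AngularDistances {
  // Both distances assume unit-normalized inputs.
  static double angle(double cos) {                  // range [0, pi]
    return Math.acos(Math.max(-1.0, Math.min(1.0, cos)));
  }

  static double euclidean(double cos) {              // range [0, 2]
    return Math.sqrt(2.0 - 2.0 * cos);               // chord length |a - b|
  }

  public static void main(String[] args) {
    for (double cos : new double[] {1.0, 0.0, -1.0}) {
      System.out.printf("cos=%5.1f  angle=%.4f  euclidean=%.4f%n",
          cos, angle(cos), euclidean(cos));
    }
    // cos=1 -> (0, 0); cos=0 -> (pi/2, sqrt(2)); cos=-1 -> (pi, 2).
  }
}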

For similarity, the only thing that really matters is that 0 means really
close, and that things which are pretty close come out closer than most
other things.

(but this is really just what you said)

On Sat, Dec 26, 2009 at 3:04 PM, Jake Mannix <ja...@gmail.com> wrote:

> but two randomly chosen vectors will have "similarity" 0.5, which seems
> weird to me.
>
> I far prefer the set of vectors which are uncorrelated with a given vector
> to have similarity with it clustered around zero, because that makes
> intuitive sense to me.  Distance is something different, and living in a
> compact space makes distance kind of weird, since there is a maximum
> value; scaling that to 1 is fine, and says that in general two points
> chosen at random are distance 0.5 away from each other.
>
> I guess it depends on how you look at it.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Cosine and Tanimoto Similarity

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Dec 26, 2009 at 2:47 PM, Ted Dunning <te...@gmail.com> wrote:

> One minor additional point is that you might want to use (1-cos)/2 in order
> to get a result in [0,1].
>

For distance, yeah, this can be fine, but for vectors which can have
negative components I don't like doing that with similarity (where 'that'
means "forcing the range to be [0,1]"), because then "perfect similarity"
is 1 (good so far), "perfect dissimilarity" (aka anticorrelated/antiparallel)
is 0 (still good), but two randomly chosen vectors will have "similarity"
0.5, which seems weird to me.
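
A quick empirical sketch of that last point (class name, dimension, and
trial count are made up for illustration): the cosine of two random
Gaussian vectors concentrates near 0, so the [0,1]-rescaled similarity
(1 + cos)/2 parks unrelated pairs at 0.5.

import java.util.Random;

public class RandomCosine {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    int dim = 1000;
    int trials = 1000;
    double sum = 0.0;
    for (int t = 0; t < trials; t++) {
      double[] a = new double[dim];
      double[] b = new double[dim];
      for (int i = 0; i < dim; i++) {
        a[i] = rnd.nextGaussian();
        b[i] = rnd.nextGaussian();
      }
      double ab = 0.0, aa = 0.0, bb = 0.0;
      for (int i = 0; i < dim; i++) {
        ab += a[i] * b[i];
        aa += a[i] * a[i];
        bb += b[i] * b[i];
      }
      sum += (1.0 + ab / Math.sqrt(aa * bb)) / 2.0;  // cosine rescaled to [0,1]
    }
    // Prints something very close to 0.5: unrelated pairs land mid-scale.
    System.out.println("mean rescaled similarity = " + sum / trials);
  }
}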

I far prefer the set of vectors which are uncorrelated with a given vector
to have similarity with it clustered around zero, because that makes
intuitive sense to me.  Distance is something different, and living in a
compact space makes distance kind of weird, since there is a maximum value;
scaling that to 1 is fine, and says that in general two points chosen at
random are distance 0.5 away from each other.

I guess it depends on how you look at it.

  -jake

Re: Cosine and Tanimoto Similarity

Posted by Ted Dunning <te...@gmail.com>.
One minor additional point is that you might want to use (1-cos)/2 in order
to get a result in [0,1].

On Sat, Dec 26, 2009 at 1:32 PM, Jake Mannix <ja...@gmail.com> wrote:

> I guess what I was saying is that if you take a less "normal"
> representation of text (a random projection, say, or a projection onto
> the SVD, etc.), you can get negative similarities which make sense, and
> in this case you have similarity == 1 for perfect alignment, 0 for
> uncorrelated, and -1 for anti-parallel, and you definitely *want* -1,
> not +1.
>
> Going with sqrt(2*(1-cos^2)) ~ theta is only good for small angles; for
> large angles this isn't so great anymore, and once the angle goes over
> pi/2 it's no longer monotonic and is most certainly doing the wrong
> thing, which is why I usually stick with 1-cos for distance if I'm not
> measuring similarity.
>
> I guess my question to you, Robin, is: why would you take the abs?  If
> the data is text, then yes, in a normal representation your coefficients
> are always positive, so all cosines are greater than zero and there's no
> need to take the abs, right?
>
> The only case where I'd imagine wanting to consider anti-parallel to be
> basically the same as parallel is the collaborative filtering case,
> where, as we've discussed on this list in the past, sometimes a negative
> rating is as much a measure of similarity as a positive one, and so if
> you've mean-centered your ratings, then you do want dot products which
> effectively take the abs as well.
>
> I'd say that is the exception, not the norm, however.
>
>  -jake



-- 
Ted Dunning, CTO
DeepDyve

Re: Cosine and Tanimoto Similarity

Posted by Jake Mannix <ja...@gmail.com>.
On Sat, Dec 26, 2009 at 12:18 PM, Ted Dunning <te...@gmail.com> wrote:

> These are fine as distance measures.  It is also common to use
> sqrt(1-cos^2), which is more like an angle, but 1-cos is good enough for
> almost anything.
>
> With normal text, btw, all of the coordinates are positive, so the largest
> possible angle is pi/2 (cos = 0, sin = 1).
>

I guess what I was saying is that if you take a less "normal"
representation of text (a random projection, say, or a projection onto the
SVD, etc.), you can get negative similarities which make sense, and in this
case you have similarity == 1 for perfect alignment, 0 for uncorrelated,
and -1 for anti-parallel, and you definitely *want* -1, not +1.

Going with sqrt(2*(1-cos^2)) ~ theta is only good for small angles; for
large angles this isn't so great anymore, and once the angle goes over pi/2
it's no longer monotonic and is most certainly doing the wrong thing, which
is why I usually stick with 1-cos for distance if I'm not measuring
similarity.
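
Here is a small numeric sketch of that non-monotonicity, assuming nothing
beyond the formulas above: sqrt(2*(1-cos^2)) equals sqrt(2)*|sin(theta)|,
so it peaks at pi/2 and falls back, while 1-cos keeps growing all the way
to pi.

public class SinVsOneMinusCos {
  public static void main(String[] args) {
    // The sin-based value for 120 degrees equals the value for 60 degrees,
    // so those two angles become indistinguishable; 1-cos stays monotonic
    // all the way to 180 degrees.
    for (double deg : new double[] {30, 60, 90, 120, 150, 180}) {
      double c = Math.cos(Math.toRadians(deg));
      System.out.printf("theta=%3.0f  sqrt(2*(1-cos^2))=%.4f  1-cos=%.4f%n",
          deg, Math.sqrt(2.0 * (1.0 - c * c)), 1.0 - c);
    }
  }
}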

I guess my question to you, Robin, is: why would you take the abs?  If the
data is text, then yes, in a normal representation your coefficients are
always positive, so all cosines are greater than zero and there's no need
to take the abs, right?

The only case where I'd imagine wanting to consider anti-parallel to be
basically the same as parallel is the collaborative filtering case, where,
as we've discussed on this list in the past, sometimes a negative rating is
as much a measure of similarity as a positive one, and so if you've
mean-centered your ratings, then you do want dot products which effectively
take the abs as well.

I'd say that is the exception, not the norm, however.

  -jake

Re: Cosine and Tanimoto Similarity

Posted by Ted Dunning <te...@gmail.com>.
These are fine as distance measures.  It is also common to use
sqrt(1-cos^2), which is more like an angle, but 1-cos is good enough for
almost anything.

With normal text, btw, all of the coordinates are positive, so the largest
possible angle is pi/2 (cos = 0, sin = 1).

On Sat, Dec 26, 2009 at 10:53 AM, Robin Anil <ro...@gmail.com> wrote:

> I ran the Cosine and Tanimoto distance measures (d = 1 - similarity) on
> the following vector pairs:
>
> (-1, -1) and (3, 3)  Cosine: 2.0                 Tanimoto: 1.2307692307692308
> (1, 1)   and (3, 3)  Cosine: 0.0                 Tanimoto: 0.5714285714285714
> (1, 8)   and (8, 1)  Cosine: 0.7538461538461538  Tanimoto: 0.8596491228070176
>
> How should anti-parallel vectors be treated in the Mahout clustering
> packages? Is it acceptable to return 2.0 for anti-parallel vectors and 1.0
> for perpendicular ones? In the case of text data the vectors are all
> positive, but if scientific data is being clustered, what should the
> default behaviour be, given that clustering always tries to find a
> configuration where the distances are at a minimum? Since I have dealt
> mostly with text data, I would always take the absolute value of the
> cosine similarity before subtracting it from 1.0. Has any of you
> encountered such a situation with some particular dataset?
> Robin
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Cosine and Tanimoto Similarity

Posted by Robin Anil <ro...@gmail.com>.
The anti-parallel concept doesn't come up in text data, where all the
weights are positive.  Think about it: you really can't have a document
where the word "apple" occurs -3 times.  But if you consider data which
actually has negative weights (I haven't encountered any such data either),
then the measure is subject to interpretation based on the data and the
type of clustering we wish to achieve.

Robin


On Sun, Dec 27, 2009 at 12:44 AM, Jake Mannix <ja...@gmail.com> wrote:

> Sorry, misfire!  I've usually tried to maximize similarity, without ever
> using abs, even on text.  Antiparallel is dissimilar, no?
>
> On Dec 26, 2009 11:12 AM, "Jake Mannix" <ja...@gmail.com> wrote:
>
> I've never treated text any differently, and
>
> > On Dec 26, 2009 10:54 AM, "Robin Anil" <ro...@gmail.com> wrote:
> >
> > I ran Cosine and tanim...
>

Re: Cosine and Tanimoto Similarity

Posted by Jake Mannix <ja...@gmail.com>.
Sorry, misfire!  I've usually tried to maximize similarity, without ever
using abs, even on text.  Antiparallel is dissimilar, no?

On Dec 26, 2009 11:12 AM, "Jake Mannix" <ja...@gmail.com> wrote:

I've never treated text any differently, and

> On Dec 26, 2009 10:54 AM, "Robin Anil" <ro...@gmail.com> wrote:
>
> I ran Cosine and tanim...

Re: Cosine and Tanimoto Similarity

Posted by Jake Mannix <ja...@gmail.com>.
I've never treated text any differently, and

On Dec 26, 2009 10:54 AM, "Robin Anil" <ro...@gmail.com> wrote:

I ran the Cosine and Tanimoto distance measures (d = 1 - similarity) on
the following vector pairs:

(-1, -1) and (3, 3)  Cosine: 2.0                 Tanimoto: 1.2307692307692308
(1, 1)   and (3, 3)  Cosine: 0.0                 Tanimoto: 0.5714285714285714
(1, 8)   and (8, 1)  Cosine: 0.7538461538461538  Tanimoto: 0.8596491228070176

How should anti-parallel vectors be treated in the Mahout clustering
packages? Is it acceptable to return 2.0 for anti-parallel vectors and 1.0
for perpendicular ones? In the case of text data the vectors are all
positive, but if scientific data is being clustered, what should the
default behaviour be, given that clustering always tries to find a
configuration where the distances are at a minimum? Since I have dealt
mostly with text data, I would always take the absolute value of the cosine
similarity before subtracting it from 1.0. Has any of you encountered such
a situation with some particular dataset?
Robin