Posted to user@mahout.apache.org by Bogdan Vatkov <bo...@gmail.com> on 2010/01/06 23:26:04 UTC

Cluster distance

What is the practical meaning of the "cluster distance"? For example, I am
currently using org.apache.mahout.common.distance.CosineDistanceMeasure, but I
have no idea what that means or what other values could bring to the
game. Any guidance here?

-- 
Best regards,
Bogdan

Re: Cluster distance

Posted by Felix Lange <fx...@googlemail.com>.

Take for example the following two measures. The first takes the distance
between centers as the cluster distance, and the second takes the minimal
pairwise distance between elements of two clusters as the cluster distance.
If you use the first measure, two clusters can look far apart even though they
almost overlap. Say you have two clusters of strings, {xAAA, yAAA, zAAA} and
{xBBB, uBBB, uBBB} (where the string distance measure is based on identity of
letters). The distance between the means of the two clusters might be big,
because the three-letter block in each string differs between the two
clusters. But the strings xAAA and xBBB are quite similar, because they
share a letter (which is also at the same position in each string). So the
minimal distance between elements of those clusters could be quite small,
something you may want to avoid. This could be an example in which you
choose the second distance measure and impose a minimal value as a threshold
for the clustering.
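
A minimal Python sketch of the two measures described above; it is not Mahout
code. It assumes every string has four letters (xAAA rather than xAAAA), uses
Hamming distance as the letter-identity measure, and represents a cluster
center as per-position letter frequencies. All function names are made up for
illustration.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def centroid(cluster):
    """Cluster center: maps (position, letter) to the fraction of members having it."""
    counts = Counter((i, ch) for s in cluster for i, ch in enumerate(s))
    return {key: n / len(cluster) for key, n in counts.items()}


def centroid_distance(c1, c2):
    """Half the L1 distance between centers; equals Hamming for one-string clusters."""
    m1, m2 = centroid(c1), centroid(c2)
    keys = set(m1) | set(m2)
    return sum(abs(m1.get(k, 0.0) - m2.get(k, 0.0)) for k in keys) / 2


def min_pairwise_distance(c1, c2):
    """Smallest Hamming distance between any member of c1 and any member of c2."""
    return min(hamming(a, b) for a, b in product(c1, c2))


cluster_a = ["xAAA", "yAAA", "zAAA"]
cluster_b = ["xBBB", "uBBB", "uBBB"]

print("center distance:      ", centroid_distance(cluster_a, cluster_b))
print("min pairwise distance:", min_pairwise_distance(cluster_a, cluster_b))
```

Under these assumptions the center distance comes out at 11/3 (about 3.67),
while the minimal pairwise distance is 3, reached by the pair xAAA and xBBB.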

fx

Re: Cluster distance

Posted by Sean Owen <sr...@gmail.com>.

Yes, I mean, it's possible to explain what each of the algorithms
does, both formally and intuitively. But it's hard to explain in which
cases one metric might be more desirable than another. Yes, in a sense
they all should do the same thing -- define a consistent distance
metric between vectors. But they're different metrics.

Even I don't try to think too hard about which one is best. I just try
them all when trying to fit the best algorithm to a set of data. So
it's still good to have different implementations.
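
To make "they're different metrics" concrete, here is a small Python sketch
(plain Python, not the Mahout classes) in which the nearest neighbour of the
same query vector changes with the measure. The vectors and names are invented
for illustration.

```python
import math


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 1.0, 0.0]
candidates = {
    # Same term proportions as the query, but a much "longer" document.
    "same_direction_longer": [4.0, 4.0, 0.0],
    # Similar length, but only one term in common with the query.
    "nearby_other_direction": [1.0, 0.0, 1.0],
}


def nearest(measure):
    """Name of the candidate closest to the query under the given measure."""
    return min(candidates, key=lambda name: measure(query, candidates[name]))


for measure in (manhattan, euclidean, cosine_distance):
    print(measure.__name__, "->", nearest(measure))
```

Manhattan and Euclidean pick the vector of similar length, while cosine picks
the vector with the same term proportions, because it ignores document length.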

Re: Cluster distance

Posted by Ted Dunning <te...@gmail.com>.

The best rule is to try several cases.  L-1 and L-2 with or without
normalization are the most important cases.

The k-means clustering assumes that you have already done any term
weighting.  You should experiment a little bit there as well, but the
standard IDF measure is probably fine.  The only question is whether you
should limit the weight of singleton terms somewhat.  With large corpora,
that is less critical.  Also, if you don't use L-2 normalization, then what
you do with very rare terms will matter much less since they probably won't
ever match with anything and thus won't contribute to dot products.
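
One way to read the point about rare terms, as a Python sketch with made-up
term weights (this is an illustration, not Mahout code): a singleton term
never matches anything, so it leaves the raw dot product alone, but it does
inflate the L-2 norm and so lowers the normalized dot product.

```python
import math


def dot(a, b):
    """Dot product of two sparse term-weight dictionaries."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def l2_normalize(v):
    """Scale a sparse vector to unit L-2 length."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {term: w / norm for term, w in v.items()}


# Hypothetical weights for two terms that both documents share.
shared = {"cluster": 2.0, "distance": 2.0}
doc_b = dict(shared)


def similarities(singleton_weight):
    """Raw and L-2-normalized dot products when doc_a has one extra singleton term."""
    doc_a = dict(shared, zyzzyva=singleton_weight)
    raw = dot(doc_a, doc_b)
    normalized = dot(l2_normalize(doc_a), l2_normalize(doc_b))
    return raw, normalized


for weight in (0.0, 5.0, 10.0):
    raw, normalized = similarities(weight)
    print(f"singleton weight {weight:4.1f}: raw dot {raw:.1f}, normalized dot {normalized:.3f}")
```

The raw dot product stays the same for every singleton weight, while the
normalized one shrinks as the weight grows, which is where limiting the weight
of singleton terms would make a difference.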

On Thu, Jan 7, 2010 at 5:20 AM, Grant Ingersoll <gs...@apache.org> wrote:

> I'm sure others can chime in w/ more of their experience.




-- 
Ted Dunning, CTO
DeepDyve

Re: Cluster distance

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 7, 2010, at 7:51 AM, Bogdan Vatkov wrote:

> I see but I was looking for more practical definition - e.g. if I use one or
> another distance measure class what would be the effect.
> The mathematical explanations in the javadoc are not helping much.
> If there is no way to explain different algorithms for distance in more
> practical way then maybe we do not need different algorithms :)
> - e.g. is the distance affected more by the number of common terms or the
> weights of common terms or ... - this is just a possible example, I do not
> know if it matches any of the distance algorithms.
> there should be a guidance for the ones that will use the stuff - it is
> expected that these users know something about their input data and based on
> different characteristics of that data (e.g. number of docs, doc size, etc.)
> and desired result (e.g. number of clusters, number of unique term in
> clusters, etc.) to be able to pick the right Mahout configuration - with
> regards to numbers, classes, algorithms, etc.
> I currently miss such a guideline.

Typically, the source of the data dictates the measures, etc.  AIUI, text is best represented by normalizing with a 1- or 2-norm and then using the appropriate distance measure (Manhattan, Euclidean or Cosine).  Some of the other measures are best suited for other kinds of data, but I don't have a good sense for them yet.  I've gotten decent to good results (not formally validated) on news text using Cosine and vectors normalized using the 2-norm.   As Ted said to me the other day, though, K-Means, in particular, is fairly robust even if you aren't strict about matching the normalization w/ the distance measure.
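
A short Python sketch of normalizing and then measuring, using plain lists
rather than Mahout vectors (the helper names and sample values are
assumptions). It also checks one well-known relation that may help explain why
Euclidean and Cosine behave alike on 2-normalized vectors: for unit-length
vectors, the squared Euclidean distance equals twice the cosine distance.

```python
import math


def normalize(v, p):
    """Divide v by its p-norm (p=1 gives the 1-norm, p=2 the 2-norm)."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / norm for x in v]


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up term-weight vectors for two documents.
a = [3.0, 0.0, 4.0, 1.0]
b = [1.0, 2.0, 2.0, 0.0]

a1, b1 = normalize(a, 1), normalize(b, 1)  # 1-norm, paired with Manhattan
a2, b2 = normalize(a, 2), normalize(b, 2)  # 2-norm, paired with Euclidean or Cosine

print("Manhattan on 1-normalized:", manhattan(a1, b1))
print("Euclidean on 2-normalized:", euclidean(a2, b2))
print("Cosine on 2-normalized:   ", cosine_distance(a2, b2))
print("Euclidean^2 vs 2 * cosine:", euclidean(a2, b2) ** 2, 2 * cosine_distance(a2, b2))
```

Cosine distance gives the same value with or without normalization, since it
already divides by the vector lengths.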

I do think, though, that a lot of it comes down to trial and error with your data.  We are working on some scripts to make this a lot easier.  One of the things that we need to do is build up benchmarks w/ common collections (see the Open Relevance Project under Lucene) so that people can make comparisons and see how this all works.

I'm sure others can chime in w/ more of their experience.

Re: Cluster distance

Posted by Bogdan Vatkov <bo...@gmail.com>.

I see, but I was looking for a more practical definition - e.g. if I use one or
another distance measure class, what would the effect be?
The mathematical explanations in the javadoc are not helping much.
If there is no way to explain the different distance algorithms in a more
practical way, then maybe we do not need different algorithms :)
- e.g. is the distance affected more by the number of common terms or the
weights of common terms or ... - this is just a possible example, I do not
know if it matches any of the distance algorithms.
There should be guidance for those who will use the stuff - it is
expected that these users know something about their input data and, based on
different characteristics of that data (e.g. number of docs, doc size, etc.)
and the desired result (e.g. number of clusters, number of unique terms in
clusters, etc.), are able to pick the right Mahout configuration - with
regard to numbers, classes, algorithms, etc.
I currently miss such a guideline.

On Thu, Jan 7, 2010 at 2:38 PM, Felix Lange <fx...@googlemail.com> wrote:

> Hi Bogdan,
> I didn't read any javadocs about this package, but the cluster distance
> should be the distance between two clusters. There are different distance
> measures in this respect, e.g. you can take the distance between two
> clusters' centers as their distance value.
> Greetings
> Felix

-- 
Best regards,
Bogdan

Re: Cluster distance

Posted by Felix Lange <fx...@googlemail.com>.

Hi Bogdan,
I didn't read any javadocs about this package, but the cluster distance
should be the distance between two clusters. There are different distance
measures in this respect, e.g. you can take the distance between two
clusters' centers as their distance value.
Greetings
Felix

