Posted to user@mahout.apache.org by Paulo Magalhaes <pa...@gmail.com> on 2011/07/01 23:37:18 UTC

fuzzy kmeans - all cluster with the same top terms

Hi all,

I believe there is something wrong with fkmeans in trunk.

I am using code from trunk (last checkout 6/30/11). Reproducing it is very
simple:
1) change examples/bin/build-reuters.sh to use fkmeans and set -m 2
2) run build-reuters.sh
3) Dump the clusters. I'm doing: ../../bin/mahout clusterdump -dt
sequencefile -s ./mahout-work/reuters-kmeans/clusters-6 -b 100 -o
./reuters-clusterdump.txt  -d
./mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0

If you check reuters-clusterdump.txt, you will notice that every cluster has
the same top terms and the same number of documents.

It is my first time using it, so there is a good chance I'm doing
something wrong :).
Is this something I should report in the issue tracker?

Thanks in advance,
Paulo.

RE: fuzzy kmeans - all cluster with the same top terms

Posted by Jeff Eastman <je...@Narus.com>.

I agree there may be something amiss with FuzzyK. If you compare the circa 0.4 wiki photo of running DisplayFuzzyKMeans (https://cwiki.apache.org/confluence/display/MAHOUT/Fuzzy+K-Means) with the current output of that example, you will see that it is not generating clusters as tight as it did before on the same data. It could very well be that the distance-to-membership% calculations (computeProbWeight) have gotten bent during some of the refactoring that has occurred in the interim.

I'm looking at that code and don't see anything obvious, but more eyeballs would help. The display example is, by default, running an experimental version of the algorithm using the ClusterClassifier, which does not really deal with m, so you will need to set the Boolean runClusterer=true to use the regular sequential algorithm. That example uses 2-d vectors on a small field, which is easier to debug than the mapreduce version.

Or, it might just be the curse of dimensionality on your data that is causing all the distances to be about equal.
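
To make that last point concrete, here is a minimal sketch of the textbook fuzzy c-means membership formula, u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)). It is not Mahout's computeProbWeight code; the function name, the eps guard and the sample distances are assumptions chosen for illustration. It shows that when a point is about equally far from every center, its membership is split almost evenly, so every cluster ends up with nearly the same weighted contents.

```python
def fuzzy_memberships(distances, m=2.0):
    """Textbook fuzzy c-means membership of one point in each cluster.

    distances: distance from the point to each cluster center.
    m: fuzziness exponent, must be greater than 1.
    """
    eps = 1e-10  # assumed guard against division by zero
    d = [max(x, eps) for x in distances]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** exponent for dj in d) for di in d]


# Hypothetical distances: nearly equal, as reported for the sparse text vectors.
near_equal = fuzzy_memberships([1.41, 1.42, 1.40, 1.41])
# Hypothetical distances: the point is clearly closest to the first center.
separated = fuzzy_memberships([0.2, 1.4, 1.4, 1.4])

print(near_equal)  # all four memberships close to 0.25
print(separated)   # first membership dominates
```

With m=2 the near-equal case gives each cluster roughly a quarter of the point, which would be consistent with identical top terms in every cluster even if the membership code itself were correct.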

Jeff

-----Original Message-----
From: Jeff Hansen [mailto:dscheffy@gmail.com] 
Sent: Wednesday, August 17, 2011 9:33 AM
To: user@mahout.apache.org
Subject: Re: fuzzy kmeans - all cluster with the same top terms

I'm hitting the same problem.

I'm using movie description data to try clustering movies (the descriptive
text is from freebase.com).  K-means was working fine for me, but when I
tried out fuzzy k-means (using trunk) I got the same result as you, Paulo.

Here are the parameters I'm passing to the MahoutDriver job:
fkmeans -i movies-vectors/tfidf-vectors -o movies-clusters/fkmeans -k 10
--maxIter 10 --clusters clusters -cd 0.1 -m 2 -ow -cl -dm
org.apache.mahout.common.distance.EuclideanDistanceMeasure
(I also tried Tanimoto distance with the same results)

I've been running it locally so I can step through the code in Eclipse, but
I can't tell if what I'm seeing is normal. In the mapper I notice that the
distances in the clusterDistanceList all tend to come back very similar
(nearly always 1 for Tanimoto and nearly always 1.4 (sqrt of 2) for
Euclidean distance).  My vectors are 39311 long (using trigrams with
minloglikelihood of 50) and they're all normalized with n=2.
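
Those two values are what one would expect for unit-length vectors that share almost no terms: for unit vectors ||a-b||^2 = 2 - 2(a.b), so a dot product near 0 gives a Euclidean distance near sqrt(2), and the usual Tanimoto distance 1 - a.b / (|a|^2 + |b|^2 - a.b) goes to 1. The sketch below illustrates this with two made-up sparse vectors; the helper names and the term indices and weights are assumptions, not data from this thread.

```python
import math


def normalize(vec):
    """Scale a sparse vector (dict of index -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()}


def dot(a, b):
    """Dot product over the indices the two sparse vectors share."""
    return sum(v * b[k] for k, v in a.items() if k in b)


def euclidean(a, b):
    # ||a - b||^2 = a.a + b.b - 2 a.b; clamp tiny negatives from rounding.
    return math.sqrt(max(dot(a, a) + dot(b, b) - 2.0 * dot(a, b), 0.0))


def tanimoto_distance(a, b):
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)


# Two hypothetical documents that share only one low-weight term (index 17).
doc_a = normalize({3: 2.0, 17: 1.0, 905: 4.0})
doc_b = normalize({8: 1.0, 17: 0.2, 30211: 3.0})

print(euclidean(doc_a, doc_b))          # close to sqrt(2)
print(tanimoto_distance(doc_a, doc_b))  # close to 1
```

If most document-to-centroid pairs look like this, the distances carry very little information for either k-means or fuzzy k-means, which is why comparing against the distances seen in standard k-means is a sensible next step.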

I guess my next step will be to step through the standard kmeans code and
see if the distances come back much different from there.

