Posted to user@mahout.apache.org by Vasil Vasilev <va...@gmail.com> on 2011/01/04 09:56:26 UTC

Clustering with Mahout

Hello,

I started getting to know Apache Mahout clustering by following the quick
start guide. I ran the Dirichlet clustering algorithm over the synthetic
control data that are available in the example, but the results are not
quite satisfactory. I noticed that the number of clusters is estimated
approximately correctly, but the data are very mixed and the clusters are
not well separated.
What could be the reason for that? Does the proposed vector representation
really lead to Gaussian-distributed points?

I investigated this issue further by adding a little more semantics. I
produced 3-dimensional vectors from the data with the following
characteristics:
dimension 1: the average angle of the synthetic control signal line,
estimated by running a linear regression for each signal
dimension 2: the number of times the synthetic control line crosses an
average straight line
dimension 3: the largest shift of values in any direction (up or down)
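
A minimal Python sketch of how these three features could be computed for
one signal. This is not the code I ran and it does not use Mahout. It also
makes two assumptions that the description above leaves open: the "average
straight line" is taken to be the horizontal line at the signal mean, and
the "largest shift" is taken to be the largest absolute difference between
two consecutive samples. The function names are made up for illustration.

```python
import math
from statistics import mean


def slope_angle(signal):
    """Angle (in radians) of the least-squares line fitted to the signal.

    The sample index 0..n-1 is used as the x coordinate; needs n >= 2.
    """
    n = len(signal)
    x_bar = (n - 1) / 2
    y_bar = mean(signal)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(signal))
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    return math.atan(sxy / sxx)


def mean_crossings(signal):
    """Number of times the signal crosses the horizontal line at its mean.

    Assumption: the 'average straight line' is the mean of the signal.
    """
    y_bar = mean(signal)
    above = [y > y_bar for y in signal]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)


def largest_shift(signal):
    """Largest absolute change between two consecutive samples.

    Assumption: a 'shift' is the step from one sample to the next.
    """
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


def features(signal):
    """3-dimensional feature vector for one control signal."""
    return [slope_angle(signal),
            float(mean_crossings(signal)),
            largest_shift(signal)]
```

Note that the three dimensions live on very different scales (an angle, a
count and a value difference), which may be one reason why the choice of
distance measure matters so much here.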

When I ran the algorithm with these parameters, I noticed that the data are
now classified correctly using the L1 distance measure (this measure turned
out to give the best results). However, the algorithm becomes parametric,
i.e. the alpha parameter and the initial number of clusters strongly affect
the final results. In addition, using Gaussian clusters with this approach
leads to bad results.

Finally, I found that one should first run the example as it is in order to
get an overview of the number of clusters, and then add some semantics in
order to produce the correct clusters using some distance measure.

In this respect, my question is: how should one approach a new clustering
problem? Maybe you could recommend some reading material.

In addition, while experimenting, I noticed several problems. If they are
bugs, I could report them:

1. AsymmetricSampledNormalModel does not work correctly. On line 125 it is
enough for just one of the probabilities to be 0 for the whole probability
to become 0. This happens because the exponential evaluates to 0 for very
small (large negative) arguments, which in turn occurs when the standard
deviation is too small and the distance is much larger. To fix this,
wouldn't it be better to choose the initial standard deviations in
AsymmetricSampledNormalDistribution (sampleFromPrior method) based on the
data (for example, the maximum data value)? GaussianCluster, for example,
calculates the pdf in a different way, by summing the probabilities.

2. CosineDistanceMeasure does not work correctly, because the initial
clusters have zero centers and it is impossible to determine the angle
between the zero vector and another vector.

3. MahalanobisDistanceMeasure cannot be used with the Dirichlet clusterer,
because the configure method is not called.
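
To illustrate the numerical effects behind points 1 and 2, here is a small
standalone Python sketch. It is not Mahout code and does not reproduce the
Mahout classes. It only assumes that a probability is formed as a product of
one-dimensional normal densities, and it uses the usual definition of cosine
distance (1 minus the cosine of the angle). Working in log space is shown as
one possible way around the underflow, not as what Mahout does.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a one-dimensional normal distribution."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def product_pdf(x, mu, sd):
    """Product of per-dimension densities; a single 0 factor zeroes it."""
    p = 1.0
    for xi, mi, si in zip(x, mu, sd):
        p *= normal_pdf(xi, mi, si)
    return p


def log_pdf(x, mu, sd):
    """Same quantity in log space; stays finite when the product is 0."""
    total = 0.0
    for xi, mi, si in zip(x, mu, sd):
        z = (xi - mi) / si
        total += -0.5 * z * z - math.log(si) - 0.5 * math.log(2 * math.pi)
    return total


def cosine_distance(a, b):
    """1 - cos(angle); returns None when either vector has zero length."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # the angle to a zero vector is undefined
    return 1.0 - dot / (norm_a * norm_b)


# Small standard deviation in one dimension, point far from the mean:
x, mu, sd = [1.0, 100.0], [1.0, 0.0], [1.0, 0.01]
print(product_pdf(x, mu, sd))   # the second factor underflows to 0.0
print(log_pdf(x, mu, sd))       # large negative, but still finite
print(cosine_distance([0.0, 0.0, 0.0], [1.0, 2.0, 3.0]))  # None
```

With a larger initial standard deviation (for example one derived from the
range of the data) the exponent stays in a range where exp() does not
underflow, which is the motivation for the fix suggested in point 1.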

Regards, Vasil