Posted to commits@mahout.apache.org by co...@apache.org on 2009/03/18 18:12:00 UTC

[CONF] Apache Lucene Mahout: Dirichlet Process Clustering (page edited)

Dirichlet Process Clustering (MAHOUT) edited by Jeff Eastman
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/Dirichlet+Process+Clustering
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=101992&originalVersion=4&revisedVersion=5

Content:
---------------------------------------------------------------------

h1. Overview

The Dirichlet Process Clustering algorithm performs Bayesian mixture modeling.

The idea is that we use a probabilistic mixture of a number of models to explain some observed data. Each observed data point is assumed to have come from one of the models in the mixture, but we don't know which. We deal with that by introducing a so-called latent parameter which specifies which model each data point came from.
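
In standard mixture-model notation (a general formulation, not anything specific to Mahout's implementation), the density of an observation x under K models with parameters theta_1..theta_K and mixing probabilities pi_1..pi_K is

p(x) = pi_1 * p(x|theta_1) + ... + pi_K * p(x|theta_K)

and the latent parameter z_i for point x_i simply records which of the K models generated it.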


In addition, since this is a Bayesian clustering algorithm, we don't want to actually commit to any single explanation, but rather to sample from the distribution of models and latent assignments of data points to models given the observed data and the prior distributions of model parameters. This sampling process is initialized by taking models at random from the prior distribution for models.

Then, we iteratively assign points to the different models using the mixture probabilities and the degree of fit between the point and each model expressed as a probability that the point was generated by that model. After points are assigned, new parameters for each model are sampled from the posterior distribution for the model parameters considering all of the observed data points that were assigned to the model.  Models without any data points are also sampled, but since they have no points assigned, the new samples are effectively taken from the prior distribution for model parameters.
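
A minimal sketch of what one such iteration might look like is below; the Model interface and its methods are hypothetical stand-ins for illustration, not Mahout's actual API. (The full algorithm also resamples the mixture probabilities each iteration, from a Dirichlet distribution informed by the per-model point counts; that step is omitted here.)

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class GibbsIterationSketch {

  /** Hypothetical model interface; not Mahout's actual API. */
  interface Model {
    double pdf(double[] x);                          // p(x | this model)
    Model samplePosterior(List<double[]> assigned);  // sample new parameters
  }

  static final Random RND = new Random();

  // One iteration: assign each point to a model in proportion to
  // mixture[k] * pdf(x), then resample each model from its posterior.
  // A model with no assigned points is effectively resampled from the prior.
  static void iterate(double[][] points, Model[] models, double[] mixture) {
    List<List<double[]>> assigned = new ArrayList<List<double[]>>();
    for (int k = 0; k < models.length; k++) {
      assigned.add(new ArrayList<double[]>());
    }
    for (double[] x : points) {
      double[] p = new double[models.length];
      double total = 0.0;
      for (int k = 0; k < models.length; k++) {
        p[k] = mixture[k] * models[k].pdf(x);
        total += p[k];
      }
      double r = RND.nextDouble() * total;  // pick k with probability p[k]/total
      int k = 0;
      while (k < models.length - 1 && (r -= p[k]) > 0.0) {
        k++;
      }
      assigned.get(k).add(x);
    }
    for (int k = 0; k < models.length; k++) {
      models[k] = models[k].samplePosterior(assigned.get(k));
    }
  }
}
{code}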

The result is a number of samples that represent mixing probabilities, models and assignments of points to models. If the total number of possible models is substantially larger than the number that ever have points assigned to them, then this algorithm provides a (nearly) non-parametric clustering algorithm. These samples can give us interesting information that is lacking from a normal clustering that consists of a single assignment of points to clusters. First, by examining the number of models in each sample that actually have any points assigned to them, we can get information about how many models (clusters) the data support. Moreover, by examining how often two points are assigned to the same model, we can get an approximate measure of how likely those points are to be explained by the same model. Such soft membership information is difficult to come by with conventional clustering methods.
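
As an illustration of the second measure, the following sketch (hypothetical code, not part of Mahout) computes how often each pair of points lands in the same model across the retained samples:

{code:java}
// assignments[s][i] is the model index that sample s assigned to point i.
// Returns freq[i][j]: the fraction of samples in which points i and j
// were assigned to the same model.
static double[][] coAssignmentFrequency(int[][] assignments, int numPoints) {
  double[][] freq = new double[numPoints][numPoints];
  for (int[] sample : assignments) {
    for (int i = 0; i < numPoints; i++) {
      for (int j = 0; j < numPoints; j++) {
        if (sample[i] == sample[j]) {
          freq[i][j]++;
        }
      }
    }
  }
  for (int i = 0; i < numPoints; i++) {
    for (int j = 0; j < numPoints; j++) {
      freq[i][j] /= assignments.length;
    }
  }
  return freq;
}
{code}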

Finally, we can get an idea of the stability of how the data can be described. Typically, aspects of the data with lots of support wind up with stable descriptions, while at the edges there are phenomena to which we can't really commit a solid description, even though it is clear that the well-supported explanations are insufficient to explain them. One thing that can be difficult about these samples is that we can't always establish a correspondence between the models in different samples. Probably the best way to do this is to look for overlap in the assignments of data observations to the different models.

h1. Examples

The following images illustrate three different prior models applied to a set of randomly-generated 2-d data points. The points are generated using normal distributions, each centered at a mean location with a constant standard deviation. See the README file at /examples/src/main/java/org/apache/mahout/clustering/dirichlet/README.txt for details on running similar examples.

The points are generated as follows (a sketch of equivalent generation code appears after the list):

* 40 samples m=\[1.0, 1.0] sd=3.0
* 30 samples m=\[1.0, 0.0] sd=0.5
* 30 samples m=\[0.0, 2.0] sd=0.1
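
A minimal sketch of equivalent generation code, assuming a symmetric normal generator per cluster (the class and method names here are illustrative, not taken from the Mahout examples):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SampleDataSketch {
  static final Random RND = new Random();

  // Draw 'count' 2-d points from a symmetric normal centered at (mx, my).
  static List<double[]> generate(int count, double mx, double my, double sd) {
    List<double[]> points = new ArrayList<double[]>();
    for (int i = 0; i < count; i++) {
      points.add(new double[] {mx + RND.nextGaussian() * sd,
                               my + RND.nextGaussian() * sd});
    }
    return points;
  }

  public static void main(String[] args) {
    List<double[]> data = new ArrayList<double[]>();
    data.addAll(generate(40, 1.0, 1.0, 3.0));
    data.addAll(generate(30, 1.0, 0.0, 0.5));
    data.addAll(generate(30, 0.0, 2.0, 0.1));
  }
}
{code}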

In the first image, the points are plotted and the 3-sigma boundaries of their generators are superimposed. It is, of course, impossible to tell which model actually generated each point, as every model has some probability - perhaps small - of having generated each point.

!DirichletData.png!

In the next image, the Dirichlet Process Clusterer is run against the sample points using a NormalModelDistribution with m=\[0.0, 0.0] sd=1.0. This distribution represents the least amount of prior information, as its sampled models all have constant parameters. The resulting significant models (representing > 5% of the population) are superimposed upon the sample data. Since all prior models are identical and their pdfs are the same, the first iteration's assignment of points to models is completely governed by the initial mixture values. Since these are also identical, the first iteration effectively assigns points to models at random. During subsequent iterations, the models diverge from the origin, but there is some over-fitting in the result.

!DirichletN.png!

The next image improves upon this situation by using a SampledNormalDistribution. In this distribution, the prior models have means sampled from a normal distribution and a constant sd=1, so the initial models are centered at different coordinates. During the first iteration, each model thus has a different pdf value for each point, and points are assigned to the more likely models accordingly. The result is a very good capture of the sample data parameters.

!DirichletSN.png!

The next image uses an AsymmetricSampledNormalDistribution, in which the model's standard deviation is also represented as a 2-d vector. This causes the clusters to assume elliptical shapes in the resulting clustering. This represents an incorrect prior assumption, but it is interesting that it fits the actual sample data quite well. Had we suspected the sample points were generated in a similar manner, this distribution would have been the most logical choice of model.

!DirichletASN.png!
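
To make the elliptical shapes concrete: with the standard deviation represented as a 2-d vector, the model's density is simply a product of independent 1-d normals, one per coordinate, which yields axis-aligned elliptical contours. A minimal sketch of such a pdf (illustrative, not Mahout's implementation):

{code:java}
// Density of a 2-d normal with independent coordinates: mean m[] and
// per-coordinate standard deviation sd[]. The equal-density contours are
// axis-aligned ellipses rather than circles.
static double pdf(double[] x, double[] m, double[] sd) {
  double p = 1.0;
  for (int d = 0; d < 2; d++) {
    double z = (x[d] - m[d]) / sd[d];
    p *= Math.exp(-0.5 * z * z) / (sd[d] * Math.sqrt(2.0 * Math.PI));
  }
  return p;
}
{code}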

In order to explore an asymmetrical sample data distribution, the next image shows a larger number of points generated with the parameters below. Again, the generators' 3-sigma ellipses are superimposed:

* 400 samples m=\[1.0, 1.0] sd=\[3.0, 1.0]
* 300 samples m=\[1.0, 0.0] sd=\[0.5, 1.0]
* 300 samples m=\[0.0, 2.0] sd=\[0.1, 0.5]

!2dDirichletData.png!

The following image shows the results of applying the symmetrical SampledNormalDistribution to the asymmetrically-generated sample data. It makes a valiant effort but does not capture a very good set of models, because the prior assumption is simply wrong.

!2dDirichletSN.png!

Finally, the AsymmetricSampledNormalDistribution is run against the asymmetrical sample data. Though there is some over-fitting, it does a credible job of capturing the underlying models. I suspect the model over-fitted because its prior assumption of sd=1 is too low given the sd values used to generate the sample data. Of course, this can only be suspected because I know the generating distributions; without that knowledge we would have to take the clustering analysis at face value. Nonetheless, if we suspected over-fitting we might run the analysis again with a different seed for the random number generator to see what other clusterings are possible from the data. We might like one of those even better than this one.

!2dDirichletASN.png!

h1. References

McCullagh and Yang: http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf

There is also a more approachable example in [Chris Bishop's book on machine learning|http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm]; chapter 9 contains the example of clustering using a mixture model.

The Neal and Blei references from the McCullagh and Yang paper are also good. Zoubin Ghahramani has some very [nice tutorials that describe why non-parametric Bayesian approaches to problems are very cool|http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf]; there are video versions available as well.


---------------------------------------------------------------------