Posted to dev@mahout.apache.org by Sebastien Bratieres <sb...@cam.ac.uk> on 2009/06/08 16:06:58 UTC

Mahout's Dirichlet Process Mixture Model implementation

Dear Mahout developers,

I am planning to contribute to the Dirichlet Process Clustering algorithm
implemented by Jeff (Eastman). I have read through the code in some detail and
have already discussed a couple of points with Jeff, so as not to make a mess
of things; that way I could understand how the code originated and where I
could contribute.

Here I'd like to announce what I would like to do, in order to prompt your
feedback, before I create a JIRA issue and make patches.

Essentially, I would like to repair those aspects of the code where the
algorithm is broken:
- most importantly, the parameter re-estimation step is currently a
maximum-likelihood re-estimation, so the algorithm is not guaranteed to
actually train, converge, or work at all. In order to do Gibbs sampling,
proper Bayesian re-estimation (sampling the parameters from their posterior)
is required; a rough sketch of what I mean follows after this list.
- the non-parametric aspect is lost by requiring the max number of clusters
and the alpha_0 parameter as inputs to the algorithm; the whole point of the
DPMM is that the number of clusters is not predefined but given by the data.
I'd like to fix this.
- the procedure for normalizing a vector of probabilities divides each
component by the maximum of the components instead of by their sum
- generalize the probability distributions in dimensionality and priors; e.g.
the normal seems limited to a 2-D prior with µ = (0,0) and sigma = 1
- change the terminology to "standard" machine learning terms and add a
mathematical description of what the algorithm is doing
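
To make the first point concrete, here is a rough sketch of the difference
between a maximum-likelihood re-estimate and a proper Gibbs step, for a 1-D
normal model with known variance and a conjugate normal prior on the mean
(toy code only -- class and variable names are made up for illustration, this
is not the current Mahout implementation):

import java.util.Random;

// Toy sketch: Bayesian re-estimation of a 1-D normal mean with known
// variance sigma^2 and a Normal(mu0, tau0^2) prior. A Gibbs step draws the
// mean from its posterior instead of taking the ML point estimate.
public class PosteriorSampleSketch {
  public static void main(String[] args) {
    double[] x = {1.2, 0.8, 1.5, 0.9, 1.1}; // observations assigned to this cluster
    double sigma2 = 1.0;                    // known observation variance
    double mu0 = 0.0, tau02 = 1.0;          // prior mean and variance
    int n = x.length;

    double sum = 0.0;
    for (double xi : x) {
      sum += xi;
    }

    // Maximum-likelihood estimate: just the sample mean.
    double muMl = sum / n;

    // Conjugate posterior: Normal(muN, tauN^2) with
    //   tauN^2 = 1 / (1/tau0^2 + n/sigma^2)
    //   muN    = tauN^2 * (mu0/tau0^2 + sum/sigma^2)
    double tauN2 = 1.0 / (1.0 / tau02 + n / sigma2);
    double muN = tauN2 * (mu0 / tau02 + sum / sigma2);

    // Gibbs step: sample the mean from its posterior rather than using muMl.
    Random rng = new Random(42);
    double muSample = muN + Math.sqrt(tauN2) * rng.nextGaussian();

    System.out.println("ML estimate = " + muMl + ", posterior draw = " + muSample);
  }
}

The same idea carries over to the full model parameters, with the posterior
determined by whichever conjugate prior the model uses.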

What do you think?

Thanks
Sebastien

Re: Mahout's Dirichlet Process Mixture Model implementation

Posted by Ted Dunning <te...@gmail.com>.
Fabulous!

Some details in-line.

On Mon, Jun 8, 2009 at 7:06 AM, Sebastien Bratieres <sb...@cam.ac.uk> wrote:

> - most importantly, the parameter re-estimation step is currently a
> maximum-likelihood re-estimation, so the algorithm is not guaranteed to
> actually train, converge, or work at all. In order to do Gibbs sampling,
> proper Bayesian re-estimation (sampling the parameters from their
> posterior) is required.
>

Thanks.  I had a suspicion about this, but never had time to check in
detail.


> - the non-parametric aspect is lost by requiring the max number of clusters
> and the alpha_0 parameter as inputs to the algorithm; the whole point of
> the DPMM is that the number of clusters is not predefined but given by the
> data. I'd like to fix this.
>

This is only partially true.  By setting the max number of clusters to a
large number, you get a good approximation to the truly non-parametric
solution.
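
To illustrate (toy code only -- nothing from the project, and the class name
is made up): in the truncated stick-breaking view of the DP, with the max
number of clusters K set large the leftover stick is negligible, so the
finite model behaves almost exactly like the truly non-parametric one.

import java.util.Random;

// Toy sketch of a truncated stick-breaking draw of mixture weights from
// DP(alpha0, G0). With a large truncation level K, the unbroken remainder of
// the stick is tiny, which is why a big fixed K is a good approximation.
public class StickBreakingSketch {
  public static void main(String[] args) {
    int k = 100;         // truncation level (the "max number of clusters")
    double alpha0 = 1.0; // concentration parameter
    Random rng = new Random(1);

    double[] pi = new double[k];
    double remaining = 1.0; // length of stick not yet broken off
    for (int i = 0; i < k - 1; i++) {
      double v = sampleBeta1(alpha0, rng); // v ~ Beta(1, alpha0)
      pi[i] = v * remaining;
      remaining *= (1.0 - v);
    }
    pi[k - 1] = remaining; // truncation: the last weight absorbs the rest

    System.out.println("pi[0] = " + pi[0] + ", pi[1] = " + pi[1]
        + ", leftover pi[" + (k - 1) + "] = " + pi[k - 1]);
  }

  // Beta(1, b) has CDF 1 - (1 - x)^b, so it can be sampled by inversion:
  // 1 - U^(1/b) with U uniform on (0, 1).
  static double sampleBeta1(double b, Random rng) {
    return 1.0 - Math.pow(rng.nextDouble(), 1.0 / b);
  }
}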

Setting alpha_0 is more of a hack.  It would be better to sample it to
provide a full Gibbs sampler, but this is not all that critical in many
situations since we can also do MAP estimation of a good value for alpha_0
and use that.


> - the procedure for normalizing a vector of probabilities divides each
> component by the maximum of the components instead of by their sum
>

Good point.
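
For the record, the fix is simple -- something like this (illustrative
snippet only, names made up):

// Illustrative only: normalize a vector of probabilities by the sum of its
// components so that the result is a proper distribution summing to 1.
public class NormalizeSketch {
  static double[] normalize(double[] p) {
    double total = 0.0;
    for (double v : p) {
      total += v;
    }
    double[] out = new double[p.length];
    for (int i = 0; i < p.length; i++) {
      out[i] = p[i] / total; // dividing by max(p) would not give a distribution
    }
    return out;
  }

  public static void main(String[] args) {
    double[] q = normalize(new double[] {2.0, 5.0, 3.0});
    System.out.println(q[0] + " " + q[1] + " " + q[2]); // 0.2 0.5 0.3
  }
}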


> - generalize the probability distributions in dimensionality and priors;
> e.g. the normal seems limited to a 2-D prior with µ = (0,0) and sigma = 1
>

Also a good idea.  I had thought that the normal distribution was only an
example; if not, it should definitely be generalized.  Commons Math has the
beginnings of a good abstraction for distributions, and I would recommend
working with it.
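
As a purely hypothetical sketch of the kind of abstraction I mean (these
names are made up -- this is not the current Mahout code and not the Commons
Math API):

// Hypothetical interface, for illustration only: a cluster model that is
// generic in the observation type, dimensionality, and prior, rather than
// being hard-wired to a 2-D normal with mu = (0,0) and sigma = 1.
public interface ClusterModel<O> {
  /** Probability density of an observation under the current parameters. */
  double pdf(O observation);

  /** Accumulate an observation assigned to this model during a Gibbs sweep. */
  void observe(O observation);

  /** Draw new parameters from the posterior given the observations and the prior. */
  void sampleFromPosterior();
}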


> - change the terminology to "standard" machine learning terms and add a
> mathematical description of what the algorithm is doing
>

Fabulous.


> What do you think?
>

Go for it.

The best next step is to file several JIRA issues and start adding patches.
Mahout moves very fast, so good patches will get constructive criticism or be
committed very quickly.



-- 
Ted Dunning, CTO
DeepDyve

Re: Mahout's Dirichlet Process Mixture Model implementation

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I had a nice Skype chat with Sebastien last week and he has my
wholehearted support on these improvements. The sampleFromPosterior
implementation was a totally naive hack, though it was still an improvement
over always sampling from the prior. The fact that the current
implementation converges so nicely in the example gave me a false sense
of security.

The a priori specification of the number of clusters bothers me less,
since setting it to a larger number than the expected number of models
is a reasonable approximation.

Open up some JIRA issues on these points and I will try to stay out of your
critical path,

Jeff
