Posted to dev@mahout.apache.org by "Isabel Drost (JIRA)" <ji...@apache.org> on 2008/04/08 16:19:25 UTC

[jira] Created: (MAHOUT-30) dirichlet process implementation

dirichlet process implementation
--------------------------------

                 Key: MAHOUT-30
                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
            Reporter: Isabel Drost



Copied over from original issue:

> Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: MAHOUT-30.patch

Here's a work-in-progress Dirichlet Process Clustering algorithm that Ted Dunning has been coaching me to write. It originated from an R prototype which he wrote and which I translated into Java using our nascent Vector package. Then Ted took the Java and refactored it to introduce better abstractions, as the R implementation had no objects and my reverse-engineered abstractions were rather clunky. Finally, I got the new implementation working with a pluggable distributions framework based either upon commons-math (+ Ted's patches thereto) or the blog-0.2 framework (vanilla).

I am posting this to the list in hopes of generating more interest from the larger Mahout community. It has taken me literally months to wrap my mind around this approach. Enjoy.

To run this patch you will need to get the blog package at http://people.csail.mit.edu/milch/blog. Here is the beginning of the README file that came with the distribution:
=====
Bayesian Logic (BLOG) Inference Engine version 0.2

Copyright (c) 2007, Massachusetts Institute of Technology
Copyright (c) 2005, 2006, Regents of the University of California
All rights reserved.  This software is distributed under the license 
included in LICENSE.txt.

Lead author: Brian Milch, milch@csail.mit.edu
Supervisors: Prof. Stuart Russell (Berkeley), Prof. Leslie Kaelbling (MIT)
Contributors: Bhaskara Marthi, Andrey Kolobov, David Sontag, Daniel L. Ong,
    Brendan Clark
=====

Jeff



[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646790#action_12646790 ] 

Jeff Eastman commented on MAHOUT-30:
------------------------------------

I did some refactoring to better localize the major processing steps in DirichletCluster. The first refactoring was to create an iterate method that handles each iteration after the initialization phase has initialized the models, the Dirichlet distribution and the mixture. Then I factored out the three sub-processing steps: assignPointsToModels, recomputeModels and computeNewMixtureParameters, and moved some of the instance variables to locals. Looking at the result, it is starting to feel more like the other clustering algorithms, so I'm starting to think in terms of an M/R partitioning.

* In assignPointsToModels, we are iterating over all of the points, calculating their pdf with respect to the current clusters (models), applying the mixture probabilities and then using a multinomial sampling to assign them to a single cluster for this iteration. In a subsequent iteration we might assign them to different models, depending upon the probabilities.
* In recomputeModels, we use the model assignments from this iteration to compute the posterior samples for each of the models. Then we generate a new model that best describes this sub-sample. The ah-ha for me here is that we create entirely new models in each iteration.
* In iterate, we periodically output the models so we can gain whatever wisdom we wish from their parameters. I didn't factor that out but may later.
* In computeNewMixtureParameters, we use the same model assignments to compute a new alpha vector for the Dirichlet distribution and a new mixture vector too. Then we iterate over the points again, continually refining these values. The ah-ha for me here is that we do not actually remember the model assignments. _I think, after the models have all been computed, we could run another pass over the points computing the pdf for each cluster as a probability vector. We do this in Kmeans too._

The first step seems very similar to the KmeansMapper: we read the Dirichlet, mixture and the current models into each mapper, then compute the multinomial and output the point keyed by its model index to the reducer. In a single reducer, all of the points are seen, so we can calculate the posterior statistics if we switch the model to use a running-sums (s0, s1, s2) algorithm for the standard deviation. Following this approach, we might be able to use a combiner to maintain the running sums (assuming Hadoop only runs it once for now), thus reducing the heavy load on the poor reducer. With a single reducer, I'm pretty sure we can compute the new mixture and Dirichlet parameters. With multiple reducers, I think those calculations would need help combining the reducer values in the driver.
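The running-sums idea could be sketched like this. RunningSums is a hypothetical helper, not part of the patch, and works in one dimension for clarity (the patch's models use Vectors):

```java
// Hypothetical sketch of the running-sums (s0, s1, s2) idea: a combiner or
// reducer maintains just three values per cluster and can recover the mean
// and standard deviation without holding all of the points in memory.
public class RunningSums {
  private double s0 = 0; // count of observations
  private double s1 = 0; // sum of observations
  private double s2 = 0; // sum of squared observations

  public void observe(double x) {
    s0++;
    s1 += x;
    s2 += x * x;
  }

  // Partial sums from separate scans (e.g. combiner outputs) simply add.
  public void merge(RunningSums other) {
    s0 += other.s0;
    s1 += other.s1;
    s2 += other.s2;
  }

  public double mean() {
    return s1 / s0;
  }

  public double std() {
    return Math.sqrt(s2 / s0 - mean() * mean());
  }
}
```

Because merge() is just addition, the combiner and reducer can apply the same operation in any grouping, which is what makes the M/R partitioning plausible.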

Ted, I know you've thought about this a lot, do these steps sound reasonable?


{code:title=DirichletCluster.java}
 /**
   * Perform one iteration of the clustering process, updating the Dirichlet distribution
   * and returning the new mixtures
   * 
   * @param models a List<Model<Vector>> of models
   * @param dir a DirichletDistribution that will be modified in each iteration
   * @param mixture a Vector mixture 
   * @param clusterSamples a List<List<Model<Vector>>> that will be modified in each iteration
   * @param iteration the int iteration number
   * @return a new mixture Vector
   */
  private Vector iterate(List<Model<Vector>> models, DirichletDistribution dir,
      Vector mixture, List<List<Model<Vector>>> clusterSamples, int iteration) {
    // z holds the assignment of cluster numbers to points for this iteration
    Vector z = assignPointsToModels(models, mixture);
    models = recomputeModels(z);

    // periodically add models to cluster samples after getting started
    if ((iteration > burnin) && (iteration % thin == 0))
      clusterSamples.add(models);

    // compute and return the new mixture parameters (adjusting
    // the Dirichlet dir's alpha vector in the process)
    return computeNewMixtureParameters(dir, z);
  }

  /**
   * Assign all points to a model based upon the current mixture parameters and
   * the probability that each point is described by each model's pdf.
   * 
   * @param models the current List<Model<Vector>> of models
   * @param mixture the Vector of mixture probabilities
   * @return z the Vector of assignment of points to models
   */
  private Vector assignPointsToModels(List<Model<Vector>> models, Vector mixture) {
    // z holds the assignment of cluster numbers to points for this iteration
    Vector z = new DenseVector(sampleData.size());

    Vector pi = new DenseVector(maxClusters);
    for (int ix = 0; ix < sampleData.size(); ix++) {
      Vector x = sampleData.get(ix);
      // compute probabilities for each cluster, saving the largest
      double max = 0;
      for (int k = 0; k < maxClusters; k++) {
        double p = mixture.get(k) * models.get(k).pdf(x);
        pi.set(k, p);
        if (max < p)
          max = p;
      }
      // normalize the probabilities by the largest observed value
      pi.assign(new TimesFunction(), 1.0 / max);
      // then pick one cluster by sampling a multinomial based upon the probabilities
      int model = dist.rmultinom(pi);
      z.set(ix, model);
    }
    return z;
  }

  /**
   * Recompute new models based upon the assignment of points to models and return them
   * 
   * @param z the Vector of assignment of points to models
   * @return the List<Model<Vector>> of models
   */
  private List<Model<Vector>> recomputeModels(Vector z) {
    List<Model<Vector>> newModels = new ArrayList<Model<Vector>>();
    for (int k = 0; k < maxClusters; k++) {
      // collect all data assigned to each cluster
      List<Vector> data = new ArrayList<Vector>();
      for (int i = 0; i < sampleData.size(); i++)
        if (z.get(i) == k)
          data.add(sampleData.get(i));
      // add a new posterior model if any data else use a prior model
      if (data.size() > 0)
        newModels.add(modelFactory.sampleFromPosterior(data));
      else
        newModels.add(modelFactory.sampleFromPrior());
    }
    return newModels;
  }

  /**
   * Compute a new mixture based upon the Dirichlet distribution and the assignment of points to models.
   * 
   * @param dir the DirichletDistribution, which is modified during this process
   * @param z the Vector of assignment of points to models
   * @return the new mixture Vector
   */
  private Vector computeNewMixtureParameters(DirichletDistribution dir, Vector z) {
    Vector mixture;
    Vector a = new DenseVector(maxClusters);
    a.assign(alpha_0 / maxClusters);
    for (int i = 0; i < z.size(); i++) {
      int z_i = (int) z.get(i);
      a.set(z_i, a.get(z_i) + 1);
    }
    dir.setAlpha(a);
    mixture = dir.sample();
    return mixture;
  }
{code}



[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: MAHOUT-30f.patch

Final patch file is ready to commit. We still need to add an entry to the pom for the gson jar, which is currently in core/lib.



[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: jeastman.vcf

Hi Isabel,

I'm so happy you had time to look through the code. Getting it to this 
point was a great ordeal for me as the math is complicated and I have no 
formal statistics background. Ted's help was critical in getting me to 
the tipping point where I now understand the implementation well enough 
to make progress on my own. I'm getting ready for a week vacation and 
will not have email but would love to continue this dialog and am very 
open to your suggestions below. See more comments therein.

Jeff

I was thinking of moving the display code into the examples directory.
I did that so Ted could use his favorite library but he has not been 
pursuing it. I'm happy with blog and, as commons-math does not have the 
needed sampling methods without Ted's patches, I suggest we go with 
blog. Removing the pluggability would clean up the code some too.
? Does this relate to maven?
Boy, I would too. Especially if it was clear enough that I could 
understand it :)
Some of those were terms Ted introduced from my original port of his R 
example. I'm not hung up but perhaps we should include him in the 
discussion?





[jira] Assigned: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman reassigned MAHOUT-30:
----------------------------------

    Assignee: Jeff Eastman



[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: screenshot-1.jpg

    generateSamples(500, 0, 0, 0.5);
    generateSamples(500, 2, 0, 0.2);
    generateSamples(500, 0, 2, 0.3);
    generateSamples(500, 2, 2, 1);
    DirichletDriver.runJob("input", "output",
        "org.apache.mahout.clustering.dirichlet.SampledNormalDistribution", 20, 15, 1.0, 2);




[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment:     (was: dirichlet.tar)



[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659273#action_12659273 ] 

Ted Dunning commented on MAHOUT-30:
-----------------------------------


Some references include a relatively dense article by McCullagh and Yang:

http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf

There is also a more approachable example in Chris Bishop's book on Machine Learning.  See http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm.  I think that chapter 9 is where the example of clustering using a mixture model is found.

The Neal and Blei references from the McCullagh and Yang paper are also good.  Zoubin Ghahramani has some very nice tutorials out which describe why non-parametric Bayesian approaches to problems are very cool.  One is at http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf, but there are video versions available as well.



[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659283#action_12659283 ] 

Isabel Drost commented on MAHOUT-30:
------------------------------------

Ted: I am moving the references over into the wiki page. Hope that is fine with you?



[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647880#action_12647880 ] 

Jeff Eastman commented on MAHOUT-30:
------------------------------------

The attached patch makes several improvements to the earlier version:
* refactored state updating and cluster sampling into DirichletState
* refactored creating list of models into ModelDistribution
* refactored state parameters from DirichletCluster to DirichletState
* refactored count into the model
* changed List<Model> to Model[]
* added significance filtering to print out
* increased number of iterations to 30 to demonstrate better convergence

The algorithm now produces the following output when run over 10,000 points:
* Using fixed random seed for repeatability.
* testDirichletCluster10000
* Generating 4000 samples m=[1.0, 1.0] sd=3.0
* Generating 3000 samples m=[1.0, 0.0] sd=0.1
* Generating 3000 samples m=[0.0, 1.0] sd=0.1
* sample[0]= normal(n=4037 m=[0.80, 0.73] sd=1.40), normal(n=3844 m=[0.51, 0.51] sd=0.68), normal(n=1092 m=[0.51, 0.47] sd=0.53), normal(n=794 m=[1.26, 1.60] sd=2.22), 
* sample[1]= normal(n=4562 m=[0.72, 0.68] sd=1.25), normal(n=2992 m=[0.48, 0.52] sd=0.58), normal(n=1022 m=[0.67, 0.31] sd=0.53), normal(n=1227 m=[1.17, 1.41] sd=2.13), 
* sample[2]= normal(n=4377 m=[0.66, 0.61] sd=1.08), normal(n=2592 m=[0.28, 0.71] sd=0.51), normal(n=1057 m=[1.04, -0.06] sd=0.25), normal(n=1831 m=[1.15, 1.26] sd=2.05), 
* sample[3]= normal(n=4302 m=[0.74, 0.36] sd=0.80), normal(n=2075 m=[-0.00, 1.01] sd=0.32), normal(n=793 m=[1.04, -0.05] sd=0.20), normal(n=2694 m=[1.04, 1.17] sd=1.93), 
* sample[4]= normal(n=3602 m=[0.80, 0.21] sd=0.58), normal(n=1923 m=[-0.05, 1.05] sd=0.26), normal(n=621 m=[1.03, -0.06] sd=0.19), normal(n=3677 m=[0.94, 1.09] sd=1.77), 




[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: MAHOUT-30d.patch

This patch moves the display-related classes into the examples subtree and removes the dependency upon commons-math, which needs some extra patches to the release and is TBD. It still depends upon blog-0.3, which I have only been able to find in source format. I will try to locate or build a suitable jar for that. I believe that, since it is under the BSD license, we are OK, but I'd like confirmation. The blog site is at http://people.csail.mit.edu/milch/blog/software.html.



Re: [jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Ted,

Indeed, it was precisely all that sampling that confounded me for so 
long, especially in untyped R. All the other clustering algorithms can 
be thought of as sampling too, but their pdfs ==1 for the model chosen. 
I think it was actually a conversation with a statistics guy at Yahoo! 
when I gave a Mahout intro last summer that got me thinking outside of 
that box. He noted that, for large data sets, it is really not necessary 
to process all the points to get meaningful clusters; just to sample 
from them. That took a few months to really sink in and then the ah-ha 
happened. I think I did a posting to this list at that point. Your 
refactoring of my initial abstractions cemented the deal :).

If I continue down the path of using running sums to compute the new 
model parameters, I think I can eliminate materializing the set of 
points that are assigned to each model in recomputeModels(). I need to 
add an observe() method to the Model interface and do some more 
refactoring of ModelDistribution and it is all a little half-baked right 
now. I'll post that to Jira if I get it working, but the basic idea 
would be to create a new set of prior models in the 
assignPointsToModels() method, ask the assigned model to observe() as I 
iterate through the points, and then just compute posterior parameters 
in recomputeModels. Of course, I'll have to figure out how to compute 
the new mixtures differently, without z, but I have some ideas.

I'll keep you all posted,
Jeff




[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646967#action_12646967 ] 

Ted Dunning commented on MAHOUT-30:
-----------------------------------

Jeff,

These look like really nice refactorings.  The process is nice and clear.

The only key trick that may confuse people is that each step is a sampling.  Thus assignment to clusters does NOT assign to the best cluster, it picks a cluster at random, biased by the mixture parameters and model pdf's.  Likewise, model computation does NOT compute the best model, it samples from the distribution given by the data.  Same is true for the mixture parameters.

Your code does this.  I just think that this is a hard point for people to understand in these techniques. 
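The rmultinom call in the patch comes from the blog library; a minimal equivalent, sampling an index from an unnormalized weight vector, might look like this sketch (not the library's actual code):

```java
import java.util.Random;

// Sketch of multinomial sampling over unnormalized weights: index k is chosen
// with probability weights[k] / sum(weights), so the assignment is biased by,
// but not determined by, the largest mixture * pdf value.
public class Multinomial {
  private static final Random RANDOM = new Random();

  public static int sample(double[] weights) {
    double total = 0;
    for (double w : weights) {
      total += w;
    }
    double r = RANDOM.nextDouble() * total;
    for (int k = 0; k < weights.length; k++) {
      r -= weights[k];
      if (r < 0) {
        return k;
      }
    }
    // rounding guard: fall back to the last index with nonzero weight
    for (int k = weights.length - 1; k >= 0; k--) {
      if (weights[k] > 0) {
        return k;
      }
    }
    return 0;
  }
}
```

Because the draw is proportional to the weights, a point near one cluster's mean will usually, but not always, be assigned to that cluster, which is exactly the "each step is a sampling" behavior described above.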



[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647186#action_12647186 ] 

Jeff Eastman commented on MAHOUT-30:
------------------------------------

I refactored again and was able to eliminate materializing the posterior {{data}} sets by adding {{observe()}} and {{computeParameters()}} operations to {{Model}}. The idea is that all models begin in their prior state and are asked to observe each sample that is assigned to them. Then, before {{pdf()}} is called on them in the next iteration, a call to {{computeParameters()}} finalizes the parameters once and turns the model into a posterior model. I also compute {{counts}} on the fly to eliminate materializing {{z}} altogether. I hope I didn't throw the baby out with the bath water.

Finally, I introduced a {{DirichletState}} bean to hold the models, the Dirichlet distribution and the mixture, simplifying the arguments and, I think, fixing a bug in the earlier refactoring. The algorithm runs over 10,000 points and produces the following outputs (prior() indicates a model with no observations, n is the number of observations, m the mean and sd the standard deviation):

Generating 4000 samples m=[1.0, 1.0] sd=3.0
Generating 3000 samples m=[1.0, 0.0] sd=0.1
Generating 3000 samples m=[0.0, 1.0] sd=0.1

* sample[0]= [prior(), normal(n=6604 m=[0.67, 0.63] sd=1.11), normal(n=86 m=[0.77, 2.81] sd=2.15), prior(), normal(n=242 m=[2.89, 1.67] sd=2.14), normal(n=2532 m=[0.53, 0.55] sd=0.69), normal(n=339 m=[0.99, 1.70] sd=2.18), normal(n=77 m=[0.53, 0.47] sd=0.51), normal(n=119 m=[0.36, 0.47] sd=2.85), normal(n=1 m=[0.00, 0.00] sd=0.33)]
* sample[1]= [prior(), normal(n=6626 m=[0.62, 0.54] sd=0.91), normal(n=137 m=[0.51, 2.99] sd=1.56), normal(n=2 m=[0.57, 0.25] sd=0.70), normal(n=506 m=[2.55, 0.93] sd=1.73), normal(n=1573 m=[0.38, 0.60] sd=0.50), normal(n=848 m=[0.81, 1.59] sd=2.11), normal(n=67 m=[0.76, 0.31] sd=0.45), normal(n=240 m=[0.73, 0.31] sd=2.24), normal(n=1 m=[0.00, 0.00] sd=0.98)]
* sample[2]= [prior(), normal(n=5842 m=[0.67, 0.39] sd=0.73), normal(n=157 m=[0.73, 3.12] sd=1.14), prior(), normal(n=655 m=[2.32, 0.64] sd=1.60), normal(n=1439 m=[0.00, 1.00] sd=0.33), normal(n=1439 m=[0.78, 1.53] sd=1.89), normal(n=66 m=[0.96, -0.04] sd=0.24), normal(n=399 m=[0.63, -0.03] sd=1.99), normal(n=3 m=[-0.07, 0.76] sd=0.41)]

{code:title=Model}
/**
 * A model is a probability distribution over observed data points and allows 
 * the probability of any data point to be computed.
 */
public interface Model<Observation> {
  
  /**
   * Observe the given observation, retaining information about it
   * 
   * @param x an Observation from the posterior
   */
  public abstract void observe(Observation x);
  
  /**
   * Compute a new set of posterior parameters based upon the Observations 
   * that have been observed since my creation
   */
  public abstract void computeParameters();

  /**
   * Return the probability that the observation is described by this model
   * 
   * @param x an Observation from the posterior
   * @return the probability that this model generated x
   */
  public abstract double pdf(Observation x);
}
{code}
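To make the {{Model}} lifecycle above concrete, here is a minimal, self-contained 1-d sketch (class and field names are hypothetical, not the actual patch classes): the model starts as a prior, accumulates sufficient statistics in observe(), and only becomes a posterior when computeParameters() is called, matching the iteration order described above.

```java
// Hypothetical 1-d illustration of the Model lifecycle: a model starts as a
// prior, observes samples, then computeParameters() turns it into a posterior
// before pdf() is consulted in the next iteration.
public class NormalModelSketch {
  private double sum = 0;    // running sum of observations
  private double sumSq = 0;  // running sum of squared observations
  private int n = 0;         // observation count
  private double mean = 0;   // prior mean
  private double sd = 1;     // prior standard deviation

  public void observe(double x) {
    sum += x;
    sumSq += x * x;
    n++;
  }

  public void computeParameters() {
    if (n == 0)
      return; // no observations: remain a prior model
    mean = sum / n;
    // guard the single-sample case so sd never collapses to 0 (a zero sd
    // would make pdf() return NaN for every point)
    double var = n > 1 ? (sumSq - n * mean * mean) / (n - 1) : 1.0;
    sd = Math.sqrt(Math.max(var, 1e-6));
  }

  public double pdf(double x) {
    double d = (x - mean) / sd;
    return Math.exp(-0.5 * d * d) / (sd * Math.sqrt(2 * Math.PI));
  }

  public static void main(String[] args) {
    NormalModelSketch m = new NormalModelSketch();
    for (double x : new double[] { 0.9, 1.0, 1.1 })
      m.observe(x);
    m.computeParameters();
    System.out.println(Math.round(m.mean * 1000) / 1000.0); // 1.0
    // the posterior concentrates near the observed mean
    System.out.println(m.pdf(1.0) > m.pdf(5.0)); // true
  }
}
```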

{code:title=DirichletCluster}
  /**
   * Initialize the variables and run the iterations to assign the sample data
   * points to a computed number of clusters
   *
   * @return a List<List<Model<Observation>>> of the observed models
   */
  public List<List<Model<Observation>>> dirichletCluster() {
    DirichletState<Observation> state = initializeState();

    // create a posterior sample list to collect results
    List<List<Model<Observation>>> clusterSamples = new ArrayList<List<Model<Observation>>>();

    // now iterate
    for (int iteration = 0; iteration < maxIterations; iteration++)
      iterate(state, iteration, clusterSamples);

    return clusterSamples;
  }

  /**
   * Initialize the state of the computation
   * 
   * @return the DirichletState
   */
  private DirichletState<Observation> initializeState() {
    // get initial prior models
    List<Model<Observation>> models = createPriorModels();
    // create the initial distribution.
    DirichletDistribution distribution = new DirichletDistribution(maxClusters,
        alpha_0, dist);
    // mixture parameters are sampled from the Dirichlet distribution. 
    Vector mixture = distribution.sample();
    return new DirichletState<Observation>(models, distribution, mixture);
  }

  /**
   * Create a list of prior models
   * @return the List of prior Models, one per cluster
   */
  private List<Model<Observation>> createPriorModels() {
    List<Model<Observation>> models = new ArrayList<Model<Observation>>();
    for (int k = 0; k < maxClusters; k++) {
      models.add(modelFactory.sampleFromPrior());
    }
    return models;
  }

  /**
   * Perform one iteration of the clustering process, updating the state for the next iteration
   * @param state the DirichletState<Observation> of this iteration
   * @param iteration the int iteration number
   * @param clusterSamples a List<List<Model<Observation>>> that will be modified in each iteration
   */
  private void iterate(DirichletState<Observation> state, int iteration,
      List<List<Model<Observation>>> clusterSamples) {

    // create new prior models
    List<Model<Observation>> newModels = createPriorModels();

    // initialize vector of membership counts
    Vector counts = new DenseVector(maxClusters);
    counts.assign(alpha_0 / maxClusters);

    // iterate over the samples
    for (int i = 0; i < sampleData.size(); i++) {
      Observation x = sampleData.get(i);
      // compute vector of probabilities x is described by each model
      Vector pi = computeProbabilities(state, x);
      // then pick one cluster by sampling a Multinomial distribution based upon them
      // see: http://en.wikipedia.org/wiki/Multinomial_distribution
      int model = dist.rmultinom(pi);
      // ask the selected model to observe the datum
      newModels.get(model).observe(x);
      // record counts for the model
      counts.set(model, counts.get(model) + 1);
    }

    // compute new model parameters based upon observations
    for (int k = 0; k < maxClusters; k++)
      newModels.get(k).computeParameters();

    // update the state from the new models and counts
    state.distribution.setAlpha(counts);
    state.mixture = state.distribution.sample();
    state.models = newModels;

    // periodically add models to cluster samples after getting started
    if ((iteration > burnin) && (iteration % thin == 0))
      clusterSamples.add(state.models);
  }

  /**
   * Compute a normalized vector of probabilities that x is described
   * by each model using the mixture and the model pdfs
   * 
   * @param state the DirichletState<Observation> of this iteration
   * @param x an Observation
   * @return the Vector of probabilities
   */
  private Vector computeProbabilities(DirichletState<Observation> state,
      Observation x) {
    Vector pi = new DenseVector(maxClusters);
    double max = 0;
    for (int k = 0; k < maxClusters; k++) {
      double p = state.mixture.get(k) * state.models.get(k).pdf(x);
      pi.set(k, p);
      if (max < p)
        max = p;
    }
    // scale by the largest observed value; this conditions the weights for
    // the multinomial sampler without requiring them to sum to 1
    if (max > 0)
      pi.assign(new TimesFunction(), 1.0 / max);
    return pi;
  }
{code}
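For reference, the multinomial draw in iterate() above ({{dist.rmultinom(pi)}}) just picks one cluster index with probability proportional to the unnormalized entries of pi, which is why scaling pi by its largest value is sufficient. A hedged sketch of such a sampler (names hypothetical; the patch uses its pluggable distribution framework instead):

```java
import java.util.Random;

public class MultinomialSketch {
  // Draw one index k with probability pi[k] / sum(pi); pi need not sum to 1.
  static int rmultinom(double[] pi, Random rng) {
    double total = 0;
    for (double p : pi)
      total += p;
    double u = rng.nextDouble() * total; // uniform point in [0, total)
    double cum = 0;
    for (int k = 0; k < pi.length; k++) {
      cum += pi[k];
      if (u < cum)
        return k;
    }
    return pi.length - 1; // guard against floating-point rounding
  }

  public static void main(String[] args) {
    Random rng = new Random(42);
    double[] pi = { 0.1, 0.6, 0.3 };
    int[] counts = new int[3];
    for (int i = 0; i < 10000; i++)
      counts[rmultinom(pi, rng)]++;
    // empirical frequencies should roughly match the 0.1 / 0.6 / 0.3 weights
    System.out.println(counts[1] > counts[2] && counts[2] > counts[0]);
  }
}
```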


> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>         Attachments: MAHOUT-30.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment:     (was: jeastman.vcf)

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (MAHOUT-30) dirichlet process implementation

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659273#action_12659273 ] 

tdunning edited comment on MAHOUT-30 at 12/26/08 1:57 PM:
-------------------------------------------------------------

One good reference is this relatively dense article by McCullagh and Yang:

http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf

There is also a more approachable example in Chris Bishop's book on Machine Learning.  See http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm.  I think that chapter 9 is where the example of clustering using a mixture model is found.

The Neal and Blei references from the McCullagh and Yang paper are also good.  Zoubin Ghahramani has some very nice tutorials out which describe why non-parametric Bayesian approaches to problems are very cool.  One is at http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf but there are video versions available as well.

      was (Author: tdunning):
    
Some references include a relatively dense article by McCullagh and Yang:

http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf

There is also a more approachable example in Chris Bishop's book on Machine Learning.  See http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm.  I think that chapter 9 is where the example of clustering using a mixture model is found.

The Neal and Blei references from the McCullagh and Yang paper are also good.  Zoubin Ghahramani has some very nice tutorials out which describe why non-parametric Bayesian approaches to problems are very cool.  One is at http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf but there are video versions available as well.
  
> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: jeastman.vcf, MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: MAHOUT-30e.patch

Here's the patch file 

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch, MAHOUT-30d.patch, MAHOUT-30e.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659260#action_12659260 ] 

Ted Dunning commented on MAHOUT-30:
-----------------------------------


Regarding the question of whether something should be called a Distribution or a Sampler, the mathematical terminology is that a distribution is something you can sample, so the Distribution terminology would be most compatible that way.  The fact that only one method is currently defined is likely a temporary thing ... other methods could well be required for later efforts.


On Fri, Dec 26, 2008 at 10:53 AM, Jeff Eastman (JIRA) <ji...@apache.org> wrote:

    ... Some of those were terms Ted introduced from my original port of his R
    example. I'm not hung up but perhaps we should include him in the
    discussion?


> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: jeastman.vcf, MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: dirichlet-2.tar

- fixed bug in rBeta where arguments to rGamma were backwards
- fixed model std calculations to handle the single-sample case (s0=1 => std==0.0, which made the resulting pdf==NaN and broke all subsequent cluster assignments)
- added cluster history display to examples with colors to illustrate cluster convergence

This is behaving pretty nicely now
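For anyone chasing similar issues: the rBeta fix above relies on the standard construction Beta(a, b) = X / (X + Y) with X ~ Gamma(a) and Y ~ Gamma(b), so swapping the Gamma shape arguments silently produces Beta(b, a). A self-contained sketch (using the Marsaglia-Tsang Gamma sampler, valid for shape >= 1; method names are illustrative, not the patch's API):

```java
import java.util.Random;

public class RBetaSketch {
  // Marsaglia-Tsang sampler for Gamma(shape, 1), shape >= 1
  static double rGamma(double shape, Random rng) {
    double d = shape - 1.0 / 3.0;
    double c = 1.0 / Math.sqrt(9.0 * d);
    while (true) {
      double x = rng.nextGaussian();
      double v = Math.pow(1.0 + c * x, 3);
      if (v <= 0)
        continue;
      double u = rng.nextDouble();
      if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v))
        return d * v;
    }
  }

  // Beta(a, b) via two Gamma draws: X / (X + Y), X ~ Gamma(a), Y ~ Gamma(b).
  // Swapping a and b here (the bug described above) silently yields Beta(b, a).
  static double rBeta(double a, double b, Random rng) {
    double x = rGamma(a, rng);
    double y = rGamma(b, rng);
    return x / (x + y);
  }

  public static void main(String[] args) {
    Random rng = new Random(1);
    double sum = 0;
    int n = 100000;
    for (int i = 0; i < n; i++)
      sum += rBeta(8, 2, rng);
    // E[Beta(8, 2)] = 8 / (8 + 2) = 0.8; with the arguments swapped it would be 0.2
    System.out.println(Math.abs(sum / n - 0.8) < 0.01);
  }
}
```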

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: dirichlet-2.tar, dirichlet.tar, MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch, MAHOUT-30d.patch, MAHOUT-30e.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed revision 754797. The committed code was refactored slightly from the last patch (models moved from test to core), and the pom was updated to reflect the Gson dependency.

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: dirichlet-2.tar, MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch, MAHOUT-30d.patch, MAHOUT-30e.patch, MAHOUT-30f.patch, screenshot-1.jpg
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659375#action_12659375 ] 

Ted Dunning commented on MAHOUT-30:
-----------------------------------


That would be great.  The wiki is really a better place.


> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: jeastman.vcf, MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: MAHOUT-30c.patch

This patch fixes a randomization problem caused by using two Random instances and adds several display tests that were used to generate the examples in the wiki.

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659218#action_12659218 ] 

Isabel Drost commented on MAHOUT-30:
------------------------------------

First of all: Great work, Jeff. I finally found some time to have a closer look at the code. To me it already looks pretty clear and easy to understand. Some minor comments:

I did not have a close look at the code for displaying the clustering process so far. If it is to be retained in the final version it might be a good idea to move that into its own package?

I was wondering why you wrapped two math packages (blog and commons-math). Maybe it helps if you briefly name the advantages and shortcomings of each? I was missing a pointer to Ted's patch to commons-math. Maybe I just overlooked it?

In the patch I am missing changes to the dependencies of the pom.xml. I guess you would check the libraries the patch depends on into our libs directory? At first sight the licenses of BLOG as well as its dependencies seem fine; has anyone else verified this?

Personally, I would love to see the mathematical formulae behind the implementation in the docs as well, maybe with a pointer to a book chapter, publication or other source that explains the algorithm in more detail.

Looking through the code, I found it a little irritating to have the classes ModelDistribution, NormalModelDistribution, and DirichletDistribution around. As the only method in the ModelDistribution interface is sampleFromPrior, it might be clearer if it were named ModelSampler? But maybe it is just the time of day I looked at it...



> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: dirichlet.tar

This file contains a tar of a standalone Eclipse project named Dirichlet Cluster. The directory structure is self-contained and only depends upon the Mahout project. It uses the Gson beta1.3 jar file for serialization/deserialization and that is included. 

This version contains an initial Hadoop implementation that has been tested through 10 iterations. In each iteration, clusters are read from the previous iteration (in the state-i directories) and points are assigned to clusters in the Mapper. The reducer then observes all points for each cluster and computes new clusters which are output for the subsequent iteration. The unit test TestMapReduce.testDriverMRIterations() creates 400 data points and runs the MR Driver. Then it gathers all of the state files and summarizes them on the console.

I noticed when I replaced the beta distribution code earlier that the clustering now tends to put everything into the same cluster. I'm suspicious about the beta values that are being computed and need to investigate this further. 

I think this design will allow an arbitrary number of Mappers and Reducers up to the number of clusters. There is a stub Combiner class that is not currently used. I will continue to develop unit tests but I wanted to get this into view because it is a real first light MR implementation.
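To make the per-iteration dataflow concrete for readers who have not opened the tar, one iteration can be sketched without any Hadoop machinery (everything here is illustrative: cluster assignment is simplified to nearest-mean for determinism, whereas the real implementation samples the cluster from a multinomial as in the earlier patch code, and state is passed in memory rather than via state-i directories):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DirichletIterationSketch {
  public static void main(String[] args) {
    double[] points = { 0.9, 1.1, 5.0, 5.2, 1.0, 4.8 };
    double[] means = { 1.0, 5.0 }; // models read from the previous iteration's state

    // "map" phase: emit (clusterId, point), keyed here by the closest model
    Map<Integer, List<Double>> grouped = new HashMap<>();
    for (double p : points) {
      int best = Math.abs(p - means[0]) < Math.abs(p - means[1]) ? 0 : 1;
      grouped.computeIfAbsent(best, k -> new ArrayList<>()).add(p);
    }

    // "reduce" phase: each reducer observes its points and computes the
    // new cluster model for the subsequent iteration
    double[] newMeans = new double[means.length];
    for (Map.Entry<Integer, List<Double>> e : grouped.entrySet()) {
      double sum = 0;
      for (double p : e.getValue())
        sum += p; // observe()
      newMeans[e.getKey()] = sum / e.getValue().size(); // computeParameters()
    }
    System.out.println(Math.round(newMeans[0] * 100) / 100.0 + " "
        + Math.round(newMeans[1] * 100) / 100.0); // 1.0 5.0
  }
}
```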

Jeff

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: dirichlet.tar, MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch, MAHOUT-30d.patch, MAHOUT-30e.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: MAHOUT-30b.patch

This patch is a complete implementation of a non-M/R Dirichlet Process clustering algorithm. It has been refactored significantly from the above. I will describe these changes in a subsequent posting. I'd like to commit this to trunk in time for 0.1 if people agree.

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-30.patch, MAHOUT-30b.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Status: Patch Available  (was: Open)

This patch is completely self-contained as it depends only upon the uncommons.math package for its distributions. I think this is ready to commit to trunk, after the release of course. The next steps will be to implement an M/R algorithm using this basic approach.

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch, MAHOUT-30d.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-30) dirichlet process implementation

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659282#action_12659282 ] 

Isabel Drost commented on MAHOUT-30:
------------------------------------

> Regarding the question of whether something should be called a Distribution or a Sampler, the mathematical terminology is that a distribution is 
> something you can sample, so the Distribution terminology would be most compatible that way. The fact that only one method is currently 
> defined is likely a temporary thing ... other methods could well be required for later efforts.

I understand. I do not have any strong objections. I think, a short class comment in DirichletDistribution would already help to avoid at least my confusion. (Although not trying to understand code at 2a.m. local time might help as well... ;) )

> I was thinking of moving the display code into the examples directory.

Sounds like a great idea to me.

> I did that so Ted could use his favorite library but he has not been pursuing it. I'm happy with blog and, as commons does not have the 
> needed sampling methods without Ted's patches, suggest we could go with blog. Removing the pluggability would clean up the code some too.

Do you know what the current status of the patches is? I must admit I have a slight preference for commons-math as well, in case it supports what we need.


> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: jeastman.vcf, MAHOUT-30.patch, MAHOUT-30b.patch, MAHOUT-30c.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (MAHOUT-30) dirichlet process implementation

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647186#action_12647186 ] 

jeastman edited comment on MAHOUT-30 at 11/12/08 8:50 PM:
--------------------------------------------------------------

I refactored again and was able to eliminate materializing the posterior {{data}} sets by adding {{observe()}} and {{computeParameters()}} operations to {{Model}}. The idea is that all models begin in their prior state and are asked to observe each sample that is assigned to them. Then, before {{pdf()}} is called on them in the next iteration, a call to {{computeParameters()}} finalizes the parameters once and turns the model into a posterior model. I also compute {{counts}} on the fly to eliminate materializing {{z}} altogether. I hope I didn't throw the baby out with the bath water.

Finally, I introduced a {{DirichletState}} bean to hold the models, the Dirichlet distribution and the mixture, simplifying the arguments and, I think, fixing a bug in the earlier refactoring. The algorithm runs over 10,000 points and produces the following outputs (prior() indicates a model with no observations, n is the number of observations, m the mean and sd the standard deviation):

Generating 4000 samples m=[1.0, 1.0] sd=3.0
Generating 3000 samples m=[1.0, 0.0] sd=0.1
Generating 3000 samples m=[0.0, 1.0] sd=0.1

* sample[0]= [prior(), normal(n=6604 m=[0.67, 0.63] sd=1.11), normal(n=86 m=[0.77, 2.81] sd=2.15), prior(), normal(n=242 m=[2.89, 1.67] sd=2.14), normal(n=2532 m=[0.53, 0.55] sd=0.69), normal(n=339 m=[0.99, 1.70] sd=2.18), normal(n=77 m=[0.53, 0.47] sd=0.51), normal(n=119 m=[0.36, 0.47] sd=2.85), normal(n=1 m=[0.00, 0.00] sd=0.33)]
* sample[1]= [prior(), normal(n=6626 m=[0.62, 0.54] sd=0.91), normal(n=137 m=[0.51, 2.99] sd=1.56), normal(n=2 m=[0.57, 0.25] sd=0.70), normal(n=506 m=[2.55, 0.93] sd=1.73), normal(n=1573 m=[0.38, 0.60] sd=0.50), normal(n=848 m=[0.81, 1.59] sd=2.11), normal(n=67 m=[0.76, 0.31] sd=0.45), normal(n=240 m=[0.73, 0.31] sd=2.24), normal(n=1 m=[0.00, 0.00] sd=0.98)]
* sample[2]= [prior(), normal(n=5842 m=[0.67, 0.39] sd=0.73), normal(n=157 m=[0.73, 3.12] sd=1.14), prior(), normal(n=655 m=[2.32, 0.64] sd=1.60), normal(n=1439 m=[0.00, 1.00] sd=0.33), normal(n=1439 m=[0.78, 1.53] sd=1.89), normal(n=66 m=[0.96, -0.04] sd=0.24), normal(n=399 m=[0.63, -0.03] sd=1.99), normal(n=3 m=[-0.07, 0.76] sd=0.41)]

{code:title=Model}
/**
 * A model is a probability distribution over observed data points and allows 
 * the probability of any data point to be computed.
 */
public interface Model<Observation> {
  
  /**
   * Observe the given observation, retaining information about it
   * 
   * @param x an Observation from the posterior
   */
  public abstract void observe(Observation x);
  
  /**
   * Compute a new set of posterior parameters based upon the Observations 
   * that have been observed since my creation
   */
  public abstract void computeParameters();

  /**
   * Return the probability that the observation is described by this model
   * 
   * @param x an Observation from the posterior
   * @return the pdf value of x under this model
   */
  public abstract double pdf(Observation x);
}
{code}
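To make the lifecycle concrete, here is a minimal sketch of what a conforming model might look like: a univariate normal that accumulates sufficient statistics in {{observe()}} and finalizes them once in {{computeParameters()}}. The class and field names are illustrative only and are not part of the patch.

```java
// Hypothetical sketch of a univariate normal model following the
// observe()/computeParameters()/pdf() lifecycle described above.
public class NormalModelSketch {
  private double sum = 0;    // running sum of observations
  private double sumSq = 0;  // running sum of squared observations
  private int n = 0;         // number of observations
  private double mean = 0;   // set by computeParameters()
  private double sd = 1;     // set by computeParameters()

  /** Accumulate sufficient statistics for the observation. */
  public void observe(double x) {
    sum += x;
    sumSq += x * x;
    n++;
  }

  /** Finalize posterior parameters from the accumulated statistics. */
  public void computeParameters() {
    if (n == 0)
      return;                                // no observations: stay in prior state
    mean = sum / n;
    double var = sumSq / n - mean * mean;    // population variance
    sd = Math.sqrt(Math.max(var, 1e-9));     // guard against zero variance
  }

  /** Evaluate the normal density at x using the finalized parameters. */
  public double pdf(double x) {
    double d = (x - mean) / sd;
    return Math.exp(-0.5 * d * d) / (sd * Math.sqrt(2 * Math.PI));
  }

  public int count() { return n; }
  public double mean() { return mean; }
}
```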

{code:title=DirichletCluster}
  /**
   * Initialize the variables and run the iterations to assign the sample data
   * points to a computed number of clusters
   *
   * @return a List<List<Model<Observation>>> of the observed models
   */
  public List<List<Model<Observation>>> dirichletCluster() {
    DirichletState<Observation> state = initializeState();

    // create a posterior sample list to collect results
    List<List<Model<Observation>>> clusterSamples = new ArrayList<List<Model<Observation>>>();

    // now iterate
    for (int iteration = 0; iteration < maxIterations; iteration++)
      iterate(state, iteration, clusterSamples);

    return clusterSamples;
  }

  /**
   * Initialize the state of the computation
   * 
   * @return the DirichletState
   */
  private DirichletState<Observation> initializeState() {
    // get initial prior models
    List<Model<Observation>> models = createPriorModels();
    // create the initial distribution.
    DirichletDistribution distribution = new DirichletDistribution(maxClusters,
        alpha_0, dist);
    // mixture parameters are sampled from the Dirichlet distribution. 
    Vector mixture = distribution.sample();
    return new DirichletState<Observation>(models, distribution, mixture);
  }

  /**
   * Create a list of prior models, one per potential cluster
   * @return a List of Model<Observation> sampled from the prior
   */
  private List<Model<Observation>> createPriorModels() {
    List<Model<Observation>> models = new ArrayList<Model<Observation>>();
    for (int k = 0; k < maxClusters; k++) {
      models.add(modelFactory.sampleFromPrior());
    }
    return models;
  }

  /**
   * Perform one iteration of the clustering process, updating the state for the next iteration
   * @param state the DirichletState<Observation> of this iteration
   * @param iteration the int iteration number
   * @param clusterSamples a List<List<Model<Observation>>> that will be modified in each iteration
   */
  private void iterate(DirichletState<Observation> state, int iteration,
      List<List<Model<Observation>>> clusterSamples) {

    // create new prior models
    List<Model<Observation>> newModels = createPriorModels();

    // initialize vector of membership counts
    Vector counts = new DenseVector(maxClusters);
    counts.assign(alpha_0 / maxClusters);

    // iterate over the samples
    for (Observation x : sampleData) {
      // compute vector of probabilities x is described by each model
      Vector pi = computeProbabilities(state, x);
      // then pick one cluster by sampling a Multinomial distribution based upon them
      // see: http://en.wikipedia.org/wiki/Multinomial_distribution
      int model = dist.rmultinom(pi);
      // ask the selected model to observe the datum
      newModels.get(model).observe(x);
      // record counts for the model
      counts.set(model, counts.get(model) + 1);
    }

    // compute new model parameters based upon observations
    for (Model<Observation> m : newModels)
      m.computeParameters();

    // update the state from the new models and counts
    state.distribution.setAlpha(counts);
    state.mixture = state.distribution.sample();
    state.models = newModels;

    // periodically add models to cluster samples after getting started
    if ((iteration > burnin) && (iteration % thin == 0))
      clusterSamples.add(state.models);
  }

  /**
   * Compute a vector of relative probabilities that x is described by
   * each model, using the mixture and the model pdfs, scaled so that
   * the largest entry is 1
   * 
   * @param state the DirichletState<Observation> of this iteration
   * @param x an Observation
   * @return the Vector of relative probabilities
   */
  private Vector computeProbabilities(DirichletState<Observation> state,
      Observation x) {
    Vector pi = new DenseVector(maxClusters);
    double max = 0;
    for (int k = 0; k < maxClusters; k++) {
      double p = state.mixture.get(k) * state.models.get(k).pdf(x);
      pi.set(k, p);
      if (max < p)
        max = p;
    }
    // normalize the probabilities by largest observed value
    pi.assign(new TimesFunction(), 1.0 / max);
    return pi;
  }
{code}
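For reference, the multinomial draw in {{iterate()}} amounts to picking one index with probability proportional to the (possibly unnormalized) weights in {{pi}}. A minimal sketch of such a draw, which assumes nothing about the actual {{dist.rmultinom()}} implementation:

```java
import java.util.Random;

// Hypothetical sketch of a single multinomial draw: return index k with
// probability pi[k] / sum(pi). The weights need not be normalized.
public class MultinomialSketch {
  public static int rmultinom(double[] pi, Random rng) {
    double total = 0;
    for (double p : pi)
      total += p;
    double u = rng.nextDouble() * total;  // uniform draw on [0, total)
    double cum = 0;
    for (int k = 0; k < pi.length; k++) {
      cum += pi[k];
      if (u < cum)
        return k;
    }
    return pi.length - 1;  // guard against floating-point round-off
  }
}
```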


> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>         Attachments: MAHOUT-30.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.