Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2008/11/12 04:43:44 UTC

[jira] Commented: (MAHOUT-30) dirichlet process implementation

    [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646790#action_12646790 ] 

Jeff Eastman commented on MAHOUT-30:
------------------------------------

I did some refactoring to better localize the major processing steps in DirichletCluster. The first step was to create an iterate method that handles each iteration once the initialization phase has set up the models, the Dirichlet distribution and the mixture. Then I factored out the three sub-processing steps: assignPointsToModels, recomputeModels and computeNewMixtureParameters, and moved some of the instance variables to locals. Looking at the result, it is starting to feel more like the other clustering algorithms, so I'm starting to think in terms of an M/R partitioning.

* In assignPointsToModels, we are iterating over all of the points, calculating their pdf with respect to the current clusters (models), applying the mixture probabilities and then using multinomial sampling to assign each point to a single cluster for this iteration. In a subsequent iteration we might assign them to different models, depending upon the probabilities.
* In recomputeModels, we use the model assignments from this iteration to compute the posterior samples for each of the models. Then we generate a new model that best describes this sub-sample. The ah-ha for me here is that we create entirely new models in each iteration.
* In iterate, we periodically output the models so we can gain whatever wisdom we wish from their parameters. I didn't factor that out but may later.
* In computeNewMixtureParameters, we use the same model assignments to compute a new alpha vector for the Dirichlet distribution, and from it a new mixture vector. Then we iterate over the points again, continually refining these values. The ah-ha for me here is that we do not actually remember the model assignments. _I think, after the models have all been computed, we could run another pass over the points, computing the pdf for each cluster as a probability vector. We do this in Kmeans too._
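
To make that last idea concrete, here is a rough sketch of such a final pass for a single point. It reuses the same pdf-times-mixture computation as assignPointsToModels and just normalizes the result into a probability vector; the method name is illustrative and this is not part of the patch:

{code}
/**
 * Sketch of a possible output pass: compute, for one point, the normalized
 * probability that each current model generated it (analogous to the kmeans
 * output step). Illustrative only; not part of the patch.
 */
private Vector clusterProbabilities(Vector x, List<Model<Vector>> models, Vector mixture) {
  Vector p = new DenseVector(maxClusters);
  for (int k = 0; k < maxClusters; k++)
    p.set(k, mixture.get(k) * models.get(k).pdf(x));
  // normalize so the entries sum to 1, using the same idiom as assignPointsToModels
  p.assign(new TimesFunction(), 1.0 / p.zSum());
  return p;
}
{code}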

The first step seems very similar to the KmeansMapper: we read the Dirichlet, mixture and the current models into each mapper, then compute the multinomial and output the point keyed by its model index to the reducer. In a single reducer, all of the points are seen, so we can calculate the posterior statistics if we switch the model to use a running-sums (s0, s1, s2) algorithm for the std; a sketch of that bookkeeping follows. Following this approach, we might be able to use a combiner to maintain the running sums (assuming Hadoop only runs it once for now), thus reducing the heavy load on the poor reducer. With a single reducer, I'm pretty sure we can compute the new mixture and Dirichlet parameters. With multiple reducers, I think those calculations would need help combining the reducer values in the driver.
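
To make the running-sums idea concrete, here is a minimal sketch of the bookkeeping (the class and method names are just illustrative). Partial sums combine by simple addition, so mappers or a combiner can pre-aggregate them before the reducer recovers the mean and std from the totals:

{code}
// Illustrative running sums: s0 = count, s1 = sum, s2 = sum of squares.
static class RunningSums {
  double s0, s1, s2;

  void observe(double x) {
    s0++;
    s1 += x;
    s2 += x * x;
  }

  // what a combiner or the reducer does with partial results
  void merge(RunningSums other) {
    s0 += other.s0;
    s1 += other.s1;
    s2 += other.s2;
  }

  double mean() {
    return s1 / s0;
  }

  double std() {
    // population standard deviation recovered from the three sums
    return Math.sqrt(s0 * s2 - s1 * s1) / s0;
  }
}
{code}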

Ted, I know you've thought about this a lot, do these steps sound reasonable?


{code:title=DirichletCluster.java}
 /**
   * Perform one iteration of the clustering process, updating the Dirichlet distribution
   * and returning the new mixtures
   * 
   * @param models a List<Model<Vector>> of models
   * @param dir a DirichletDistribution that will be modified in each iteration
   * @param mixture a Vector mixture 
   * @param clusterSamples a List<List<Model<Vector>>> that will be modified in each iteration
   * @param iteration the int iteration number
   * @return a new mixture Vector
   */
  private Vector iterate(List<Model<Vector>> models, DirichletDistribution dir,
      Vector mixture, List<List<Model<Vector>>> clusterSamples, int iteration) {
    // z holds the assignment of cluster numbers to points for this iteration
    Vector z = assignPointsToModels(models, mixture);
    models = recomputeModels(z);

    // periodically add models to cluster samples after getting started
    if ((iteration > burnin) && (iteration % thin == 0))
      clusterSamples.add(models);

    // compute and return the new mixture parameters (adjusting 
    // the Dirichlet dir's alpha vector in the process)
    return computeNewMixtureParameters(dir, z);
  }

  /**
   * Assign all points to a model based upon the current mixture parameters and
   * the probability that each point is described by each model's pdf.
   * 
   * @param models the current List<Model<Vector>> of models
   * @param mixture the Vector of mixture probabilities
   * @return z the Vector of assignment of points to models
   */
  private Vector assignPointsToModels(List<Model<Vector>> models, Vector mixture) {
    // z holds the assignment of cluster numbers to points for this iteration
    Vector z = new DenseVector(sampleData.size());

    Vector pi = new DenseVector(maxClusters);
    for (int ix = 0; ix < sampleData.size(); ix++) {
      Vector x = sampleData.get(ix);
      // compute probabilities for each cluster, saving the largest
      double max = 0;
      for (int k = 0; k < maxClusters; k++) {
        double p = mixture.get(k) * models.get(k).pdf(x);
        pi.set(k, p);
        if (max < p)
          max = p;
      }
      // normalize the probabilities by largest observed value
      pi.assign(new TimesFunction(), 1.0 / max);
      // then pick one cluster by sampling a multinomial based upon the probabilities
      int model = dist.rmultinom(pi);
      z.set(ix, model);
    }
    return z;
  }

  /**
   * Recompute new models based upon the assignment of points to models and return them
   * 
   * @param z the Vector of assignment of points to models
   * @return the List<Model<Vector>> of models
   */
  private List<Model<Vector>> recomputeModels(Vector z) {
    List<Model<Vector>> newModels = new ArrayList<Model<Vector>>();
    for (int k = 0; k < maxClusters; k++) {
      // collect all data assigned to each cluster
      List<Vector> data = new ArrayList<Vector>();
      for (int i = 0; i < sampleData.size(); i++)
        if (z.get(i) == k)
          data.add(sampleData.get(i));
      // add a new posterior model if any data else use a prior model
      if (data.size() > 0)
        newModels.add(modelFactory.sampleFromPosterior(data));
      else
        newModels.add(modelFactory.sampleFromPrior());
    }
    return newModels;
  }

  /**
   * Compute a new mixture based upon the Dirichlet distribution and the assignment of points to models.
   * 
   * @param dir the DirichletDistribution, which is modified during this process
   * @param z the Vector of assignment of points to models
   * @return the new mixture Vector
   */
  private Vector computeNewMixtureParameters(DirichletDistribution dir, Vector z) {
    Vector mixture;
    Vector a = new DenseVector(maxClusters);
    a.assign(alpha_0 / maxClusters);
    for (int i = 0; i < z.size(); i++) {
      int z_i = (int) z.get(i);
      a.set(z_i, a.get(z_i) + 1);
    }
    dir.setAlpha(a);
    mixture = dir.sample();
    return mixture;
  }
{code}
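
The dist.rmultinom(pi) call above is the one piece not shown here. For reference, this is roughly what I assume it does, given that pi holds unnormalized probabilities (rand as a java.util.Random field is an assumption for the sketch):

{code}
// Sketch of multinomial sampling from an unnormalized probability vector:
// picks index k with probability pi.get(k) / pi.zSum().
private int rmultinom(Vector pi) {
  double r = rand.nextDouble() * pi.zSum(); // uniform draw on [0, total)
  double cumulative = 0;
  for (int k = 0; k < pi.size(); k++) {
    cumulative += pi.get(k);
    if (r < cumulative)
      return k;
  }
  return pi.size() - 1; // guard against floating point rounding
}
{code}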

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>         Attachments: MAHOUT-30.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. The implementation is only slightly more difficult and the result is a (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.