Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/08/17 00:48:31 UTC

Clustering Issues

Hi All,

I'm back, very relaxed and tan. I was a bit intimidated by the 750+ 
postings, but I waded through them and it looks like MAHOUT-479 is the 
thing for me to focus on in the short term. In that regard, and given the 
recent refactoring of all the clustering data structures around 
AbstractCluster, plus the driver changes to unify under AbstractJob, can 
somebody net out the remaining areas for improvement?

Currently, I see an overlap between Model and Cluster: all Models 
support Cluster, and all AbstractClusters support observation methods 
similar to those in Model. Factoring an Observable interface out of 
Model and making AbstractCluster support it might clean this up, but it 
bears further discussion. Something like:

// unifies common operations relating to observing posterior data
interface Observable {
   void observe(Vector x, double weight);
   void observe(Vector x);
   void observe(ClusterObservations observations);
   void computeParameters();
   ClusterObservations getObservations();
}

// unifies common attributes needed by ClusterDumper and graphical
// display routines
interface Cluster {
   int getId();
   Vector getCenter();
   Vector getRadius();
   int getNumPoints();
   String asFormatString(String[] bindings);
   String asJsonString();
}

// specific to Dirichlet-style probabilistic clusters
interface Model extends Writable, Cluster, Observable {
   double pdf(Vector x);
}

// base class for non-Dirichlet clustering implementations
abstract class AbstractCluster implements Writable, Cluster, Observable {}
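
To make the split a bit more concrete, here is a minimal, self-contained sketch of how the two roles might separate in practice. It is only an illustration under loud assumptions: double[] stands in for Vector, Writable is left out, the interfaces are trimmed to a few methods, and SimpleCluster, ObservableSketch, train and describe are hypothetical names, not existing Mahout code. The point is that training-side code would only need Observable, while dumper or display code would only need Cluster.

```java
import java.util.Arrays;

// Trimmed-down stand-ins for the proposed interfaces (assumption: double[]
// replaces Vector, and Writable is omitted to keep the sketch standalone).
interface Observable {
    void observe(double[] x, double weight);
    void computeParameters();
}

interface Cluster {
    int getId();
    double[] getCenter();
    int getNumPoints();
}

// Hypothetical cluster whose center is the weighted mean of its observations.
class SimpleCluster implements Cluster, Observable {
    private final int id;
    private final double[] sum;   // weighted sum of points since the last update
    private double totalWeight;   // total weight since the last update
    private final double[] center;
    private int numPoints;        // number of observe() calls so far

    SimpleCluster(int id, double[] initialCenter) {
        this.id = id;
        this.center = initialCenter.clone();
        this.sum = new double[initialCenter.length];
    }

    @Override
    public void observe(double[] x, double weight) {
        for (int i = 0; i < x.length; i++) {
            sum[i] += weight * x[i];
        }
        totalWeight += weight;
        numPoints++;
    }

    @Override
    public void computeParameters() {
        if (totalWeight == 0.0) {
            return; // nothing observed, keep the previous center
        }
        for (int i = 0; i < sum.length; i++) {
            center[i] = sum[i] / totalWeight;
        }
        Arrays.fill(sum, 0.0);
        totalWeight = 0.0;
    }

    @Override public int getId() { return id; }
    @Override public double[] getCenter() { return center.clone(); }
    @Override public int getNumPoints() { return numPoints; }
}

class ObservableSketch {
    // Training-side code depends only on Observable.
    static void train(Observable target, double[][] points) {
        for (double[] p : points) {
            target.observe(p, 1.0);
        }
        target.computeParameters();
    }

    // Display-side code depends only on Cluster.
    static String describe(Cluster c) {
        return "C-" + c.getId() + " n=" + c.getNumPoints()
            + " c=" + Arrays.toString(c.getCenter());
    }

    public static void main(String[] args) {
        SimpleCluster c = new SimpleCluster(7, new double[] {0.0, 0.0});
        train(c, new double[][] {{1.0, 2.0}, {3.0, 4.0}});
        System.out.println(describe(c));
    }
}
```

In this sketch a Dirichlet-style Model would just be a third interface adding pdf() on top of the same two, so neither train nor describe would need to know which kind of cluster it was handed.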

This would offer some improvement IMHO, but I'm not clear whether it 
would improve plug-n-playness with Classification. Is there a more 
theoretical "DistanceMeasure"-like discussion that would be germane?

Jeff