Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2010/01/27 18:46:34 UTC

[jira] Created: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

Make ClusterDumper dump Dirichlet clusters too
----------------------------------------------

                 Key: MAHOUT-270
                 URL: https://issues.apache.org/jira/browse/MAHOUT-270
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.2
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman


Given the binary representation of models/clusters in Dirichlet, extend the ClusterDumper utility to dump out a printable representation of them too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806873#action_12806873 ] 

Jeff Eastman commented on MAHOUT-270:
-------------------------------------

In the beginning, vectors, canopies and clusters needed a serialization mechanism and asFormatString() was invented. Also invented but not expressed in interfaces were their deserialization counterparts, static methods decodeCanopy(), decodeCluster() and decodeVector(). These ad-hoc encodings worked adequately for a time but were soon replaced by standard Json encodings as newer entities embodied more complicated state and the ad-hoc methods became unworkable. Shortly after that, the quest for speed (and improvements in Hadoop support) led to the adoption of Writable encodings and SequenceFiles by all Mahout entities.

Of course, binary encodings are impossible to use for debugging, so some clustering entities use asFormatString() as their toString() implementations and also as a human-readable option for final output. As more kinds of clustering were implemented, some refactoring was indicated and ClusterBase was invented to abstract out the center and centroid calculations common among them. Then came Dirichlet, which has no notion of centers or centroid calculations, so it makes little sense to generalize Dirichlet clusters under ClusterBase. DirichletClusters have only a domain-specific Model and a totalCount, and these are serialized and deserialized entirely using Writable (asFormatString() only prints the model's toString() output and there is no decode() static method).

Even more recently, users doing text clustering needed better sparse vector implementations and utilities for working with term vectors. ClusterDumper and VectorHelper utilities were added to meet these needs. ClusterDumper can output either a Json encoding of the center of a cluster or a VectorHelper.vectorToString() representation which can include a term dictionary to make the output more human-readable.

It should now be obvious to all that making ClusterDumper dump DirichletClusters too will take some serious refactoring. I have some thoughts about how to accomplish that, but it seems to be a good time to revisit the user requirements so we do not perpetuate unnecessary or obsolete stuff. Could I get some comments on the following requirements?

1. We need an efficient, binary encoding for serialization and deserialization. (I take this as a given and that Writable is it, but feel free to disagree)
2. We need a Json encoding for serialization and deserialization.
3. We need a complete, human-readable encoding for output only. (Json qualifies here)
4. We need a human-readable encoding for output only. (Json qualifies here too but others may be more usable)
5. We need a human-readable toString() encoding for debugging only.



[jira] Issue Comment Edited: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831678#action_12831678 ] 

Jeff Eastman edited comment on MAHOUT-270 at 2/9/10 9:39 PM:
-------------------------------------------------------------

r908235 commits the Printable interface and implements it in ClusterBase, DirichletCluster and Model. It does not modify ClusterDumper so current dump formats are unchanged, but that is the remaining task to complete this issue.

Here is what was done versus the above requirements:
1. We need an efficient, binary encoding for serialization and deserialization: No changes to Writable utilization
2. We need a Json encoding for serialization and deserialization: asJsonString in Printable interface supports this
3. We need a complete, human-readable encoding for output only: recommend using asJsonString if completeness is needed
4. We need a human-readable encoding for output only: asFormatString(bindings) implements simplified, more readable notations with optional bindings
5. We need a human-readable toString() encoding for debugging only: toString generally calls asFormatString(null)



[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830741#action_12830741 ] 

Jeff Eastman commented on MAHOUT-270:
-------------------------------------

I'd like to deprecate the asFormatString() methods in Vector, Matrix and elsewhere, replacing them with a new interface:

{code}
public interface Printable {

  /**
   * Produce a custom, printable representation of the receiver.
   * 
   * @param bindings an optional String[] containing label bindings used to format the primary 
   *    Vector/s of this implementation.
   * @return a String
   */
  public String asFormatString(String[] bindings);

  /**
   * Produce a printable representation of the receiver using Json. (Label bindings
   * are transient and not part of the Json representation)
   * 
   * @return a Json String
   */
  public String asJsonString();

}
{code}

This interface would be implemented by all classes that currently implement Writable and that need to be dumped by the cluster dumper. This includes ClusterBase, DirichletCluster and Model. Implementing these changes would allow the cluster dumper to dump both species of clusters.
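
As a rough illustration of how an implementer might satisfy the proposed interface, here is a minimal sketch. SketchCluster and its fields are hypothetical and are not Mahout classes, the center is a plain double[] rather than a Mahout Vector, and the output formats are invented for the example. The Json string is built by hand only to keep the sketch free of dependencies.

```java
// Printable as proposed in this comment (redundant "public" modifiers dropped).
interface Printable {
  String asFormatString(String[] bindings);
  String asJsonString();
}

// Hypothetical implementer for illustration only; not a Mahout class.
class SketchCluster implements Printable {
  private final int id;
  private final double[] center; // assumption: stands in for a Mahout Vector

  SketchCluster(int id, double[] center) {
    this.id = id;
    this.center = center.clone();
  }

  @Override
  public String asFormatString(String[] bindings) {
    StringBuilder sb = new StringBuilder("C").append(id).append('[');
    for (int i = 0; i < center.length; i++) {
      if (i > 0) {
        sb.append(", ");
      }
      // Use the label when one is bound to this index, else fall back to the index.
      boolean hasLabel = bindings != null && i < bindings.length && bindings[i] != null;
      sb.append(hasLabel ? bindings[i] : String.valueOf(i)).append(':').append(center[i]);
    }
    return sb.append(']').toString();
  }

  @Override
  public String asJsonString() {
    // Label bindings are transient, so they do not appear here.
    StringBuilder sb = new StringBuilder("{\"id\":").append(id).append(",\"center\":[");
    for (int i = 0; i < center.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(center[i]);
    }
    return sb.append("]}").toString();
  }

  @Override
  public String toString() {
    // Debug output reuses the unlabeled format.
    return asFormatString(null);
  }
}
```

With this sketch, new SketchCluster(3, new double[] {1.5, 2.0}).asFormatString(new String[] {"apple", "pear"}) would label each component with its term, while passing null falls back to numeric indices.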




[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854742#action_12854742 ] 

Jeff Eastman commented on MAHOUT-270:
-------------------------------------

r931372 renames Printable to Cluster and adds getId, getNumPoints and getCenter methods needed by the original ClusterDumper. Updated the ClusterDumper to use the new interface. Added unit tests of Canopy, KMeans and Dirichlet all using the same ClusterDumper basic display. All tests run but more polishing and testing is needed to ensure all features are working correctly.
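
To make the shape of that change concrete, here is a minimal sketch of a dumper written against a single interface. Only the method names getId, getNumPoints, getCenter, asFormatString and asJsonString come from this thread. The return types, the FixedCluster and DumperSketch classes, and the output layout are assumptions made for illustration and do not reflect the committed code.

```java
// Stand-in for the interface after the rename. Method names are from the
// comment above; return types are assumptions (double[] replaces a Vector).
interface Cluster {
  int getId();
  int getNumPoints();
  double[] getCenter();
  String asFormatString(String[] bindings);
  String asJsonString();
}

// Hypothetical cluster used only to exercise the dumper below.
class FixedCluster implements Cluster {
  private final int id;
  private final int numPoints;
  private final double[] center;

  FixedCluster(int id, int numPoints, double[] center) {
    this.id = id;
    this.numPoints = numPoints;
    this.center = center.clone();
  }

  @Override public int getId() { return id; }
  @Override public int getNumPoints() { return numPoints; }
  @Override public double[] getCenter() { return center.clone(); }

  @Override
  public String asFormatString(String[] bindings) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < center.length; i++) {
      if (i > 0) {
        sb.append(", ");
      }
      boolean hasLabel = bindings != null && i < bindings.length && bindings[i] != null;
      sb.append(hasLabel ? bindings[i] : String.valueOf(i)).append(':').append(center[i]);
    }
    return sb.append(']').toString();
  }

  @Override
  public String asJsonString() {
    StringBuilder sb = new StringBuilder("{\"id\":").append(id)
        .append(",\"numPoints\":").append(numPoints).append(",\"center\":[");
    for (int i = 0; i < center.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(center[i]);
    }
    return sb.append("]}").toString();
  }
}

// Hypothetical dumper: it depends only on the Cluster interface, so any
// implementer (Canopy, KMeans or Dirichlet style) could be passed in.
class DumperSketch {
  static String dump(Cluster[] clusters, String[] dictionary, boolean json) {
    StringBuilder out = new StringBuilder();
    for (Cluster c : clusters) {
      out.append("id=").append(c.getId())
          .append(" n=").append(c.getNumPoints())
          .append(' ')
          .append(json ? c.asJsonString() : c.asFormatString(dictionary))
          .append('\n');
    }
    return out.toString();
  }
}
```

The point of the design is that the dumper never needs to know which clustering algorithm produced a cluster; it asks each one for its id, point count and a printable form.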



[jira] Resolved: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman resolved MAHOUT-270.
---------------------------------

    Resolution: Fixed

Resolving this, as all tests pass and the code is stable.



[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831678#action_12831678 ] 

Jeff Eastman commented on MAHOUT-270:
-------------------------------------

r908235 commits the Printable interface and implements it in ClusterBase, DirichletCluster and Model. It does not modify ClusterDumper so current dump formats are unchanged, but that is the remaining task to complete this issue.
