Posted to dev@mahout.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2008/04/14 18:13:04 UTC

[jira] Updated: (MAHOUT-19) Hierarchial clusterer

     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: TestBottomFeed.test.png
                TestTopFeed.test.png
                MAHOUT-19.txt

Implementation with non-distributed top feed and bottom feed. The latter is used by the map reduce tree feed when the tree is too young for it to make sense to distribute. Top feed is mostly a proof of concept to show the difference.

Patch also contains MAHOUT-20 and MAHOUT-36.

Attached is also the graphviz output from the generated test data I've placed in test/resources.

There are a couple of changes and additions to the core that should be extracted to their own patches (a rough usage sketch follows the list):

 * build.xml creates a job file like nutch does.
 * Mht - reads ARFF-ish text files into a Matrix and maps ordinal values to doubles.
 * Vector#fill(double)
 * Vector#fill(double, int, int)
 * VectorStringUtils - converts a vector to a string and vice versa
 * DoubleWritable - see HADOOP-3243
 * SparseVectorWritable - accepts any vector but resolves as SparseVector
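
A rough usage sketch of the Vector#fill and VectorStringUtils additions listed above (shown over a plain double[] for brevity; the real methods live on the Vector class and the exact signatures in the patch may differ):

{code}
import java.util.Arrays;

// Illustrative semantics only, not the patch's code.
public class FillAndStringSketch {

  /** Like Vector#fill(double): set every element to the given value. */
  static void fill(double[] v, double value) {
    Arrays.fill(v, value);
  }

  /** Like Vector#fill(double, int, int): set elements in [from, to) to the given value. */
  static void fill(double[] v, double value, int from, int to) {
    Arrays.fill(v, from, to, value);
  }

  /** Like VectorStringUtils: round-trip a vector through a string. */
  static String toString(double[] v) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < v.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(v[i]);
    }
    return sb.toString();
  }

  static double[] fromString(String s) {
    String[] parts = s.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }

  public static void main(String[] args) {
    double[] v = new double[5];
    fill(v, 1.0);        // [1.0, 1.0, 1.0, 1.0, 1.0]
    fill(v, 2.0, 1, 3);  // [1.0, 2.0, 2.0, 1.0, 1.0]
    System.out.println(toString(v));
    System.out.println(Arrays.equals(v, fromString(toString(v))));
  }
}
{code}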


I simulate the Hadoop framework to test DistributedBottomFeed, as all the driver code is inside the same class and will not run for obvious reasons. I'll refactor and get it running in Hadoop for real soon.

The tree is an abstract relation storage that needs lots of refactoring.

From Package.html:

h3. Feeding a tree with instances

All nodes in the tree are either branches with two children, leaves (instances), or the root.

{noformat}
      root
      2.75
     /   \
  leaf1  branch1
   4.0     1.5
          /   \
       leaf2  leaf3
        1.0    2.0
{noformat}

A new instance is inserted into the tree as the child of a new branch next to the closest existing node.

Insert new instance 1.2:

{noformat}
      root
      2.775
     /   \
  leaf1  branch1
   4.0     1.55
          /   \
       branch2 leaf3
         1.1    2.0
        /   \
     leaf2 leaf4
      1.0   1.2
{noformat}

<p>
  Adding an instance requires sequentially recalculating the mean value (in vector space) of all the nodes from the new
  leaf to the root. This is a <b>non distributable operation</b>. Finding where to insert the new leaf is, however.
  Traditionally this is done by top or bottom feeding: find the closest leaf node by navigating from the root
  towards the closest child until a leaf is reached, or by comparing against all leaves and from there navigating to the
  closest node.
</p>
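
A minimal non-distributed sketch of that insert step (class names and the one-dimensional values are illustrative, not the patch's classes; a real implementation works on vectors):

{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative in-memory version of the insert described above.
class Node {
  Node parent, left, right;
  double mean; // mean of the two children, or the instance value for a leaf

  Node(double mean) { this.mean = mean; }

  boolean isLeaf() { return left == null && right == null; }
}

class BottomFeedTree {
  Node root;

  /** Insert a new instance as the sibling of the closest existing node. */
  void insert(double instance) {
    Node leaf = new Node(instance);
    if (root == null) { root = leaf; return; }

    Node closest = findClosest(instance);   // the part that can be distributed
    Node parent = closest.parent;
    Node branch = new Node(0);              // new branch takes the place of 'closest'
    branch.parent = parent;
    branch.left = closest;
    branch.right = leaf;
    closest.parent = branch;
    leaf.parent = branch;
    if (parent == null) root = branch;
    else if (parent.left == closest) parent.left = branch;
    else parent.right = branch;

    // Sequentially recalculate the means from the new branch up to the root:
    // this is the non-distributable part.
    for (Node n = branch; n != null; n = n.parent) {
      n.mean = (n.left.mean + n.right.mean) / 2.0;
    }
  }

  /** Brute-force closest node; stands in for the top/bottom feed strategies. */
  private Node findClosest(double instance) {
    List<Node> all = new ArrayList<Node>();
    collect(root, all);
    Node best = root;
    for (Node n : all) {
      if (Math.abs(n.mean - instance) < Math.abs(best.mean - instance)) best = n;
    }
    return best;
  }

  private void collect(Node n, List<Node> out) {
    if (n == null) return;
    out.add(n);
    collect(n.left, out);
    collect(n.right, out);
  }
}
{code}

Starting from the first example tree, inserting 1.2 with this sketch reproduces the second diagram: the new leaf becomes leaf2's sibling under a new branch (mean 1.1) and the means along the path to the root update to 1.55 and 2.775.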

<p>
  The first distributed implementation will assume you have an unlimited number of Hadoop nodes and calculate the
  distance to all nodes: root, branches and leaves, in one large job in search of the closest one.
</p>
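
Roughly, as one job (this sketch uses the old org.apache.hadoop.mapred API and plain text records; the real DistributedBottomFeed serializes nodes and the instance differently, and the instance would come from the JobConf rather than being hard coded):

{code}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative sketch of the "one large job" idea.
// Input: one tree node per line, "nodeId<TAB>mean".
public class ClosestNodeJob {

  public static class DistanceMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, Text> {

    private final double instance = 1.2; // would be read from the JobConf in real code

    public void map(LongWritable offset, Text line,
                    OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t");
      double distance = Math.abs(Double.parseDouble(parts[1]) - instance);
      // Emit everything under a single key so one reducer can pick the minimum.
      out.collect(NullWritable.get(), new Text(distance + "\t" + parts[0]));
    }
  }

  public static class MinReducer extends MapReduceBase
      implements Reducer<NullWritable, Text, Text, Text> {

    public void reduce(NullWritable key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      double best = Double.MAX_VALUE;
      String bestNode = null;
      while (values.hasNext()) {
        String[] parts = values.next().toString().split("\t");
        double distance = Double.parseDouble(parts[0]);
        if (distance < best) {
          best = distance;
          bestNode = parts[1];
        }
      }
      out.collect(new Text(bestNode), new Text(Double.toString(best)));
    }
  }
}
{code}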

<p>
  A second implementation will spawn multiple jobs before it knows where to insert an instance:
  <ul>
    <li>find the closest leaf node</li>
    <li>find the closest node between that leaf and the root (see the sketch after this list)</li>
  </ul>
</p>
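
The interesting part of the second approach is that second step (sketched here non-distributed, reusing the Node sketch from the previous section): only the nodes on the path from the closest leaf up to the root need to be compared.

{code}
// Illustrative only: given the closest leaf from the first job, walk towards
// the root and keep the closest node seen on the way.
class PathFeed {
  static Node closestOnPathToRoot(Node closestLeaf, double instance) {
    Node best = closestLeaf;
    for (Node n = closestLeaf; n != null; n = n.parent) {
      if (Math.abs(n.mean - instance) < Math.abs(best.mean - instance)) {
        best = n;
      }
    }
    return best;
  }
}
{code}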

<p>
  Adding instances to a young tree does not make sense to distribute! The first couple of hundred instances (or so,
  depending on vector size) could be bottom fed non-distributed. (By the way, what is a good antonym to distributed?)
</p>

h3. Clustering

<p>
  Clustering means navigating the tree from a given node using some set of rules. Usually this means looking at the
  distance between the children of a node or to the parent of a node. The default max distance computations are based on
  the mean distance between leaves that are siblings to other leaves, or the mean distance to all leaf node siblings. I'm
  certain there are much better solutions based on the distance to the root, combined in various ways with the previously
  mentioned values. Or find the value using classes in training data? What algorithm could that be?
</p>
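
A minimal sketch of that navigation (reusing the Node sketch above, with distance taken as the absolute difference of means and a single max distance threshold; the default max distance rules described above are more involved):

{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Illustrative only: gather all leaves reachable from a starting node without
// crossing an edge whose endpoints are further apart than maxDistance.
class ClusterExtractor {

  static List<Node> cluster(Node start, double maxDistance) {
    List<Node> leaves = new ArrayList<Node>();
    Set<Node> visited = new HashSet<Node>();
    Queue<Node> queue = new LinkedList<Node>();
    queue.add(start);
    while (!queue.isEmpty()) {
      Node n = queue.remove();
      if (!visited.add(n)) continue;
      if (n.isLeaf()) leaves.add(n);
      for (Node neighbour : new Node[] { n.parent, n.left, n.right }) {
        if (neighbour != null && !visited.contains(neighbour)
            && Math.abs(neighbour.mean - n.mean) <= maxDistance) {
          queue.add(neighbour);
        }
      }
    }
    return leaves;
  }
}
{code}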

<p>
  If the rules for what to include in a cluster are set very sensitively, one could train a classifier for each cluster
  to allow neighbouring clusters to join with the classifying cluster. And perhaps even connect clusters that seem to
  contain the same class even if the clusters are far away from each other. But that has to wait until we have
  classifiers...
</p>

h3. Persistency

<p>
  The tree is an abstract relational object storage for distances, vectors, clusters, etc., accessed via the primary keys
  of the nodes. Only a HashMap implementation is available.
</p>
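
A guess at the shape of that abstraction (interface and method names invented for illustration; the actual API in the patch will differ):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the storage described above: per-node data looked up
// by the node's primary key.
interface NodeStore<V> {
  void put(long nodeId, V value);
  V get(long nodeId);
  void remove(long nodeId);
}

// The only implementation mentioned so far: a plain in-memory HashMap.
class HashMapNodeStore<V> implements NodeStore<V> {
  private final Map<Long, V> map = new HashMap<Long, V>();
  public void put(long nodeId, V value) { map.put(nodeId, value); }
  public V get(long nodeId) { return map.get(nodeId); }
  public void remove(long nodeId) { map.remove(nodeId); }
}
{code}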

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes in a tree where branch nodes contain the mean features of, and the distance between, their children.
> For performance reasons I always trained trees from the top->down. I have been told that it can cause various effects I never encountered. And I believe Huffman solved his problem by training bottom->up? The thing is, I don't think it is possible to train the tree top->down using map reduce. I do however think it is possible to train it bottom->up. I would very much appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search of a cluster. 
> Navigating the tree and gathering nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to win from extracting all clusters from the start. Also, it usually makes sense to allow the user to modify the cluster boundary variables in real time using a slider, or perhaps present a named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not too bad to use secondary classification on the instances to create worm holes in the tree. I always thought it would be cool to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this" feature in search engines, using Tanimoto similarity on the vector spaces to calculate the distance.
> See LUCENE-1025 for a single threaded, all in memory proof of concept of a hierarchical clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.