Posted to commits@mahout.apache.org by co...@apache.org on 2012/12/01 01:43:01 UTC
[CONF] Apache Mahout > Minhash Clustering
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Minhash Clustering (https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering)
Edited by John Lee:
---------------------------------------------------------------------
Minhash clustering performs probabilistic dimension reduction of high-dimensional data. The essence of the technique is to hash each item with multiple independent hash functions so that similar items are more likely to collide than dissimilar ones. Multiple such hash tables can then be constructed to answer near-neighbor queries efficiently.
The algorithm is described in
Broder, Andrei Z. (1997), "On the resemblance and containment of documents"
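The hashing idea described above can be illustrated with a small, self-contained Python sketch. It is not Mahout's implementation: the function names, the use of salted BLAKE2b as a stand-in for a family of independent hash functions, and the default parameters are all assumptions made for illustration. Each item is treated as a non-empty set of string tokens, and the fraction of signature positions on which two items agree serves as an estimate of their Jaccard resemblance.

```python
import hashlib


def seeded_hash(seed, token):
    """Hypothetical stand-in for one hash function of an independent family.

    Salting BLAKE2b with the seed yields a different 64-bit hash per seed.
    """
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=20):
    """Return one minimum hash value per hash function.

    Assumes `tokens` is a non-empty iterable of strings.
    """
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot compute a minhash signature of an empty item")
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def bucket_keys(signature, keys_per_bucket=2):
    """Group consecutive minhash values into bucket keys.

    Items that share at least one bucket key become candidate neighbors.
    The group index is part of the key so that different groups never mix.
    """
    k = keys_per_bucket
    return {
        (i // k, tuple(signature[i:i + k]))
        for i in range(0, len(signature) - k + 1, k)
    }


def estimated_resemblance(sig_a, sig_b):
    """Fraction of positions on which two equal-length signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = {"apache", "mahout", "minhash", "clustering", "hash"}
    doc_b = {"apache", "mahout", "minhash", "clustering", "driver"}
    sig_a = minhash_signature(doc_a)
    sig_b = minhash_signature(doc_b)
    print("estimated resemblance:", estimated_resemblance(sig_a, sig_b))
    print("shared buckets:", len(bucket_keys(sig_a) & bucket_keys(sig_b)))
```

Using more hash functions tightens the resemblance estimate, while using more values per bucket key makes a bucket collision require closer agreement between the two items.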
There is a MinHashDriver class, which is exercised by the TestMinHashClustering unit test. It is not listed in the standard driver.props, but it can be run by specifying its fully qualified class name.
The