Posted to commits@mahout.apache.org by co...@apache.org on 2012/03/07 00:01:01 UTC

[CONF] Apache Mahout > Minhash Clustering

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Minhash Clustering (https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering)


Edited by Suneel Marthi:
---------------------------------------------------------------------
Minhash clustering performs probabilistic dimension reduction of high-dimensional data. The essence of the technique is to hash each item with multiple independent hash functions so that similar items are more likely to collide than dissimilar ones. Multiple such hash tables can then be built to answer near-neighbor queries efficiently.
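To make the idea concrete, here is a minimal Python sketch of min-hashing. It is not Mahout's implementation: the helper names (hash_with_seed, minhash_signature, estimated_similarity), the use of MD5 with a seed prefix as the hash family, and the sample documents are all assumptions made for illustration. For each hash function it keeps the smallest hash value over a document's tokens; two documents agree on a given position with probability roughly equal to the Jaccard similarity of their token sets.

```python
import hashlib


def hash_with_seed(token: str, seed: int) -> int:
    """One member of a family of hash functions, selected by `seed` (illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """For each hash function, keep the minimum hash value over the token set."""
    unique = set(tokens)
    return [min(hash_with_seed(t, seed) for t in unique) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two documents collide."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical documents: a and b share most words, c is unrelated.
doc_a = "oil prices rose sharply after the opec meeting".split()
doc_b = "oil prices fell sharply after the opec meeting".split()
doc_c = "wheat exports to the soviet union increased".split()

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_similarity(sig_a, sig_b))
print("a vs c:", estimated_similarity(sig_a, sig_c))
```

Documents whose signatures collide on the same positions can then be placed in the same bucket, which is the intuition behind using min-hash values as cluster keys.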

The MinHashDriver class is exercised by the TestMinHashClustering unit test. It is not listed in the standard driver.props file, but it can be run by specifying its fully qualified class name.

h4. Running MinHashDriver on the Reuters-21578 Collection

There are two ways of doing this:

h5. Run cluster-reuters.sh

# Run $MAHOUT_HOME/examples/bin/cluster-reuters.sh  (trunk only)
# Select the Minhash algorithm when prompted.

h5. Step By Step

h6. 1.  Download the Reuters-21578 dataset from [http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz] and extract it under the examples/reuters folder.


The Reuters-21578 collection contains 21578 documents in SGML format.  These must be converted to text files before the SequenceFiles and SparseVectors can be generated.

To convert the SGML files to text, we invoke the ExtractReuters utility that comes with Lucene. It creates text files from the SGML, each containing the Title, Date and Body.

h6. 2.   Run the Reuters extraction code from the examples directory as follows:

mvn \-e \-q exec:java
\-Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters"
\-Dexec.args="reuters/ reuters-extracted/"

h6. 3. Create SequenceFiles from the converted Reuters Text files

bin/mahout seqdirectory \-c UTF-8 \-i examples/reuters-extracted/ \-o reuters-seqfiles

This writes the Reuters documents into SequenceFiles.


h6. 4. Create SparseVectors from the SequenceFiles


bin/mahout seq2sparse \-i reuters-seqfiles/ \-ng 1 \-o reuters-vectors \-ow

The \-ow flag overwrites the output folder if it already exists.

The \-ng flag sets the maximum size of the n-grams generated from the collection of documents.
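As a rough illustration of what a maximum n-gram size means, the Python sketch below lists every n-gram up to a given size for a short token sequence. The function name ngrams_up_to and the sample tokens are hypothetical, and this is not how seq2sparse is implemented; Mahout may apply further filtering when selecting n-grams.

```python
def ngrams_up_to(tokens, max_size):
    """Return all contiguous n-grams of length 1..max_size, joined with spaces."""
    result = []
    for n in range(1, max_size + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


tokens = "crude oil prices".split()  # hypothetical tokenized text
print(ngrams_up_to(tokens, 1))  # unigrams only, as with a maximum size of 1
print(ngrams_up_to(tokens, 2))  # unigrams plus bigrams
```

With a maximum size of 1, as in the command above, only single terms are considered.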

h6. 5. Run the MinHashDriver on the generated SparseVectors

bin/mahout org.apache.mahout.clustering.minhash.MinHashDriver \--input reuters-vectors/tfidf-vectors/ \-o /minhash

The resulting output in /minhash/part-r-00000 looks something like this:

97618498-357680743      /reut2-006.sgm-25.txt
97618498-357680743      /reut2-007.sgm-660.txt
97618498-61898030       /reut2-015.sgm-697.txt
97618498-61898030       /reut2-014.sgm-99.txt
97618498-61898030       /reut2-009.sgm-705.txt
97618498-61898030       /reut2-000.sgm-495.txt
97618498-61898030       /reut2-009.sgm-732.txt
97618498-61898030       /reut2-010.sgm-473.txt
97618498-61898030       /reut2-000.sgm-15.txt
97618498-61898030       /reut2-009.sgm-872.txt
97618498-61898030       /reut2-010.sgm-547.txt
97618498-61898030       /reut2-006.sgm-366.txt
97618498-61898030       /reut2-002.sgm-53.txt
97618498-61898030       /reut2-000.sgm-569.txt
97618498-61898030       /reut2-019.sgm-366.txt
97618498-61898030       /reut2-003.sgm-540.txt
97618498-61898030       /reut2-019.sgm-154.txt
97618498-61898030       /reut2-004.sgm-372.txt
97618498-61898030       /reut2-000.sgm-3.txt
97618498-61898030       /reut2-002.sgm-935.txt
97618498-61898030       /reut2-013.sgm-567.txt
97618498-61898030       /reut2-004.sgm-938.txt
97618498-61898030       /reut2-004.sgm-620.txt
97618498-92898924       /reut2-018.sgm-316.txt
97618498-92898924       /reut2-007.sgm-976.txt
97618498-92898924       /reut2-003.sgm-796.txt
97618498-92898924       /reut2-006.sgm-176.txt
97618498-92898924       /reut2-004.sgm-290.txt
97618498-92898924       /reut2-004.sgm-248.txt

The first column is the <Cluster-Id> and the second column is <reuters-text-filename>.
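To inspect the clusters, the two columns can be grouped by cluster id. The Python sketch below assumes the output is available as plain text with whitespace-separated columns, as in the listing above; the function name group_by_cluster is hypothetical and the sample lines are copied from that listing.

```python
from collections import defaultdict

# A few lines taken from the listing above.
sample_output = """\
97618498-357680743      /reut2-006.sgm-25.txt
97618498-357680743      /reut2-007.sgm-660.txt
97618498-61898030       /reut2-015.sgm-697.txt
97618498-92898924       /reut2-018.sgm-316.txt
"""


def group_by_cluster(lines):
    """Map each cluster id to the list of file names assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        cluster_id, filename = parts
        clusters[cluster_id].append(filename)
    return dict(clusters)


clusters = group_by_cluster(sample_output.splitlines())
for cluster_id, files in clusters.items():
    print(cluster_id, len(files), files)
```

Sorting the clusters by size is then a quick way to find the largest groups of similar documents.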
