Posted to dev@mahout.apache.org by "Cristi Prodan (JIRA)" <ji...@apache.org> on 2010/08/02 09:14:16 UTC
[jira] Commented: (MAHOUT-344) Minhash based clustering

    [ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894476#action_12894476 ] 

Cristi Prodan commented on MAHOUT-344:
--------------------------------------

Sorry for the long delay - I had to finish my dissertation and a few other things. Anyway, here is what I have managed to do so far, which I am submitting as a patch:

- Converted the existing code to work with Hadoop 0.20+
- Converted the input of the algorithm to RandomAccessSparseVector
- Converted the LastFM db into RandomAccessSparseVector format
- Added command options using the DefaultOptionCreator mechanism

The MinHash clustering algorithm can be run with a configuration like this:

-i(--input) lastfm/med_db_seq 
-o(--output) lastfm/med_db_clusters 
-mcs(--minClusterSize) 5 
-nh(--numHashFunctions) 2 
-kg(--keyGroups) 2  
-ow(--overwriteOutput)
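
The patch itself is not reproduced in this message, so the following is only a guess at how --numHashFunctions and --keyGroups could interact: each point gets one min-hash value per hash function, and every run of keyGroups consecutive values (wrapping around) is joined into one cluster key. The class name ClusterKeySketch, the method clusterKeys and the "-" separator are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT the code from the MAHOUT-344 patch.
final class ClusterKeySketch {

    // Builds one candidate cluster key per hash function by joining
    // keyGroups consecutive min-hash values, wrapping around at the end.
    static List<String> clusterKeys(long[] minHashes, int keyGroups) {
        int n = minHashes.length; // plays the role of --numHashFunctions
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            StringBuilder key = new StringBuilder();
            for (int j = 0; j < keyGroups; j++) {
                if (j > 0) {
                    key.append('-'); // separator is an arbitrary choice
                }
                key.append(minHashes[(i + j) % n]);
            }
            keys.add(key.toString());
        }
        return keys;
    }
}
```

Under this reading, two points fall into the same cluster only if they agree on all keyGroups values of some key, so a larger keyGroups makes clusters stricter, while more hash functions give each point more chances to match. Clusters with fewer members than --minClusterSize would then be dropped.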

Evaluating the clustering results for the above configuration with the metric suggested by Ankur yields a value of 0.20303965982542901, which is not too good IMO. I will run more tests with other parameters and see what happens.

The next steps are:
1. Write tests for the current code (doing this now).
2. Refactor the code so that it uses "points" and "vectors" instead of "items" and "users". The algorithm will also cluster text files, in order to find very similar files.
3. Write some documentation on how to use the algorithm.
4. Investigate a more general format for the algorithm's output (Vectors or something like that).

If you have any suggestions I would very much like to hear them. 



> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high-dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that similar items have a higher probability of colliding. Multiple such hash tables can then be constructed to answer near-neighbor queries efficiently.
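
To make the description above concrete, here is a minimal, self-contained sketch of min-hash signatures. It is not the code from the attached patches: the class name MinHashSketch, the linear hash family h(x) = (a*x + b) mod P and the seed are all assumptions chosen for illustration. For an ideal random hash, the probability that two sets share the same minimum equals their Jaccard similarity, so the fraction of matching signature positions serves as an estimate of it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only, NOT the implementation in the MAHOUT-344 patch.
final class MinHashSketch {
    // Prime modulus for the assumed hash family h(x) = (a*x + b) mod P.
    private static final long PRIME = 2147483647L; // 2^31 - 1

    private final long[] a;
    private final long[] b;

    MinHashSketch(int numHashFunctions, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashFunctions];
        b = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // in [1, P-1]
            b[i] = rnd.nextInt(Integer.MAX_VALUE);         // in [0, P-1]
        }
    }

    // For each hash function, keep the smallest hash over all items of the set.
    long[] signature(Set<Integer> items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int item : items) {
            long x = Math.floorMod((long) item, PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME; // fits in a long: below 2^62 + 2^31
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of positions where the two signatures agree.
    static double estimateSimilarity(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                matches++;
            }
        }
        return matches / (double) s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(64, 42L);
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 7, 8));
        System.out.println(estimateSimilarity(mh.signature(u1), mh.signature(u2)));
    }
}
```

The two example sets share 4 of 8 distinct items, so the true Jaccard similarity is 0.5; the printed estimate depends on the seed and gets closer to that value, on average, as the number of hash functions grows.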
