Posted to user@mahout.apache.org by Varun Thacker <va...@gmail.com> on 2011/10/19 17:38:20 UTC

MinHash Clustering in Mahout

I was trying to run the MinHash algorithm on the Reuters data set, so I did
the following before running MinHashDriver:

   - Get the Reuters dataset
   - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
     reuters-out from reuters-sgm (the downloaded archive)
   - Run seqdirectory to convert reuters-out to SequenceFile format
   - Run seq2sparse to convert SequenceFiles to sparse vector format

I used these instructions from the K-means clustering wiki page.

This is the command I used to run MinHashDriver:

./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash

The output file looks something like this:

106460162-207863047 /reut2-015.sgm-653.txt
106460162-207863047 /reut2-021.sgm-7.txt
106460162-207863047 /reut2-013.sgm-307.txt
106460162-207863047 /reut2-013.sgm-306.txt
106460162-207863047 /reut2-014.sgm-786.txt
106460162-207863047 /reut2-013.sgm-304.txt
106460162-207863047 /reut2-013.sgm-303.txt
106460162-207863047 /reut2-021.sgm-230.txt
106460162-207863047 /reut2-012.sgm-548.txt
106460162-207863047 /reut2-020.sgm-161.txt
106460162-207863047 /reut2-021.sgm-553.txt
106460162-207863047 /reut2-013.sgm-299.txt
106460162-207863047 /reut2-015.sgm-284.txt
106460162-207863047 /reut2-013.sgm-996.txt
106460162-207863047 /reut2-021.sgm-441.txt
106460162-207863047 /reut2-013.sgm-298.txt
106460162-207863047 /reut2-013.sgm-995.txt
106460162-207863047 /reut2-015.sgm-521.txt
106460162-207863047 /reut2-020.sgm-162.txt
106460162-207863047 /reut2-020.sgm-163.txt
106460162-207863047 /reut2-013.sgm-296.txt
...
...


Is this the correct way of running MinHash?

If yes, then I will update the wiki page
https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
the instructions.

Otherwise, could someone tell me what I am doing wrong?

-- 
Regards,
Varun Thacker
http://varunthacker.wordpress.com

Re: MinHash Clustering in Mahout

Posted by Grant Ingersoll <gs...@apache.org>.

On Nov 28, 2011, at 9:50 PM, Suneel Marthi wrote:

> On the same note, one other comment on the present Mahout MinHashDriver implementation:-
> 
> Once all the clusters have been built, we don't calculate the cluster precision - 'Jaccard Distance or Coefficient' anywhere in the MinHashDriver code. 

I think this is what the key groups are trying to get at, but it isn't quite the same.  I don't know that the current implementation is broken; I think it simply doesn't fully implement the Broder paper you cite.  Perhaps the easiest thing to do is simply to implement the Broder version.  There's no reason why we can't have two different approaches.

> 
> 
> There are 2 methods 'testPrecision' and 'computeSimilarity' in LastFmClusterEvaluator under examples, but this is only confined to the examples. 
> 
> 
> It may be a good idea to add 'JaccardDistance' measure to the existing Distance measures in Mahout (unless there was a reason for not having it in the first place).

TanimotoDistanceMeasure is the Jaccard Distance.
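
To make that equivalence concrete, here is a minimal Python sketch. It is an illustration only, not the source of Mahout's TanimotoDistanceMeasure: it assumes the usual Tanimoto formula, 1 - a.b / (|a|^2 + |b|^2 - a.b), and the function names and example term ids are made up. On 0/1 vectors this formula reduces to the Jaccard distance 1 - |A & B| / |A | B|.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def tanimoto_distance(u, v):
    """Tanimoto distance between sparse vectors given as {index: weight} dicts."""
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    denom = sum(w * w for w in u.values()) + sum(w * w for w in v.values()) - dot
    if denom == 0:
        return 0.0  # both vectors are all zeros
    return 1.0 - dot / denom


# Hypothetical term ids, used only to exercise the two functions.
doc_a = {1, 2, 3, 4}
doc_b = {3, 4, 5}
binary_a = {i: 1.0 for i in doc_a}
binary_b = {i: 1.0 for i in doc_b}

print(jaccard_distance(doc_a, doc_b))
print(tanimoto_distance(binary_a, binary_b))
```

Both calls give the same distance for this binary input. With tf-idf weights instead of 0/1 values the two measures generally differ, so which one is wanted depends on whether term weights should count.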


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

Posted by Suneel Marthi <su...@yahoo.com>.

On the same note, one other comment on the present Mahout MinHashDriver implementation:

Once all the clusters have been built, we don't calculate the cluster precision (the Jaccard distance or coefficient) anywhere in the MinHashDriver code.

There are two methods, 'testPrecision' and 'computeSimilarity', in LastFmClusterEvaluator under examples, but they are confined to the examples.

It may be a good idea to add a 'JaccardDistance' measure to the existing distance measures in Mahout (unless there was a reason for not having it in the first place).
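
As a rough illustration of the kind of check meant here, the following Python sketch scores one cluster by the fraction of member pairs whose Jaccard coefficient reaches a threshold. This is not the LastFmClusterEvaluator code: the definition of precision, the function names, and the sample cluster are all assumptions made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_precision(members, threshold):
    """Fraction of member pairs whose Jaccard coefficient is >= threshold."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # fewer than two members: no pair can disagree
    good = sum(1 for a, b in pairs if jaccard(a, b) >= threshold)
    return good / len(pairs)


# Hypothetical cluster: each member is the set of term ids in one document.
cluster = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}]
print(cluster_precision(cluster, threshold=0.5))
```

Only one of the three pairs in this made-up cluster is similar enough, so the score is low, which is the sort of signal that would flag a cluster of unrelated documents.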



________________________________
 From: Grant Ingersoll <gs...@apache.org>
To: user@mahout.apache.org; Suneel Marthi <su...@yahoo.com> 
Sent: Monday, November 28, 2011 11:42 PM
Subject: Re: MinHash Clustering in Mahout
 

On Oct 26, 2011, at 8:51 AM, Suneel Marthi wrote:

> I am still trying to fully understand minHash algorithm and I had the same results like below when running the MinHashDriver.
> 
> I have a use case wherein I need to determine the content similarity of 2 documents like what's been described in Andrei Broder's paper - 'Identifying and Filtering Near-Duplicate Documents' (http://dl.acm.org/citation.cfm?id=736184).  

I've only skimmed this paper, but I get the sense that we are not fully implementing what is in the paper.  I've asked the author of the patch for a citation on the implementation approach, but they haven't responded.  

It would be nice to resolve this, as I tend to agree w/ your assessment that there is something not quite right.  At a minimum, it seems one needs a higher number of key groups.


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 26, 2011, at 8:51 AM, Suneel Marthi wrote:

> I am still trying to fully understand minHash algorithm and I had the same results like below when running the MinHashDriver.
> 
> I have a use case wherein I need to determine the content similarity of 2 documents like what's been described in Andrei Broder's paper - 'Identifying and Filtering Near-Duplicate Documents' (http://dl.acm.org/citation.cfm?id=736184).  

I've only skimmed this paper, but I get the sense that we are not fully implementing what is in the paper.  I've asked the author of the patch for a citation on the implementation approach, but they haven't responded.  

It would be nice to resolve this, as I tend to agree w/ your assessment that there is something not quite right.  At a minimum, it seems one needs a higher number of key groups.
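
For anyone following along, here is a small Python sketch of textbook MinHash with key groups. It is not Mahout's MinHashDriver code. It assumes that a cluster key is formed by joining a run of consecutive min-hash values (which the two-part ids in the output above suggest), and the hashing scheme, function names, and sample sets are invented for illustration.

```python
import hashlib


def hash_item(seed, item):
    """Deterministic 64-bit hash of `item` under hash function number `seed`."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes):
    """One minimum per hash function; `items` must be non-empty."""
    return [min(hash_item(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def cluster_keys(signature, key_groups):
    """Join each run of `key_groups` consecutive minima into one cluster key."""
    return [
        "-".join(str(v) for v in signature[i:i + key_groups])
        for i in range(0, len(signature) - key_groups + 1, key_groups)
    ]


def share_a_cluster(sig_a, sig_b, key_groups):
    """True if the two signatures produce the same key in any aligned group."""
    keys_a = cluster_keys(sig_a, key_groups)
    keys_b = cluster_keys(sig_b, key_groups)
    return any(x == y for x, y in zip(keys_a, keys_b))


# Hypothetical documents as sets of term ids; true Jaccard is 90 / 110.
doc_a = set(range(0, 100))
doc_b = set(range(10, 110))
sig_a = minhash_signature(doc_a, 200)
sig_b = minhash_signature(doc_b, 200)
print(estimated_jaccard(sig_a, sig_b))
print(share_a_cluster(sig_a, sig_b, key_groups=2))
```

Under this scheme, two documents with Jaccard coefficient s agree on all k minima of one group with probability of roughly s^k, so a larger number of key groups makes it less likely that dissimilar documents land in the same cluster.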


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 26, 2011, at 10:51 AM, Suneel Marthi wrote:

> I am still trying to fully understand minHash algorithm and I had the same results like below when running the MinHashDriver.
> 
> I have a use case wherein I need to determine the content similarity of 2 documents like what's been described in Andrei Broder's paper - 'Identifying and Filtering Near-Duplicate Documents' (http://dl.acm.org/citation.cfm?id=736184).  
> 
> I started dissecting the clusters generated by Mahout's MinHashDriver to compare document content equality and to determine how accurate the clustering was?
> I do see that the first 2 files from the output below were put in the same cluster 106460162-207863047; thought the actual text content in both the files is different.  How?

What do the vectors for these look like?  If I remember correctly, some of those files don't have much actual text in them, so I wonder whether they are more or less empty.  Running now to check.


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

Posted by Grant Ingersoll <gs...@apache.org>.

I modified VectorDump (on trunk) to take a --filter option so that one can print out specific named vectors.  Here are the vectors for the two items you mention, after running them through seq2sparse as configured in cluster-reuters.sh:

/reut2-015.sgm-653.txt:{34519:5.532691955566406,3950:4.322702884674072,33805:5.982217311859131,39687:2.6266393661499023,1982:2.714293956756592,7424:4.58083438873291,8110:9.033519744873047,19509:8.782204627990723,14143:2.336308717727661,24982:4.279929161071777,12254:3.6424925327301025,16280:4.5305399894714355}
/reut2-021.sgm-7.txt:{3730:3.051023483276367,7391:7.888387203216553,36570:9.880817413330078,30839:3.5672693252563477,27570:4.24602746963501,20512:3.8622241020202637,20510:6.3944621086120605,5636:9.593134880065918,28018:6.200305938720703,32492:4.762823581695557,5703:3.9078562259674072,2962:2.807265043258667,41625:2.2218031883239746}
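
A quick way to compare the two dumps is to parse them and intersect their term indices. The Python sketch below does that; it assumes the `name:{index:weight,...}` layout shown above, and the parsing helper is written just for this example (it is not part of VectorDump).

```python
LINE_A = (
    "/reut2-015.sgm-653.txt:{34519:5.532691955566406,3950:4.322702884674072,"
    "33805:5.982217311859131,39687:2.6266393661499023,1982:2.714293956756592,"
    "7424:4.58083438873291,8110:9.033519744873047,19509:8.782204627990723,"
    "14143:2.336308717727661,24982:4.279929161071777,12254:3.6424925327301025,"
    "16280:4.5305399894714355}"
)
LINE_B = (
    "/reut2-021.sgm-7.txt:{3730:3.051023483276367,7391:7.888387203216553,"
    "36570:9.880817413330078,30839:3.5672693252563477,27570:4.24602746963501,"
    "20512:3.8622241020202637,20510:6.3944621086120605,5636:9.593134880065918,"
    "28018:6.200305938720703,32492:4.762823581695557,5703:3.9078562259674072,"
    "2962:2.807265043258667,41625:2.2218031883239746}"
)


def parse_vector(line):
    """Split 'name:{idx:weight,...}' into (name, {idx: weight})."""
    name, _, body = line.partition(":{")
    vector = {}
    for entry in body.rstrip("}").split(","):
        index, weight = entry.split(":")
        vector[int(index)] = float(weight)
    return name, vector


name_a, vec_a = parse_vector(LINE_A)
name_b, vec_b = parse_vector(LINE_B)

shared = set(vec_a) & set(vec_b)   # term indices present in both documents
union = set(vec_a) | set(vec_b)

print(name_a, len(vec_a), "terms")
print(name_b, len(vec_b), "terms")
print("shared term indices:", sorted(shared))
print("Jaccard coefficient of term sets:", len(shared) / len(union))
```

The two vectors have 12 and 13 entries and no term index in common, so the Jaccard coefficient of their term sets is 0, even though both documents were assigned to cluster 106460162-207863047.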




--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: MinHash Clustering in Mahout

Posted by Suneel Marthi <su...@yahoo.com>.

I am still trying to fully understand the MinHash algorithm, and I got the same results as below when running MinHashDriver.

I have a use case wherein I need to determine the content similarity of two documents, as described in Andrei Broder's paper 'Identifying and Filtering Near-Duplicate Documents' (http://dl.acm.org/citation.cfm?id=736184).

I started dissecting the clusters generated by Mahout's MinHashDriver to compare document content and to determine how accurate the clustering was.
I do see that the first two files from the output below were put in the same cluster, 106460162-207863047, though the actual text content of the two files is different.  How?

I am assuming that the NGram attribute was set to the default value of 1 when creating the tf-idf vectors from sequence files.

Suneel



________________________________
From: Grant Ingersoll <gs...@apache.org>
To: user@mahout.apache.org
Sent: Tuesday, October 25, 2011 5:55 AM
Subject: Re: MinHash Clustering in Mahout


On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote:

> I was trying to run the MinHash algorithm on the Reuters data set, so I did
> the following before running MinHashDriver
> 
>   - Get the Reuters dataset
>   - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
>   reuters-out from reuters-sgm(the downloaded archive)
>   - Run seqdirectory to convert reuters-out to SequenceFile format
>   - Run seq2sparse to convert SequenceFiles to sparse vector format
> 
> I used these instructions from the K-means clustering wiki page.
> 
> This is the command I used to run MinHashDriver
> 
> ./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input
> /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash
> 
> The output file looks something like this:
> 
> 106460162-207863047 /reut2-015.sgm-653.txt
> 106460162-207863047 /reut2-021.sgm-7.txt
> 106460162-207863047 /reut2-013.sgm-307.txt
> 106460162-207863047 /reut2-013.sgm-306.txt
> 106460162-207863047 /reut2-014.sgm-786.txt
> 106460162-207863047 /reut2-013.sgm-304.txt
> 106460162-207863047 /reut2-013.sgm-303.txt
> 106460162-207863047 /reut2-021.sgm-230.txt
> 106460162-207863047 /reut2-012.sgm-548.txt
> 106460162-207863047 /reut2-020.sgm-161.txt
> 106460162-207863047 /reut2-021.sgm-553.txt
> 106460162-207863047 /reut2-013.sgm-299.txt
> 106460162-207863047 /reut2-015.sgm-284.txt
> 106460162-207863047 /reut2-013.sgm-996.txt
> 106460162-207863047 /reut2-021.sgm-441.txt
> 106460162-207863047 /reut2-013.sgm-298.txt
> 106460162-207863047 /reut2-013.sgm-995.txt
> 106460162-207863047 /reut2-015.sgm-521.txt
> 106460162-207863047 /reut2-020.sgm-162.txt
> 106460162-207863047 /reut2-020.sgm-163.txt
> 106460162-207863047 /reut2-013.sgm-296.txt
> ...
> ...
> 
> 
> Is this the correct way of running MinHash.
> 
> If yes then I would update the wiki page
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
> the instructions.
> 
> Otherwise if someone could tell me on what am I doing wrong.

I haven't looked into the code, but I get similar outputs, so I assume it is working.  Might be good to incorporate this into the build-reuters.sh as well as try it on some other input.

-Grant

Re: MinHash Clustering in Mahout

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote:

> I was trying to run the MinHash algorithm on the Reuters data set, so I did
> the following before running MinHashDriver
> 
>   - Get the Reuters dataset
>   - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
>   reuters-out from reuters-sgm(the downloaded archive)
>   - Run seqdirectory to convert reuters-out to SequenceFile format
>   - Run seq2sparse to convert SequenceFiles to sparse vector format
> 
> I used these instructions from the K-means clustering wiki page.
> 
> This is the command I used to run MinHashDriver
> 
> ./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input
> /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash
> 
> The output file looks something like this:
> 
> 106460162-207863047 /reut2-015.sgm-653.txt
> 106460162-207863047 /reut2-021.sgm-7.txt
> 106460162-207863047 /reut2-013.sgm-307.txt
> 106460162-207863047 /reut2-013.sgm-306.txt
> 106460162-207863047 /reut2-014.sgm-786.txt
> 106460162-207863047 /reut2-013.sgm-304.txt
> 106460162-207863047 /reut2-013.sgm-303.txt
> 106460162-207863047 /reut2-021.sgm-230.txt
> 106460162-207863047 /reut2-012.sgm-548.txt
> 106460162-207863047 /reut2-020.sgm-161.txt
> 106460162-207863047 /reut2-021.sgm-553.txt
> 106460162-207863047 /reut2-013.sgm-299.txt
> 106460162-207863047 /reut2-015.sgm-284.txt
> 106460162-207863047 /reut2-013.sgm-996.txt
> 106460162-207863047 /reut2-021.sgm-441.txt
> 106460162-207863047 /reut2-013.sgm-298.txt
> 106460162-207863047 /reut2-013.sgm-995.txt
> 106460162-207863047 /reut2-015.sgm-521.txt
> 106460162-207863047 /reut2-020.sgm-162.txt
> 106460162-207863047 /reut2-020.sgm-163.txt
> 106460162-207863047 /reut2-013.sgm-296.txt
> ...
> ...
> 
> 
> Is this the correct way of running MinHash.
> 
> If yes then I would update the wiki page
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
> the instructions.
> 
> Otherwise if someone could tell me on what am I doing wrong.

I haven't looked into the code, but I get similar outputs, so I assume it is working.  Might be good to incorporate this into the build-reuters.sh as well as try it on some other input.

-Grant