You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2012/03/05 15:13:34 UTC

Re: Minhash review

I am also curious about the current MinHash implementation. In the
current implementation the vector TF or TF-IDF weights are hashed via
Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
the mailinglist that this is incorrect and the index should be hashed
because the index identifies an N-gram in the dictionary.

However in this blog

http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html

hashing is done directly on the N-gram itself.

How is this algorithm supposed to work? Thoughts?

On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Lance,
>
> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>
>
>
> ________________________________
>  From: Lance Norskog <go...@gmail.com>
> To: dev@mahout.apache.org
> Sent: Monday, January 16, 2012 8:46 PM
> Subject: Re: Minhash review
>
> Minhash works better and better with the more dimensions you throw at
> it, right? All of the Display classes use two dimensions. Would
> MinHash more sense if it uses a few hundred dimensions and then
> collapse down to two? Maybe with SVD?
>
> Are there other clustering algorithms that have this problem?
>
> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>
>> I think before we do 0.6 it could use a few eyeballs.
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com

Re: Minhash review

Posted by Frank Scholten <fr...@frankscholten.nl>.

Hi Suneel,

I was curious about the status of the updated MinHash implementation
you were planning to work on. If you have something that's workable
perhaps an idea to create a Jira ticket?

Cheers,

Frank

On Thu, Mar 8, 2012 at 1:44 PM, Suneel Marthi <su...@yahoo.com> wrote:
> That's correct.
>
>
>
> ________________________________
>  From: Frank Scholten <fr...@frankscholten.nl>
> To: dev@mahout.apache.org
> Sent: Thursday, March 8, 2012 4:17 AM
> Subject: Re: Minhash review
>
> I agree with Grant that it's good to first get a working
> implementation that matches the paper. Later on we can work on other
> approaches.
>
> So if I understand correctly the vectorization step can be skipped and
> we can run SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver
>
> Correct?
>
> On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <su...@yahoo.com> wrote:
>> Frank,
>>
>> I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.
>>
>> I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).
>>
>> I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?
>>
>> Regards.
>>
>>
>>
>> ________________________________
>>  From: Frank Scholten <fr...@frankscholten.nl>
>> To: dev@mahout.apache.org
>> Sent: Monday, March 5, 2012 9:13 AM
>> Subject: Re: Minhash review
>>
>> I am also curious about the current MinHash implementation. In the
>> current implementation the vector TF or TF-IDF weights are hashed via
>> Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
>> the mailinglist that this is incorrect and the index should be hashed
>> because the index identifies an N-gram in the dictionary.
>>
>> However in this blog
>>
>> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html
>>
>> hashing is done directly on the N-gram itself.
>>
>> How is this algorithm supposed to work? Thoughts?
>>
>> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
>>> Lance,
>>>
>>> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>>>
>>>
>>>
>>> ________________________________
>>>  From: Lance Norskog <go...@gmail.com>
>>> To: dev@mahout.apache.org
>>> Sent: Monday, January 16, 2012 8:46 PM
>>> Subject: Re: Minhash review
>>>
>>> Minhash works better and better with the more dimensions you throw at
>>> it, right? All of the Display classes use two dimensions. Would
>>> MinHash more sense if it uses a few hundred dimensions and then
>>> collapse down to two? Maybe with SVD?
>>>
>>> Are there other clustering algorithms that have this problem?
>>>
>>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>>>
>>>> I think before we do 0.6 it could use a few eyeballs.
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com

Re: Minhash review

Posted by Suneel Marthi <su...@yahoo.com>.

That's correct.



________________________________
 From: Frank Scholten <fr...@frankscholten.nl>
To: dev@mahout.apache.org 
Sent: Thursday, March 8, 2012 4:17 AM
Subject: Re: Minhash review
 
I agree with Grant that it's good to first get a working
implementation that matches the paper. Later on we can work on other
approaches.

So if I understand correctly the vectorization step can be skipped and
we can run SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver

Correct?

On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Frank,
>
> I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.
>
> I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).
>
> I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?
>
> Regards.
>
>
>
> ________________________________
>  From: Frank Scholten <fr...@frankscholten.nl>
> To: dev@mahout.apache.org
> Sent: Monday, March 5, 2012 9:13 AM
> Subject: Re: Minhash review
>
> I am also curious about the current MinHash implementation. In the
> current implementation the vector TF or TF-IDF weights are hashed via
> Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
> the mailinglist that this is incorrect and the index should be hashed
> because the index identifies an N-gram in the dictionary.
>
> However in this blog
>
> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html
>
> hashing is done directly on the N-gram itself.
>
> How is this algorithm supposed to work? Thoughts?
>
> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
>> Lance,
>>
>> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>>
>>
>>
>> ________________________________
>>  From: Lance Norskog <go...@gmail.com>
>> To: dev@mahout.apache.org
>> Sent: Monday, January 16, 2012 8:46 PM
>> Subject: Re: Minhash review
>>
>> Minhash works better and better with the more dimensions you throw at
>> it, right? All of the Display classes use two dimensions. Would
>> MinHash more sense if it uses a few hundred dimensions and then
>> collapse down to two? Maybe with SVD?
>>
>> Are there other clustering algorithms that have this problem?
>>
>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>>
>>> I think before we do 0.6 it could use a few eyeballs.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com

Re: Minhash review

Posted by Frank Scholten <fr...@frankscholten.nl>.

I agree with Grant that it's good to first get a working
implementation that matches the paper. Later on we can work on other
approaches.

So if I understand correctly the vectorization step can be skipped and
we can run SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver

Correct?

On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Frank,
>
> I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.
>
> I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).
>
> I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?
>
> Regards.
>
>
>
> ________________________________
>  From: Frank Scholten <fr...@frankscholten.nl>
> To: dev@mahout.apache.org
> Sent: Monday, March 5, 2012 9:13 AM
> Subject: Re: Minhash review
>
> I am also curious about the current MinHash implementation. In the
> current implementation the vector TF or TF-IDF weights are hashed via
> Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
> the mailinglist that this is incorrect and the index should be hashed
> because the index identifies an N-gram in the dictionary.
>
> However in this blog
>
> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html
>
> hashing is done directly on the N-gram itself.
>
> How is this algorithm supposed to work? Thoughts?
>
> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
>> Lance,
>>
>> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>>
>>
>>
>> ________________________________
>>  From: Lance Norskog <go...@gmail.com>
>> To: dev@mahout.apache.org
>> Sent: Monday, January 16, 2012 8:46 PM
>> Subject: Re: Minhash review
>>
>> Minhash works better and better with the more dimensions you throw at
>> it, right? All of the Display classes use two dimensions. Would
>> MinHash more sense if it uses a few hundred dimensions and then
>> collapse down to two? Maybe with SVD?
>>
>> Are there other clustering algorithms that have this problem?
>>
>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>>
>>> I think before we do 0.6 it could use a few eyeballs.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com

Re: Minhash review

Posted by Suneel Marthi <su...@yahoo.com>.

Frank,

I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.

I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).

I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?

Regards.

________________________________
 From: Frank Scholten <fr...@frankscholten.nl>
To: dev@mahout.apache.org 
Sent: Monday, March 5, 2012 9:13 AM
Subject: Re: Minhash review

I am also curious about the current MinHash implementation. In the
current implementation the vector TF or TF-IDF weights are hashed via
Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
the mailinglist that this is incorrect and the index should be hashed
because the index identifies an N-gram in the dictionary.

However in this blog

http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html

hashing is done directly on the N-gram itself.

How is this algorithm supposed to work? Thoughts?

On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Lance,
>
> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>
>
>
> ________________________________
>  From: Lance Norskog <go...@gmail.com>
> To: dev@mahout.apache.org
> Sent: Monday, January 16, 2012 8:46 PM
> Subject: Re: Minhash review
>
> Minhash works better and better with the more dimensions you throw at
> it, right? All of the Display classes use two dimensions. Would
> MinHash more sense if it uses a few hundred dimensions and then
> collapse down to two? Maybe with SVD?
>
> Are there other clustering algorithms that have this problem?
>
> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>
>> I think before we do 0.6 it could use a few eyeballs.
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com