You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2012/01/13 14:53:04 UTC

Minhash review

I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.

I think before we do 0.6 it could use a few eyeballs.

Re: Minhash review

Posted by Frank Scholten <fr...@frankscholten.nl>.

Hi Suneel,

I was curious about the status of the updated MinHash implementation
you were planning to work on. If you have something that's workable
perhaps an idea to create a Jira ticket?

Cheers,

Frank

On Thu, Mar 8, 2012 at 1:44 PM, Suneel Marthi <su...@yahoo.com> wrote:
> That's correct.
>
>
>
> ________________________________
>  From: Frank Scholten <fr...@frankscholten.nl>
> To: dev@mahout.apache.org
> Sent: Thursday, March 8, 2012 4:17 AM
> Subject: Re: Minhash review
>
> I agree with Grant that it's good to first get a working
> implementation that matches the paper. Later on we can work on other
> approaches.
>
> So if I understand correctly the vectorization step can be skipped and
> we can run SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver
>
> Correct?
>
> On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <su...@yahoo.com> wrote:
>> Frank,
>>
>> I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.
>>
>> I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).
>>
>> I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?
>>
>> Regards.
>>
>>
>>
>> ________________________________
>>  From: Frank Scholten <fr...@frankscholten.nl>
>> To: dev@mahout.apache.org
>> Sent: Monday, March 5, 2012 9:13 AM
>> Subject: Re: Minhash review
>>
>> I am also curious about the current MinHash implementation. In the
>> current implementation the vector TF or TF-IDF weights are hashed via
>> Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
>> the mailinglist that this is incorrect and the index should be hashed
>> because the index identifies an N-gram in the dictionary.
>>
>> However in this blog
>>
>> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html
>>
>> hashing is done directly on the N-gram itself.
>>
>> How is this algorithm supposed to work? Thoughts?
>>
>> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
>>> Lance,
>>>
>>> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>>>
>>>
>>>
>>> ________________________________
>>>  From: Lance Norskog <go...@gmail.com>
>>> To: dev@mahout.apache.org
>>> Sent: Monday, January 16, 2012 8:46 PM
>>> Subject: Re: Minhash review
>>>
>>> Minhash works better and better with the more dimensions you throw at
>>> it, right? All of the Display classes use two dimensions. Would
>>> MinHash more sense if it uses a few hundred dimensions and then
>>> collapse down to two? Maybe with SVD?
>>>
>>> Are there other clustering algorithms that have this problem?
>>>
>>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>>>
>>>> I think before we do 0.6 it could use a few eyeballs.
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com

Re: Minhash review

Posted by Suneel Marthi <su...@yahoo.com>.

That's correct.



________________________________
 From: Frank Scholten <fr...@frankscholten.nl>
To: dev@mahout.apache.org 
Sent: Thursday, March 8, 2012 4:17 AM
Subject: Re: Minhash review
 
I agree with Grant that it's good to first get a working
implementation that matches the paper. Later on we can work on other
approaches.

So if I understand correctly the vectorization step can be skipped and
we can run SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver

Correct?

On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Frank,
>
> I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.
>
> I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).
>
> I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?
>
> Regards.
>
>
>
> ________________________________
>  From: Frank Scholten <fr...@frankscholten.nl>
> To: dev@mahout.apache.org
> Sent: Monday, March 5, 2012 9:13 AM
> Subject: Re: Minhash review
>
> I am also curious about the current MinHash implementation. In the
> current implementation the vector TF or TF-IDF weights are hashed via
> Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
> the mailinglist that this is incorrect and the index should be hashed
> because the index identifies an N-gram in the dictionary.
>
> However in this blog
>
> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html
>
> hashing is done directly on the N-gram itself.
>
> How is this algorithm supposed to work? Thoughts?
>
> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
>> Lance,
>>
>> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>>
>>
>>
>> ________________________________
>>  From: Lance Norskog <go...@gmail.com>
>> To: dev@mahout.apache.org
>> Sent: Monday, January 16, 2012 8:46 PM
>> Subject: Re: Minhash review
>>
>> Minhash works better and better with the more dimensions you throw at
>> it, right? All of the Display classes use two dimensions. Would
>> MinHash more sense if it uses a few hundred dimensions and then
>> collapse down to two? Maybe with SVD?
>>
>> Are there other clustering algorithms that have this problem?
>>
>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>>
>>> I think before we do 0.6 it could use a few eyeballs.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com

Re: Minhash review

Posted by Frank Scholten <fr...@frankscholten.nl>.

I agree with Grant that it's good to first get a working
implementation that matches the paper. Later on we can work on other
approaches.

So if I understand correctly the vectorization step can be skipped and
we can run SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver

Correct?

On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Frank,
>
> I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.
>
> I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).
>
> I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?
>
> Regards.
>
>
>
> ________________________________
>  From: Frank Scholten <fr...@frankscholten.nl>
> To: dev@mahout.apache.org
> Sent: Monday, March 5, 2012 9:13 AM
> Subject: Re: Minhash review
>
> I am also curious about the current MinHash implementation. In the
> current implementation the vector TF or TF-IDF weights are hashed via
> Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
> the mailinglist that this is incorrect and the index should be hashed
> because the index identifies an N-gram in the dictionary.
>
> However in this blog
>
> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html
>
> hashing is done directly on the N-gram itself.
>
> How is this algorithm supposed to work? Thoughts?
>
> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
>> Lance,
>>
>> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>>
>>
>>
>> ________________________________
>>  From: Lance Norskog <go...@gmail.com>
>> To: dev@mahout.apache.org
>> Sent: Monday, January 16, 2012 8:46 PM
>> Subject: Re: Minhash review
>>
>> Minhash works better and better with the more dimensions you throw at
>> it, right? All of the Display classes use two dimensions. Would
>> MinHash more sense if it uses a few hundred dimensions and then
>> collapse down to two? Maybe with SVD?
>>
>> Are there other clustering algorithms that have this problem?
>>
>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>>
>>> I think before we do 0.6 it could use a few eyeballs.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com

Re: Minhash review

Posted by Suneel Marthi <su...@yahoo.com>.

Frank,

I modified the present MinHash to hash on the index as opposed to the present tf-idf weights, but the change had no impact on the output and I still get bad clusters.

I did read the blog posting you mention and that seems to be the right approach (and conforms to Broder's original paper on this subject - http://dl.acm.org/citation.cfm?id=736184).

I can work on this. Do we modify the existing minhash code to be compliant with Broder's paper or do we implement a different MinHash based on Broder's paper?

Regards.

________________________________
 From: Frank Scholten <fr...@frankscholten.nl>
To: dev@mahout.apache.org 
Sent: Monday, March 5, 2012 9:13 AM
Subject: Re: Minhash review

I am also curious about the current MinHash implementation. In the
current implementation the vector TF or TF-IDF weights are hashed via
Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
the mailinglist that this is incorrect and the index should be hashed
because the index identifies an N-gram in the dictionary.

However in this blog

http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html

hashing is done directly on the N-gram itself.

How is this algorithm supposed to work? Thoughts?

On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Lance,
>
> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>
>
>
> ________________________________
>  From: Lance Norskog <go...@gmail.com>
> To: dev@mahout.apache.org
> Sent: Monday, January 16, 2012 8:46 PM
> Subject: Re: Minhash review
>
> Minhash works better and better with the more dimensions you throw at
> it, right? All of the Display classes use two dimensions. Would
> MinHash more sense if it uses a few hundred dimensions and then
> collapse down to two? Maybe with SVD?
>
> Are there other clustering algorithms that have this problem?
>
> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>
>> I think before we do 0.6 it could use a few eyeballs.
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com

Re: Minhash review

Posted by Frank Scholten <fr...@frankscholten.nl>.

I am also curious about the current MinHash implementation. In the
current implementation the vector TF or TF-IDF weights are hashed via
Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
the mailinglist that this is incorrect and the index should be hashed
because the index identifies an N-gram in the dictionary.

However in this blog

http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html

hashing is done directly on the N-gram itself.

How is this algorithm supposed to work? Thoughts?

On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Lance,
>
> I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.
>
>
>
> ________________________________
>  From: Lance Norskog <go...@gmail.com>
> To: dev@mahout.apache.org
> Sent: Monday, January 16, 2012 8:46 PM
> Subject: Re: Minhash review
>
> Minhash works better and better with the more dimensions you throw at
> it, right? All of the Display classes use two dimensions. Would
> MinHash more sense if it uses a few hundred dimensions and then
> collapse down to two? Maybe with SVD?
>
> Are there other clustering algorithms that have this problem?
>
> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>>
>> I think before we do 0.6 it could use a few eyeballs.
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com

Re: Minhash review

Posted by Suneel Marthi <su...@yahoo.com>.

Lance,

I don't think this problem is confined to DisplayMinHash alone, even the regular MinHash clustering doesn't seem right when run on the Reuter's dataset (using cluster-reuters.sh) and a few other data sets I had tried.  I am playing with the the keyGroups values to determine if that improves the quality of clustering.

________________________________
 From: Lance Norskog <go...@gmail.com>
To: dev@mahout.apache.org 
Sent: Monday, January 16, 2012 8:46 PM
Subject: Re: Minhash review

Minhash works better and better with the more dimensions you throw at
it, right? All of the Display classes use two dimensions. Would
MinHash more sense if it uses a few hundred dimensions and then
collapse down to two? Maybe with SVD?

Are there other clustering algorithms that have this problem?

On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>
> I think before we do 0.6 it could use a few eyeballs.
>
>

-- 
Lance Norskog
goksron@gmail.com

Re: Minhash review

Posted by Lance Norskog <go...@gmail.com>.

Minhash works better and better with the more dimensions you throw at
it, right? All of the Display classes use two dimensions. Would
MinHash more sense if it uses a few hundred dimensions and then
collapse down to two? Maybe with SVD?

Are there other clustering algorithms that have this problem?

On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gs...@apache.org> wrote:
> I've had a sneaking suspicion for a while now that our minhash clustering isn't right.  Looking at the DisplayMinHash contributed issue further cements this feeling, but I can't quite put my finger on what is wrong.  I don't think it is completely true to the Broder paper, but that doesn't necessarily make it wrong.  It's just both the cluster-reuters output and the DisplayMinHash output seem to be of pretty low quality.  My gut says it has to do with the group stuff whereby we create the signatures.
>
> I think before we do 0.6 it could use a few eyeballs.
>
>



-- 
Lance Norskog
goksron@gmail.com