You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Jeff Hansen <ds...@gmail.com> on 2011/06/10 00:53:23 UTC

MinHash implementation

I'm having a little trouble understanding Mahout's minhash implementation.

Please correct me if I'm wrong, but the general intent of minhash is to
evaluate the similarity of two sparse feature vectors based on the features
they have in common, not necessarily the value of those features (as the
values are often 1 or 0 and 0 values simply aren't tracked in the sparse
vector).  So given a space of 10 dimensions, if Jack had features 4 and 6
and Jill had features 5 and 6, Jacks vector would look something like {4:1,
6:1} and Jill's would like like {5:1, 6:1}.  Since they have 1/3 total
features in common, their Jaccard coefficient is 1/3.  Also, given K random
hash functions, we would expect about a third of them to return a minimum
value for each of the three keys 4, 5 and 6 and thus about a third of them
would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
third that return a minimum hash value for the key 6).  That's my basic
English explanation of the purpose of minhash -- again, somebody please
correct me if I'm wrong.

Given that understanding, can somebody explain why Mahout's minhash
implentation is hasing the values from the feature vectors rather than the
keys?

See the following code from MinHashMapper.java

    for (int i = 0; i < numHashFunctions; i++) {
      for (Vector.Element ele : featureVector) {

/// Shouldn't the following line say ele.index();
        int value = (int) ele.get();

        bytesToHash[0] = (byte) (value >> 24);
        bytesToHash[1] = (byte) (value >> 16);
        bytesToHash[2] = (byte) (value >> 8);
        bytesToHash[3] = (byte) value;
        int hashIndex = hashFunction[i].hash(bytesToHash);
        if (minHashValues[i] > hashIndex) {
          minHashValues[i] = hashIndex;
        }
      }
    }

The code in TestMinHashClustering also seems to written with the expectation
that minhash should be hashing the values rather than the keys.  Am I
reading something wrong here?  Is this the intended use?  Are we supposed to
be putting the feature ids into the double value fields of the feature
vectors?

Thanks,
Jeff

Re: MinHash implementation

Posted by Sean Owen <sr...@gmail.com>.

It's the right place -- best-effort question-answering is not always that good.

A JIRA is a good thing if you have a specific idea of the issue /
enhancement and ideally a proposed patch. That is tracked with some
loose regularity and so might get more attention.

On Tue, Aug 16, 2011 at 5:13 PM, Jeff Hansen <ds...@gmail.com> wrote:
> I just looked at the initial JIRA to create this implementation and saw the
> example code that uses it --
> https://issues.apache.org/jira/browse/MAHOUT-344
>
> The LastfmDataConverter class is indeed creating a vector with the indices
> stored in the values and spurious information stored in the indices:
>
>        Vector featureVector = new
> SequentialAccessSparseVector(numfeatures);
>        int i = 0;
>        for (Integer feature : itemFeature.getValue()) {
>          featureVector.setQuick(i++, feature);
>        }
>
> As such, that explains why the implementation was written the way it was.  I
> imagine the implementer just didn't understand the point of sparse vectors
> as they might as well have used a dense vector given their implementation.
>  I think this would be best reimplemented using the actual indices of sparse
> vectors for consistency -- most of the utilities to generate vectors follow
> that approach (seq2sparse).
>
> By the way, given that I never got any responses two months ago when I asked
> this question, I'm assuming this is probably the wrong mailing list for
> something like this -- should I have sent this to the developers mailing
> list?  Or would it have been better just to go ahead and submit a JIRA?
>
> Thanks!
>
> On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> I'm not the authoritative voice here, but I would also agree with your
>> interpretation -- it's indices rather than values that I'd use.
>> I can imagine using min-hash on values, but that would not seem to be
>> the most natural thing to do.
>>
>> (I don't understand the comment about set and get(). Vectors aren't
>> sets, and whether it's sparse or not shouldn't decide whether you want
>> values or indices.)
>>
>> On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <li...@gmail.com> wrote:
>> > I think, if your input vector is a set, the ele.get() should be used,
>> > instead, if your input vector is a sparse vector, the ele.index() would
>> be
>> > used.
>> >
>> > Pls correct me if I'm wrong.
>> >
>> >  for (int i = 0; i < numHashFunctions; i++) {
>> >     for (Vector.Element ele : featureVector) {
>> >
>> > /// Shouldn't the following line say ele.index();
>> >       int value = (int) ele.get();
>> >
>> >       bytesToHash[0] = (byte) (value >> 24);
>> >       bytesToHash[1] = (byte) (value >> 16);
>> >       bytesToHash[2] = (byte) (value >> 8);
>> >       bytesToHash[3] = (byte) value;
>> >       int hashIndex = hashFunction[i].hash(bytesToHash);
>> >       if (minHashValues[i] > hashIndex) {
>> >         minHashValues[i] = hashIndex;
>> >       }
>> >     }
>> >   }
>> >
>> > On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <ds...@gmail.com> wrote:
>> >
>> >> I'm having a little trouble understanding Mahout's minhash
>> implementation.
>> >>
>> >> Please correct me if I'm wrong, but the general intent of minhash is to
>> >> evaluate the similarity of two sparse feature vectors based on the
>> features
>> >> they have in common, not necessarily the value of those features (as the
>> >> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
>> >> vector).  So given a space of 10 dimensions, if Jack had features 4 and
>> 6
>> >> and Jill had features 5 and 6, Jacks vector would look something like
>> {4:1,
>> >> 6:1} and Jill's would like like {5:1, 6:1}.  Since they have 1/3 total
>> >> features in common, their Jaccard coefficient is 1/3.  Also, given K
>> random
>> >> hash functions, we would expect about a third of them to return a
>> minimum
>> >> value for each of the three keys 4, 5 and 6 and thus about a third of
>> them
>> >> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
>> >> third that return a minimum hash value for the key 6).  That's my basic
>> >> English explanation of the purpose of minhash -- again, somebody please
>> >> correct me if I'm wrong.
>> >>
>> >> Given that understanding, can somebody explain why Mahout's minhash
>> >> implentation is hasing the values from the feature vectors rather than
>> the
>> >> keys?
>> >>
>> >> See the following code from MinHashMapper.java
>> >>
>> >>    for (int i = 0; i < numHashFunctions; i++) {
>> >>      for (Vector.Element ele : featureVector) {
>> >>
>> >> /// Shouldn't the following line say ele.index();
>> >>        int value = (int) ele.get();
>> >>
>> >>        bytesToHash[0] = (byte) (value >> 24);
>> >>        bytesToHash[1] = (byte) (value >> 16);
>> >>        bytesToHash[2] = (byte) (value >> 8);
>> >>        bytesToHash[3] = (byte) value;
>> >>        int hashIndex = hashFunction[i].hash(bytesToHash);
>> >>        if (minHashValues[i] > hashIndex) {
>> >>          minHashValues[i] = hashIndex;
>> >>        }
>> >>      }
>> >>    }
>> >>
>> >> The code in TestMinHashClustering also seems to written with the
>> >> expectation
>> >> that minhash should be hashing the values rather than the keys.  Am I
>> >> reading something wrong here?  Is this the intended use?  Are we
>> supposed
>> >> to
>> >> be putting the feature ids into the double value fields of the feature
>> >> vectors?
>> >>
>> >> Thanks,
>> >> Jeff
>> >>
>> >
>> >
>> >
>> > --
>> >
>>
>

Re: MinHash implementation

Posted by Jeff Hansen <ds...@gmail.com>.

I just looked at the initial JIRA to create this implementation and saw the
example code that uses it --
https://issues.apache.org/jira/browse/MAHOUT-344

The LastfmDataConverter class is indeed creating a vector with the indices
stored in the values and spurious information stored in the indices:

        Vector featureVector = new
SequentialAccessSparseVector(numfeatures);
        int i = 0;
        for (Integer feature : itemFeature.getValue()) {
          featureVector.setQuick(i++, feature);
        }

As such, that explains why the implementation was written the way it was.  I
imagine the implementer just didn't understand the point of sparse vectors
as they might as well have used a dense vector given their implementation.
 I think this would be best reimplemented using the actual indices of sparse
vectors for consistency -- most of the utilities to generate vectors follow
that approach (seq2sparse).

By the way, given that I never got any responses two months ago when I asked
this question, I'm assuming this is probably the wrong mailing list for
something like this -- should I have sent this to the developers mailing
list?  Or would it have been better just to go ahead and submit a JIRA?

Thanks!

On Tue, Aug 16, 2011 at 3:08 AM, Sean Owen <sr...@gmail.com> wrote:

> I'm not the authoritative voice here, but I would also agree with your
> interpretation -- it's indices rather than values that I'd use.
> I can imagine using min-hash on values, but that would not seem to be
> the most natural thing to do.
>
> (I don't understand the comment about set and get(). Vectors aren't
> sets, and whether it's sparse or not shouldn't decide whether you want
> values or indices.)
>
> On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <li...@gmail.com> wrote:
> > I think, if your input vector is a set, the ele.get() should be used,
> > instead, if your input vector is a sparse vector, the ele.index() would
> be
> > used.
> >
> > Pls correct me if I'm wrong.
> >
> >  for (int i = 0; i < numHashFunctions; i++) {
> >     for (Vector.Element ele : featureVector) {
> >
> > /// Shouldn't the following line say ele.index();
> >       int value = (int) ele.get();
> >
> >       bytesToHash[0] = (byte) (value >> 24);
> >       bytesToHash[1] = (byte) (value >> 16);
> >       bytesToHash[2] = (byte) (value >> 8);
> >       bytesToHash[3] = (byte) value;
> >       int hashIndex = hashFunction[i].hash(bytesToHash);
> >       if (minHashValues[i] > hashIndex) {
> >         minHashValues[i] = hashIndex;
> >       }
> >     }
> >   }
> >
> > On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <ds...@gmail.com> wrote:
> >
> >> I'm having a little trouble understanding Mahout's minhash
> implementation.
> >>
> >> Please correct me if I'm wrong, but the general intent of minhash is to
> >> evaluate the similarity of two sparse feature vectors based on the
> features
> >> they have in common, not necessarily the value of those features (as the
> >> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
> >> vector).  So given a space of 10 dimensions, if Jack had features 4 and
> 6
> >> and Jill had features 5 and 6, Jacks vector would look something like
> {4:1,
> >> 6:1} and Jill's would like like {5:1, 6:1}.  Since they have 1/3 total
> >> features in common, their Jaccard coefficient is 1/3.  Also, given K
> random
> >> hash functions, we would expect about a third of them to return a
> minimum
> >> value for each of the three keys 4, 5 and 6 and thus about a third of
> them
> >> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
> >> third that return a minimum hash value for the key 6).  That's my basic
> >> English explanation of the purpose of minhash -- again, somebody please
> >> correct me if I'm wrong.
> >>
> >> Given that understanding, can somebody explain why Mahout's minhash
> >> implentation is hasing the values from the feature vectors rather than
> the
> >> keys?
> >>
> >> See the following code from MinHashMapper.java
> >>
> >>    for (int i = 0; i < numHashFunctions; i++) {
> >>      for (Vector.Element ele : featureVector) {
> >>
> >> /// Shouldn't the following line say ele.index();
> >>        int value = (int) ele.get();
> >>
> >>        bytesToHash[0] = (byte) (value >> 24);
> >>        bytesToHash[1] = (byte) (value >> 16);
> >>        bytesToHash[2] = (byte) (value >> 8);
> >>        bytesToHash[3] = (byte) value;
> >>        int hashIndex = hashFunction[i].hash(bytesToHash);
> >>        if (minHashValues[i] > hashIndex) {
> >>          minHashValues[i] = hashIndex;
> >>        }
> >>      }
> >>    }
> >>
> >> The code in TestMinHashClustering also seems to written with the
> >> expectation
> >> that minhash should be hashing the values rather than the keys.  Am I
> >> reading something wrong here?  Is this the intended use?  Are we
> supposed
> >> to
> >> be putting the feature ids into the double value fields of the feature
> >> vectors?
> >>
> >> Thanks,
> >> Jeff
> >>
> >
> >
> >
> > --
> >
>

Re: MinHash implementation

Posted by Sean Owen <sr...@gmail.com>.

I'm not the authoritative voice here, but I would also agree with your
interpretation -- it's indices rather than values that I'd use.
I can imagine using min-hash on values, but that would not seem to be
the most natural thing to do.

(I don't understand the comment about set and get(). Vectors aren't
sets, and whether it's sparse or not shouldn't decide whether you want
values or indices.)

On Tue, Aug 16, 2011 at 7:23 AM, 刘鎏 <li...@gmail.com> wrote:
> I think, if your input vector is a set, the ele.get() should be used,
> instead, if your input vector is a sparse vector, the ele.index() would be
> used.
>
> Pls correct me if I'm wrong.
>
>  for (int i = 0; i < numHashFunctions; i++) {
>     for (Vector.Element ele : featureVector) {
>
> /// Shouldn't the following line say ele.index();
>       int value = (int) ele.get();
>
>       bytesToHash[0] = (byte) (value >> 24);
>       bytesToHash[1] = (byte) (value >> 16);
>       bytesToHash[2] = (byte) (value >> 8);
>       bytesToHash[3] = (byte) value;
>       int hashIndex = hashFunction[i].hash(bytesToHash);
>       if (minHashValues[i] > hashIndex) {
>         minHashValues[i] = hashIndex;
>       }
>     }
>   }
>
> On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <ds...@gmail.com> wrote:
>
>> I'm having a little trouble understanding Mahout's minhash implementation.
>>
>> Please correct me if I'm wrong, but the general intent of minhash is to
>> evaluate the similarity of two sparse feature vectors based on the features
>> they have in common, not necessarily the value of those features (as the
>> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
>> vector).  So given a space of 10 dimensions, if Jack had features 4 and 6
>> and Jill had features 5 and 6, Jacks vector would look something like {4:1,
>> 6:1} and Jill's would like like {5:1, 6:1}.  Since they have 1/3 total
>> features in common, their Jaccard coefficient is 1/3.  Also, given K random
>> hash functions, we would expect about a third of them to return a minimum
>> value for each of the three keys 4, 5 and 6 and thus about a third of them
>> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
>> third that return a minimum hash value for the key 6).  That's my basic
>> English explanation of the purpose of minhash -- again, somebody please
>> correct me if I'm wrong.
>>
>> Given that understanding, can somebody explain why Mahout's minhash
>> implentation is hasing the values from the feature vectors rather than the
>> keys?
>>
>> See the following code from MinHashMapper.java
>>
>>    for (int i = 0; i < numHashFunctions; i++) {
>>      for (Vector.Element ele : featureVector) {
>>
>> /// Shouldn't the following line say ele.index();
>>        int value = (int) ele.get();
>>
>>        bytesToHash[0] = (byte) (value >> 24);
>>        bytesToHash[1] = (byte) (value >> 16);
>>        bytesToHash[2] = (byte) (value >> 8);
>>        bytesToHash[3] = (byte) value;
>>        int hashIndex = hashFunction[i].hash(bytesToHash);
>>        if (minHashValues[i] > hashIndex) {
>>          minHashValues[i] = hashIndex;
>>        }
>>      }
>>    }
>>
>> The code in TestMinHashClustering also seems to written with the
>> expectation
>> that minhash should be hashing the values rather than the keys.  Am I
>> reading something wrong here?  Is this the intended use?  Are we supposed
>> to
>> be putting the feature ids into the double value fields of the feature
>> vectors?
>>
>> Thanks,
>> Jeff
>>
>
>
>
> --
>

Re: MinHash implementation

Posted by 刘鎏 <li...@gmail.com>.

I think, if your input vector is a set, the ele.get() should be used,
instead, if your input vector is a sparse vector, the ele.index() would be
used.

Pls correct me if I'm wrong.

 for (int i = 0; i < numHashFunctions; i++) {
     for (Vector.Element ele : featureVector) {

/// Shouldn't the following line say ele.index();
       int value = (int) ele.get();

       bytesToHash[0] = (byte) (value >> 24);
       bytesToHash[1] = (byte) (value >> 16);
       bytesToHash[2] = (byte) (value >> 8);
       bytesToHash[3] = (byte) value;
       int hashIndex = hashFunction[i].hash(bytesToHash);
       if (minHashValues[i] > hashIndex) {
         minHashValues[i] = hashIndex;
       }
     }
   }

On Fri, Jun 10, 2011 at 6:53 AM, Jeff Hansen <ds...@gmail.com> wrote:

> I'm having a little trouble understanding Mahout's minhash implementation.
>
> Please correct me if I'm wrong, but the general intent of minhash is to
> evaluate the similarity of two sparse feature vectors based on the features
> they have in common, not necessarily the value of those features (as the
> values are often 1 or 0 and 0 values simply aren't tracked in the sparse
> vector).  So given a space of 10 dimensions, if Jack had features 4 and 6
> and Jill had features 5 and 6, Jacks vector would look something like {4:1,
> 6:1} and Jill's would like like {5:1, 6:1}.  Since they have 1/3 total
> features in common, their Jaccard coefficient is 1/3.  Also, given K random
> hash functions, we would expect about a third of them to return a minimum
> value for each of the three keys 4, 5 and 6 and thus about a third of them
> would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
> third that return a minimum hash value for the key 6).  That's my basic
> English explanation of the purpose of minhash -- again, somebody please
> correct me if I'm wrong.
>
> Given that understanding, can somebody explain why Mahout's minhash
> implentation is hasing the values from the feature vectors rather than the
> keys?
>
> See the following code from MinHashMapper.java
>
>    for (int i = 0; i < numHashFunctions; i++) {
>      for (Vector.Element ele : featureVector) {
>
> /// Shouldn't the following line say ele.index();
>        int value = (int) ele.get();
>
>        bytesToHash[0] = (byte) (value >> 24);
>        bytesToHash[1] = (byte) (value >> 16);
>        bytesToHash[2] = (byte) (value >> 8);
>        bytesToHash[3] = (byte) value;
>        int hashIndex = hashFunction[i].hash(bytesToHash);
>        if (minHashValues[i] > hashIndex) {
>          minHashValues[i] = hashIndex;
>        }
>      }
>    }
>
> The code in TestMinHashClustering also seems to written with the
> expectation
> that minhash should be hashing the values rather than the keys.  Am I
> reading something wrong here?  Is this the intended use?  Are we supposed
> to
> be putting the feature ids into the double value fields of the feature
> vectors?
>
> Thanks,
> Jeff
>



--