You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Elena Smirnova <es...@gmail.com> on 2012/07/30 10:35:09 UTC

Possible issue in MinHashMapper

Hello,

It seems to me that there is an issue in MinHashMapper class. In map
method, the loop goes over the elements in the vector. In many cases the
instance of Vector abstract class is a SparseVector and iteration would
meant to be over non-zeros values (e.g., documents as a sparse vector of
words).  However, in current implementation the iteration will go over all
the elements including zero-valued (as using the vector iterator by
default). This can produce meaningless clustering. In addition, in this
case I think we should hash the index of the element rather than it's
value.

Can somebody confirm or disprove this?

Thanks,

Best regards,
Elena Smirnova.

Re: Possible issue in MinHashMapper

Posted by Sean Owen <sr...@gmail.com>.

Definitely, that works for the example, though it then fails for the unit
test.

On Mon, Jul 30, 2012 at 1:41 PM, Elena Smirnova <es...@gmail.com>wrote:

> OK. But I preferred to modify a couple of lines of MinHashMapper rather
> than writing my own spaseVectors generator:
> ...................
>  for (int i = 0; i < numHashFunctions; i++) {
>             Iterator it = featureVector.iterateNonZero();
>             while (it.hasNext()) {
>                 Vector.Element next = (Vector.Element) it.next();
>                 int value = next.index();
> ..................
>
> It works fine.

Re: Possible issue in MinHashMapper

Posted by Elena Smirnova <es...@gmail.com>.

OK. But I preferred to modify a couple of lines of MinHashMapper rather
than writing my own spaseVectors generator:
...................
 for (int i = 0; i < numHashFunctions; i++) {
            Iterator it = featureVector.iterateNonZero();
            while (it.hasNext()) {
                Vector.Element next = (Vector.Element) it.next();
                int value = next.index();
..................

It works fine.

On Mon, Jul 30, 2012 at 2:27 PM, Sean Owen <sr...@gmail.com> wrote:

> From my reading of the code and the unit test, that is totally different
> from the expected input of MinHashDriver. So I think the example should
> just be deleted.
>
> You would have to convert this to something like:
>
> {0:96037,1:114292,2:119745,...}
>
> to make it work correctly.
>
> On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <esmirnovae@gmail.com
> >wrote:
>
> > The input to the MinHashDriver are tf-vectors, e.g.:
> > hdfs -text /..../tf-vectors/part-r-00000
> > 10020926
> >
> >
> {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
> > 10020927
> > {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
> > 10020928
> >
> {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
> > 10020929
> >
> {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
> >
> > So the vectors are sparse and indexed by wordId. Clearly, this example
> > doesn't fit into current implementation.
> >
> > On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > Hmm yeah what's the output format of that job? At first glance it looks
> > > like it makes a dictionary and fills in sequentially with term
> > frequencies.
> > > If so, applying min-hash this way isn't wildly wrong but I still don't
> > > think it works.
> > >
> > > Can anyone who knows this code at all weigh in? it's not my area...
> > >
> > > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <esmirnovae@gmail.com
> > > >wrote:
> > >
> > > > If vectors are treated as dense, then we have to modify the example
> > given
> > > > for this class, which clearly talks about documents and words:
> > > >
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
> > > >
> > > > On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <sr...@gmail.com> wrote:
> > > >
> > > > > Yes I know what you mean. In my understanding you typically apply
> > > minhash
> > > > > to a large sparse vector that acts like a bit set, where the index
> is
> > > > > really the set member. There you want to hash the index, and doing
> so
> > > by
> > > > > considering all indices would be completely wrong.
> > > > >
> > > > > Here I think the set elements are the values. and the vectors seem
> to
> > > be
> > > > > treated as a list, really. So I'm not surprised they're treated as
> > > > dense. I
> > > > > still think it's a good idea to iterate over non-default items,
> since
> > > I'm
> > > > > not clear whether the implementation is guaranteed to accept only
> > dense
> > > > > input vectors, where all dimensions have a value -- in which case
> it
> > > > > doesn't matter and the current implementation is OK.
> > > > >
> > > > > Ankur are you still around to answer? I think that's a good guess
> as
> > to
> > > > the
> > > > > original intent.
> > > > >
> > > > > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <
> > esmirnovae@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > I agree about performance effect of iterating over zeros. But the
> > > > > > correctness effect comes due to hashing values of the element and
> > not
> > > > its
> > > > > > index (at least in documents and words example).
> > > > > >
> > > > > > Do you agree?
> > > > >
> > > >
> > >
> >
>

Re: Possible issue in MinHashMapper

Posted by Elena Smirnova <es...@gmail.com>.

Thanks Suneel,

I think it would also be interesting for Mahout users to have both use
cases. From my experience (Information Retrieval), the use case with
hashing indexes is quite popular..

For example, we can define a parameter in MinHashDriver job that says if we
want to hash values or indexes of the vectors. That way we can keep unit
tests functional and add similar one for hashing indexes. Also we could
also add back the example.

What do you think?


On Tue, Jul 31, 2012 at 12:46 AM, Suneel Marthi <su...@yahoo.com>wrote:

> Done.
>
>
>
> ________________________________
>  From: Suneel Marthi <su...@yahoo.com>
> To: "dev@mahout.apache.org" <de...@mahout.apache.org>
> Sent: Monday, July 30, 2012 6:44 PM
> Subject: Re: Possible issue in MinHashMapper
>
> Sean,
>
> I'll take care of this, I added this sometime last year but was never
> convinced that it ever worked right.
>
>
>
> ________________________________
> From: Sean Owen <sr...@gmail.com>
> To: dev@mahout.apache.org
> Sent: Monday, July 30, 2012 5:18 PM
> Subject: Re: Possible issue in MinHashMapper
>
> (Does anyone have edit rights to delete this wiki please? It appears
> to be just wrong.)
>
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
>
> On Mon, Jul 30, 2012 at 1:27 PM, Sean Owen <sr...@gmail.com> wrote:
> > From my reading of the code and the unit test, that is totally different
> > from the expected input of MinHashDriver. So I think the example should
> just
> > be deleted.
> >
> > You would have to convert this to something like:
> >
> > {0:96037,1:114292,2:119745,...}
> >
> > to make it work correctly.
> >
> >
> > On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <es...@gmail.com>
> > wrote:
> >>
> >> The input to the MinHashDriver are tf-vectors, e.g.:
> >> hdfs -text /..../tf-vectors/part-r-00000
> >> 10020926
> >>
> >>
> {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
> >> 10020927
> >> {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
> >> 10020928
> >>
> >>
> {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
> >> 10020929
> >>
> >>
> {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
> >>
> >> So the vectors are sparse and indexed by wordId. Clearly, this example
> >> doesn't fit into current implementation.
> >>
> >> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <sr...@gmail.com> wrote:
> >>
> >> > Hmm yeah what's the output format of that job? At first glance it
> looks
> >> > like it makes a dictionary and fills in sequentially with term
> >> > frequencies.
> >> > If so, applying min-hash this way isn't wildly wrong but I still don't
> >> > think it works.
> >> >
> >> > Can anyone who knows this code at all weigh in? it's not my area...
> >> >
> >> > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <esmirnovae@gmail.com
> >> > >wrote:
> >> >
> >> > > If vectors are treated as dense, then we have to modify the example
> >> > > given
> >> > > for this class, which clearly talks about documents and words:
> >> > >
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
>

Re: Possible issue in MinHashMapper

Posted by Suneel Marthi <su...@yahoo.com>.

Done.



________________________________
 From: Suneel Marthi <su...@yahoo.com>
To: "dev@mahout.apache.org" <de...@mahout.apache.org> 
Sent: Monday, July 30, 2012 6:44 PM
Subject: Re: Possible issue in MinHashMapper
 
Sean,

I'll take care of this, I added this sometime last year but was never convinced that it ever worked right. 



________________________________
From: Sean Owen <sr...@gmail.com>
To: dev@mahout.apache.org 
Sent: Monday, July 30, 2012 5:18 PM
Subject: Re: Possible issue in MinHashMapper

(Does anyone have edit rights to delete this wiki please? It appears
to be just wrong.)

https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

On Mon, Jul 30, 2012 at 1:27 PM, Sean Owen <sr...@gmail.com> wrote:
> From my reading of the code and the unit test, that is totally different
> from the expected input of MinHashDriver. So I think the example should just
> be deleted.
>
> You would have to convert this to something like:
>
> {0:96037,1:114292,2:119745,...}
>
> to make it work correctly.
>
>
> On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <es...@gmail.com>
> wrote:
>>
>> The input to the MinHashDriver are tf-vectors, e.g.:
>> hdfs -text /..../tf-vectors/part-r-00000
>> 10020926
>>
>> {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
>> 10020927
>> {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
>> 10020928
>>
>> {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
>> 10020929
>>
>> {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
>>
>> So the vectors are sparse and indexed by wordId. Clearly, this example
>> doesn't fit into current implementation.
>>
>> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>> > Hmm yeah what's the output format of that job? At first glance it looks
>> > like it makes a dictionary and fills in sequentially with term
>> > frequencies.
>> > If so, applying min-hash this way isn't wildly wrong but I still don't
>> > think it works.
>> >
>> > Can anyone who knows this code at all weigh in? it's not my area...
>> >
>> > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <esmirnovae@gmail.com
>> > >wrote:
>> >
>> > > If vectors are treated as dense, then we have to modify the example
>> > > given
>> > > for this class, which clearly talks about documents and words:
>> > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

Re: Possible issue in MinHashMapper

Posted by Suneel Marthi <su...@yahoo.com>.

Sean,

I'll take care of this, I added this sometime last year but was never convinced that it ever worked right. 



________________________________
 From: Sean Owen <sr...@gmail.com>
To: dev@mahout.apache.org 
Sent: Monday, July 30, 2012 5:18 PM
Subject: Re: Possible issue in MinHashMapper
 
(Does anyone have edit rights to delete this wiki please? It appears
to be just wrong.)

https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

On Mon, Jul 30, 2012 at 1:27 PM, Sean Owen <sr...@gmail.com> wrote:
> From my reading of the code and the unit test, that is totally different
> from the expected input of MinHashDriver. So I think the example should just
> be deleted.
>
> You would have to convert this to something like:
>
> {0:96037,1:114292,2:119745,...}
>
> to make it work correctly.
>
>
> On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <es...@gmail.com>
> wrote:
>>
>> The input to the MinHashDriver are tf-vectors, e.g.:
>> hdfs -text /..../tf-vectors/part-r-00000
>> 10020926
>>
>> {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
>> 10020927
>> {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
>> 10020928
>>
>> {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
>> 10020929
>>
>> {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
>>
>> So the vectors are sparse and indexed by wordId. Clearly, this example
>> doesn't fit into current implementation.
>>
>> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>> > Hmm yeah what's the output format of that job? At first glance it looks
>> > like it makes a dictionary and fills in sequentially with term
>> > frequencies.
>> > If so, applying min-hash this way isn't wildly wrong but I still don't
>> > think it works.
>> >
>> > Can anyone who knows this code at all weigh in? it's not my area...
>> >
>> > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <esmirnovae@gmail.com
>> > >wrote:
>> >
>> > > If vectors are treated as dense, then we have to modify the example
>> > > given
>> > > for this class, which clearly talks about documents and words:
>> > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

Re: Possible issue in MinHashMapper

Posted by Sean Owen <sr...@gmail.com>.

(Does anyone have edit rights to delete this wiki please? It appears
to be just wrong.)

https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

On Mon, Jul 30, 2012 at 1:27 PM, Sean Owen <sr...@gmail.com> wrote:
> From my reading of the code and the unit test, that is totally different
> from the expected input of MinHashDriver. So I think the example should just
> be deleted.
>
> You would have to convert this to something like:
>
> {0:96037,1:114292,2:119745,...}
>
> to make it work correctly.
>
>
> On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <es...@gmail.com>
> wrote:
>>
>> The input to the MinHashDriver are tf-vectors, e.g.:
>> hdfs -text /..../tf-vectors/part-r-00000
>> 10020926
>>
>> {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
>> 10020927
>> {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
>> 10020928
>>
>> {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
>> 10020929
>>
>> {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
>>
>> So the vectors are sparse and indexed by wordId. Clearly, this example
>> doesn't fit into current implementation.
>>
>> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>> > Hmm yeah what's the output format of that job? At first glance it looks
>> > like it makes a dictionary and fills in sequentially with term
>> > frequencies.
>> > If so, applying min-hash this way isn't wildly wrong but I still don't
>> > think it works.
>> >
>> > Can anyone who knows this code at all weigh in? it's not my area...
>> >
>> > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <esmirnovae@gmail.com
>> > >wrote:
>> >
>> > > If vectors are treated as dense, then we have to modify the example
>> > > given
>> > > for this class, which clearly talks about documents and words:
>> > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

Re: Possible issue in MinHashMapper

Posted by Sean Owen <sr...@gmail.com>.

>From my reading of the code and the unit test, that is totally different
from the expected input of MinHashDriver. So I think the example should
just be deleted.

You would have to convert this to something like:

{0:96037,1:114292,2:119745,...}

to make it work correctly.

On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <es...@gmail.com>wrote:

> The input to the MinHashDriver are tf-vectors, e.g.:
> hdfs -text /..../tf-vectors/part-r-00000
> 10020926
>
> {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
> 10020927
> {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
> 10020928
> {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
> 10020929
> {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
>
> So the vectors are sparse and indexed by wordId. Clearly, this example
> doesn't fit into current implementation.
>
> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > Hmm yeah what's the output format of that job? At first glance it looks
> > like it makes a dictionary and fills in sequentially with term
> frequencies.
> > If so, applying min-hash this way isn't wildly wrong but I still don't
> > think it works.
> >
> > Can anyone who knows this code at all weigh in? it's not my area...
> >
> > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <esmirnovae@gmail.com
> > >wrote:
> >
> > > If vectors are treated as dense, then we have to modify the example
> given
> > > for this class, which clearly talks about documents and words:
> > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
> > >
> > > On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <sr...@gmail.com> wrote:
> > >
> > > > Yes I know what you mean. In my understanding you typically apply
> > minhash
> > > > to a large sparse vector that acts like a bit set, where the index is
> > > > really the set member. There you want to hash the index, and doing so
> > by
> > > > considering all indices would be completely wrong.
> > > >
> > > > Here I think the set elements are the values. and the vectors seem to
> > be
> > > > treated as a list, really. So I'm not surprised they're treated as
> > > dense. I
> > > > still think it's a good idea to iterate over non-default items, since
> > I'm
> > > > not clear whether the implementation is guaranteed to accept only
> dense
> > > > input vectors, where all dimensions have a value -- in which case it
> > > > doesn't matter and the current implementation is OK.
> > > >
> > > > Ankur are you still around to answer? I think that's a good guess as
> to
> > > the
> > > > original intent.
> > > >
> > > > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <
> esmirnovae@gmail.com
> > > > >wrote:
> > > >
> > > > > I agree about performance effect of iterating over zeros. But the
> > > > > correctness effect comes due to hashing values of the element and
> not
> > > its
> > > > > index (at least in documents and words example).
> > > > >
> > > > > Do you agree?
> > > >
> > >
> >
>

Re: Possible issue in MinHashMapper

Posted by Elena Smirnova <es...@gmail.com>.

The input to the MinHashDriver are tf-vectors, e.g.:
hdfs -text /..../tf-vectors/part-r-00000
10020926
{96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
10020927
{53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
10020928
{53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
10020929
{53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}

So the vectors are sparse and indexed by wordId. Clearly, this example
doesn't fit into current implementation.

On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <sr...@gmail.com> wrote:

> Hmm yeah what's the output format of that job? At first glance it looks
> like it makes a dictionary and fills in sequentially with term frequencies.
> If so, applying min-hash this way isn't wildly wrong but I still don't
> think it works.
>
> Can anyone who knows this code at all weigh in? it's not my area...
>
> On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <esmirnovae@gmail.com
> >wrote:
>
> > If vectors are treated as dense, then we have to modify the example given
> > for this class, which clearly talks about documents and words:
> > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
> >
> > On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > Yes I know what you mean. In my understanding you typically apply
> minhash
> > > to a large sparse vector that acts like a bit set, where the index is
> > > really the set member. There you want to hash the index, and doing so
> by
> > > considering all indices would be completely wrong.
> > >
> > > Here I think the set elements are the values. and the vectors seem to
> be
> > > treated as a list, really. So I'm not surprised they're treated as
> > dense. I
> > > still think it's a good idea to iterate over non-default items, since
> I'm
> > > not clear whether the implementation is guaranteed to accept only dense
> > > input vectors, where all dimensions have a value -- in which case it
> > > doesn't matter and the current implementation is OK.
> > >
> > > Ankur are you still around to answer? I think that's a good guess as to
> > the
> > > original intent.
> > >
> > > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <esmirnovae@gmail.com
> > > >wrote:
> > >
> > > > I agree about performance effect of iterating over zeros. But the
> > > > correctness effect comes due to hashing values of the element and not
> > its
> > > > index (at least in documents and words example).
> > > >
> > > > Do you agree?
> > >
> >
>

Re: Possible issue in MinHashMapper

Posted by Sean Owen <sr...@gmail.com>.

Hmm yeah what's the output format of that job? At first glance it looks
like it makes a dictionary and fills in sequentially with term frequencies.
If so, applying min-hash this way isn't wildly wrong but I still don't
think it works.

Can anyone who knows this code at all weigh in? it's not my area...

On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <es...@gmail.com>wrote:

> If vectors are treated as dense, then we have to modify the example given
> for this class, which clearly talks about documents and words:
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
>
> On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > Yes I know what you mean. In my understanding you typically apply minhash
> > to a large sparse vector that acts like a bit set, where the index is
> > really the set member. There you want to hash the index, and doing so by
> > considering all indices would be completely wrong.
> >
> > Here I think the set elements are the values. and the vectors seem to be
> > treated as a list, really. So I'm not surprised they're treated as
> dense. I
> > still think it's a good idea to iterate over non-default items, since I'm
> > not clear whether the implementation is guaranteed to accept only dense
> > input vectors, where all dimensions have a value -- in which case it
> > doesn't matter and the current implementation is OK.
> >
> > Ankur are you still around to answer? I think that's a good guess as to
> the
> > original intent.
> >
> > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <esmirnovae@gmail.com
> > >wrote:
> >
> > > I agree about performance effect of iterating over zeros. But the
> > > correctness effect comes due to hashing values of the element and not
> its
> > > index (at least in documents and words example).
> > >
> > > Do you agree?
> >
>

Re: Possible issue in MinHashMapper

Posted by Elena Smirnova <es...@gmail.com>.

If vectors are treated as dense, then we have to modify the example given
for this class, which clearly talks about documents and words:
https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <sr...@gmail.com> wrote:

> Yes I know what you mean. In my understanding you typically apply minhash
> to a large sparse vector that acts like a bit set, where the index is
> really the set member. There you want to hash the index, and doing so by
> considering all indices would be completely wrong.
>
> Here I think the set elements are the values. and the vectors seem to be
> treated as a list, really. So I'm not surprised they're treated as dense. I
> still think it's a good idea to iterate over non-default items, since I'm
> not clear whether the implementation is guaranteed to accept only dense
> input vectors, where all dimensions have a value -- in which case it
> doesn't matter and the current implementation is OK.
>
> Ankur are you still around to answer? I think that's a good guess as to the
> original intent.
>
> On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <esmirnovae@gmail.com
> >wrote:
>
> > I agree about performance effect of iterating over zeros. But the
> > correctness effect comes due to hashing values of the element and not its
> > index (at least in documents and words example).
> >
> > Do you agree?
>

Re: Possible issue in MinHashMapper

Posted by Sean Owen <sr...@gmail.com>.

Yes I know what you mean. In my understanding you typically apply minhash
to a large sparse vector that acts like a bit set, where the index is
really the set member. There you want to hash the index, and doing so by
considering all indices would be completely wrong.

Here I think the set elements are the values. and the vectors seem to be
treated as a list, really. So I'm not surprised they're treated as dense. I
still think it's a good idea to iterate over non-default items, since I'm
not clear whether the implementation is guaranteed to accept only dense
input vectors, where all dimensions have a value -- in which case it
doesn't matter and the current implementation is OK.

Ankur are you still around to answer? I think that's a good guess as to the
original intent.

On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <es...@gmail.com>wrote:

> I agree about performance effect of iterating over zeros. But the
> correctness effect comes due to hashing values of the element and not its
> index (at least in documents and words example).
>
> Do you agree?

Re: Possible issue in MinHashMapper

Posted by Elena Smirnova <es...@gmail.com>.

I agree about performance effect of iterating over zeros. But the
correctness effect comes due to hashing values of the element and not its
index (at least in documents and words example).

Do you agree?

On Mon, Jul 30, 2012 at 11:58 AM, Sean Owen <sr...@gmail.com> wrote:

> I think that's right, though I think the effect on correctness is quite
> small, but the effect on performance is large. This will always hash zero
> even if zero were not really present in the vector. That is not likely to
> produce the smallest hash value though.
>
> Hashing all those zeroes is wasteful an I'm not clear whether it is
> demanded by the semantics. I suppose I'd also like the author to confirm
> whether this can be changed to look at only non-default values.
>
> On Mon, Jul 30, 2012 at 9:35 AM, Elena Smirnova <esmirnovae@gmail.com
> >wrote:
>
> > Hello,
> >
> > It seems to me that there is an issue in MinHashMapper class. In map
> > method, the loop goes over the elements in the vector. In many cases the
> > instance of Vector abstract class is a SparseVector and iteration would
> > meant to be over non-zeros values (e.g., documents as a sparse vector of
> > words).  However, in current implementation the iteration will go over
> all
> > the elements including zero-valued (as using the vector iterator by
> > default). This can produce meaningless clustering. In addition, in this
> > case I think we should hash the index of the element rather than it's
> > value.
> >
> > Can somebody confirm or disprove this?
> >
> > Thanks,
> >
> > Best regards,
> > Elena Smirnova.
> >
>

Re: Possible issue in MinHashMapper

Posted by Sean Owen <sr...@gmail.com>.

I think that's right, though I think the effect on correctness is quite
small, but the effect on performance is large. This will always hash zero
even if zero were not really present in the vector. That is not likely to
produce the smallest hash value though.

Hashing all those zeroes is wasteful an I'm not clear whether it is
demanded by the semantics. I suppose I'd also like the author to confirm
whether this can be changed to look at only non-default values.

On Mon, Jul 30, 2012 at 9:35 AM, Elena Smirnova <es...@gmail.com>wrote:

> Hello,
>
> It seems to me that there is an issue in MinHashMapper class. In map
> method, the loop goes over the elements in the vector. In many cases the
> instance of Vector abstract class is a SparseVector and iteration would
> meant to be over non-zeros values (e.g., documents as a sparse vector of
> words).  However, in current implementation the iteration will go over all
> the elements including zero-valued (as using the vector iterator by
> default). This can produce meaningless clustering. In addition, in this
> case I think we should hash the index of the element rather than it's
> value.
>
> Can somebody confirm or disprove this?
>
> Thanks,
>
> Best regards,
> Elena Smirnova.
>