Posted to user@mahout.apache.org by ke xie <oe...@gmail.com> on 2011/04/13 09:19:09 UTC

How about a LSH recommender ?

Dear all:

I've read a paper from Google about their news recommender system.
They implemented an LSH algorithm to find the nearest neighbors, and the
algorithm is fast for that.

Can we implement one and contribute it to the Mahout project? Any
suggestions?

The paper is here:
http://iws.seu.edu.cn/resource/Proceedings/WWW/2007/papers/paper570.pdf


Cheers


-- 
Name: Ke Xie   Eddy
Research Group of Information Retrieval
State Key Laboratory of Intelligent Technology and Systems
Tsinghua University

Re: How about a LSH recommender ?

Posted by Jake Mannix <ja...@gmail.com>.
You can do LSH on real-valued vectors: the 1s and 0s are just the
+/- signs of projections onto randomly chosen hyperplanes.

Ullman's book is a great reference for this, and it also covers how
to do all the parameter selection.
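
To make the sign-of-projection idea concrete, here is a minimal Python
sketch (illustrative only, not Mahout code; the dimensions, seed, and
variable names are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_signature(vec, hyperplanes):
    """Bit signature: the sign of the projection onto each random hyperplane."""
    return tuple((hyperplanes @ vec) >= 0)

def hamming(s, t):
    """Number of signature bits on which two vectors disagree."""
    return sum(x != y for x, y in zip(s, t))

dim, n_bits = 8, 16
hyperplanes = rng.standard_normal((n_bits, dim))  # one random hyperplane per bit

a = rng.standard_normal(dim)
b = a + 0.1 * rng.standard_normal(dim)   # near-duplicate of a
c = rng.standard_normal(dim)             # unrelated vector

sig_a, sig_b, sig_c = (lsh_signature(v, hyperplanes) for v in (a, b, c))
# Similar vectors tend to agree on more sign bits than dissimilar ones.
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))
```

Vectors at a small angle disagree on few bits, so hashing on (bands of)
these bits sends near neighbors to the same bucket with high probability.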

On Wed, Apr 13, 2011 at 12:43 AM, ke xie <oe...@gmail.com> wrote:

> OK, I would try to implement a non-distributed one. Actually I already
> have a Python version.
>
> But I have a problem. When doing min-hash, the matrix entries should be
> either 1 or 0 before applying the hash functions. What about rating
> data, then? If the matrix is filled with values from 1 to 5, should we
> pick some threshold and convert a rating to 1 when it exceeds that
> threshold?
>
> This is the reference I read on LSH; check it out (chapter 3):
> http://infolab.stanford.edu/~ullman/mmds.html
>

Re: How about a LSH recommender ?

Posted by ke xie <oe...@gmail.com>.
OK, I would try to implement a non-distributed one. Actually I already
have a Python version.

But I have a problem. When doing min-hash, the matrix entries should be
either 1 or 0 before applying the hash functions. What about rating data,
then? If the matrix is filled with values from 1 to 5, should we pick
some threshold and convert a rating to 1 when it exceeds that threshold?

This is the reference I read on LSH; check it out (chapter 3):
http://infolab.stanford.edu/~ullman/mmds.html
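
For what it's worth, one common answer is exactly the thresholding
described above: keep only the items a user rated at or above some cutoff,
and minhash the resulting 0/1 sets. A rough Python sketch (not Mahout
code; the threshold of 3, the item names, and the hash scheme are
arbitrary illustrations):

```python
import random

def binarize(ratings, threshold=3):
    """Keep only items rated >= threshold, i.e. the 1s of the 0/1 matrix."""
    return {item for item, r in ratings.items() if r >= threshold}

def minhash_signature(items, n_hashes=50, seed=0):
    """MinHash signature: the minimum of each random hash h(x) = (a*x + b) mod p."""
    rnd = random.Random(seed)
    p = 2**31 - 1
    params = [(rnd.randrange(1, p), rnd.randrange(p)) for _ in range(n_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

user1 = {"item1": 5, "item2": 4, "item3": 1}
user2 = {"item1": 4, "item2": 5, "item4": 2}

s1, s2 = binarize(user1), binarize(user2)   # both become {"item1", "item2"}
sig1 = minhash_signature(s1)
sig2 = minhash_signature(s2)

# The fraction of agreeing signature rows estimates the Jaccard similarity.
est = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
print(s1, s2, est)
```

Here both users binarize to the same set, so their signatures agree on
every row and the estimated similarity is 1.0, even though their raw
rating vectors differ.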

On Wed, Apr 13, 2011 at 3:25 PM, Ted Dunning <te...@gmail.com> wrote:

> Sure.
>
> LSH is a fine candidate for parallelism and scaling.
>
> I would recommend starting small and testing as you go rather than leaping
> into a full-fledged parallel implementation. Look for other open-source
> implementations of LSH algorithms.
>
> Be warned that the parameter selection for LSH can be pretty tricky (so I
> hear, anyway).  You should pick a reasonable and realistic test problem so
> that you can experiment with that.
>


-- 
Name: Ke Xie   Eddy
Research Group of Information Retrieval
State Key Laboratory of Intelligent Technology and Systems
Tsinghua University

Re: How about a LSH recommender ?

Posted by Ted Dunning <te...@gmail.com>.
Sure.

LSH is a fine candidate for parallelism and scaling.

I would recommend starting small and testing as you go rather than leaping
into a full-fledged parallel implementation.  Look for other open-source
implementations of LSH algorithms.

Be warned that the parameter selection for LSH can be pretty tricky (so I
hear, anyway).  You should pick a reasonable and realistic test problem so
that you can experiment with that.
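
To illustrate why the parameters are tricky: in the banding scheme of
Ullman's chapter 3, with b bands of r rows each, a pair with similarity s
becomes a candidate with probability 1 - (1 - s^r)^b, and the similarity
threshold sits near (1/b)^(1/r). A quick way to eyeball a choice (the
b=20, r=5 numbers below are arbitrary, not a recommendation):

```python
def candidate_prob(s, bands, rows):
    """P(pair with similarity s shares at least one band) = 1 - (1 - s^rows)^bands."""
    return 1 - (1 - s**rows) ** bands

# 20 bands of 5 rows: a 100-row signature with threshold (1/20)^(1/5) ~ 0.55.
# Below the threshold the candidate probability collapses; above it, it
# rises steeply toward 1 -- the characteristic S-curve.
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s={s:.1f}  P(candidate)={candidate_prob(s, 20, 5):.3f}")
```

Shifting b and r moves the threshold and sharpens or flattens the
S-curve, which is why a realistic test problem matters: you tune these
against the similarity range your data actually has.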

On Wed, Apr 13, 2011 at 12:19 AM, ke xie <oe...@gmail.com> wrote:

> Can we implement one and contribute it to the Mahout project? Any
> suggestions?
>

Re: How about a LSH recommender ?

Posted by Benson Margulies <bi...@gmail.com>.
It takes a truly gargantuan amount of data to justify map-reducing
LSH. You can get very far with a plain single-machine implementation.

On Wed, Apr 13, 2011 at 5:57 AM, Sebastian Schelter <ss...@apache.org> wrote:
> They are using PLSI, which we already tried to implement in
> https://issues.apache.org/jira/browse/MAHOUT-106. We didn't get it scalable;
> as far as I remember the paper, they do a nasty trick when sending data to
> the reducers in a certain step so that each reducer only has to load a
> certain portion of the data into memory. I'm not sure this can be replicated
> in Hadoop (I would love to be proven wrong, though).
>
> They are also using LSH to cluster users by Jaccard coefficient; don't we
> already have code for this in org.apache.mahout.clustering.minhash?
>
> --sebastian

Re: How about a LSH recommender ?

Posted by Sebastian Schelter <ss...@apache.org>.
They are using PLSI, which we already tried to implement in 
https://issues.apache.org/jira/browse/MAHOUT-106. We didn't get it 
scalable; as far as I remember the paper, they do a nasty trick when 
sending data to the reducers in a certain step so that each reducer only 
has to load a certain portion of the data into memory. I'm not sure this 
can be replicated in Hadoop (I would love to be proven wrong, though).

They are also using LSH to cluster users by Jaccard coefficient; don't 
we already have code for this in org.apache.mahout.clustering.minhash?
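
As a reminder of what the minhash clustering is approximating, the
Jaccard coefficient between two users' item sets is simply intersection
over union (a toy sketch; the story names are made up):

```python
def jaccard(a, b):
    """|A & B| / |A | B|: the user-overlap measure that minhash approximates."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Two users described by the sets of news items they clicked:
u1 = {"story1", "story2", "story3"}
u2 = {"story2", "story3", "story4"}
print(jaccard(u1, u2))  # 2 shared items / 4 distinct items = 0.5
```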

--sebastian

On 13.04.2011 10:49, Sean Owen wrote:
> One of the three approaches that they combine is latent semantic indexing --
> that is what I was referring to.


Re: How about a LSH recommender ?

Posted by Sean Owen <sr...@gmail.com>.
One of the three approaches that they combine is latent semantic indexing --
that is what I was referring to.

On Wed, Apr 13, 2011 at 8:33 AM, Ted Dunning <te...@gmail.com> wrote:

> Sean,
>
> Do you mean LSI (latent semantic indexing)?  Or LSH (locality sensitive
> hashing)?
>
> (are you a victim of aggressive error correction?)
>
> (or am I the victim of too little?)
>
>

Re: How about a LSH recommender ?

Posted by Ted Dunning <te...@gmail.com>.
Sean,

Do you mean LSI (latent semantic indexing)?  Or LSH (locality sensitive
hashing)?

(are you a victim of aggressive error correction?)

(or am I the victim of too little?)

On Wed, Apr 13, 2011 at 12:28 AM, Sean Owen <sr...@gmail.com> wrote:

> This approach is really three approaches put together. Elements of two of
> the approaches exist in the project -- recommendations based on
> co-occurrence, and based on clustering (though not MinHash). I don't
> believe
> there's much proper LSI in the project at the moment?
>
> I would steer you towards looking at implementing the pieces of this, which
> are more useful and reusable. Implementing this whole thing is quite a
> large
> project.
>

Re: How about a LSH recommender ?

Posted by Sean Owen <sr...@gmail.com>.
This approach is really three approaches put together. Elements of two of
them exist in the project: recommendations based on co-occurrence, and
recommendations based on clustering (though not MinHash). I don't believe
there's much proper LSI in the project at the moment.

I would steer you towards looking at implementing the pieces of this, which
are more useful and reusable. Implementing this whole thing is quite a large
project.

On Wed, Apr 13, 2011 at 8:19 AM, ke xie <oe...@gmail.com> wrote:

> Dear all:
>
> I've read a paper from Google about their news recommender
> system.
> They implemented an LSH algorithm to find the nearest neighbors, and the
> algorithm is fast for that.
>
> Can we implement one and contribute it to the Mahout project? Any
> suggestions?
>
> The paper is here:
> http://iws.seu.edu.cn/resource/Proceedings/WWW/2007/papers/paper570.pdf
>
>
> Cheers
>
>