Posted to dev@mahout.apache.org by Atul Kulkarni <at...@gmail.com> on 2009/04/01 08:09:19 UTC

Re: [gsoc] Collaborative filtering algorithms

>I agree that getting a parallel SVD running is in and of itself
>probably a good project in terms of size. On the other hand it would
>be better to end up with a basic recommender as a final product. But
>even if SVD by itself doesn't make up a complete unit by itself for
>collaborative filtering purposes, it does seem interesting enough as a
>unit within the broader mandate of Mahout as a machine learning
>project. So I personally could support this as a project indeed.

>I suppose I'd say the first step is to see if anyone's done SVD on
>Hadoop yet, and if so, finish the recommender. If not, SVD is useful
>by itself.

I would like to take this thread forward. I did not see any discussion of
this in the archive (at least under this subject or by these authors),
hence the questions below. I am interested in implementing a MapReduce-based
collaborative filtering approach.

My background:

I am Atul Kulkarni, a Master's student at the University of Minnesota Duluth,
working on the Netflix Prize problem for my graduate project. For my Machine
Learning class project I am implementing some of the proposed methods for
solving the Netflix Prize problem using the collaborative filtering methods
described in [1] and [3]. For my graduate dissertation project on the Netflix
problem, I am working under Dr. Richard Maclin. I am implementing a variant of
this method that uses predictions based on data from external sources about
movies, forming representative users and applying nearest-neighbor algorithms
to predict the rating for the active user.

I have done distributed programming using C/MPI for my architecture class and
implemented the "Top Ranked Phrases in Corpus" project, which required us to
find the top R phrases of word length N from the GigaWord Corpus. The project
details and report can be found at
http://trpc.sourceforge.net/trpc/Home.html

I do not have extensive experience with Hadoop, but I have seen the MapReduce
framework videos by Google, and I am reading the paper [4] to gain more
knowledge about MapReduce.

My Questions:

Is there anyone doing the SVD part, or is there any SVD algorithm
implementation on Hadoop? If there is, then I would like to implement the
methods described in [1],[2],[3] for matrix factorization specifically. I
found one paper on a parallel implementation for the Netflix Prize problem
[5], but I am not sure whether any implementations are already in place. If
there is no implementation of SVD on Hadoop, I would like to implement one,
provided no one else is doing it, and implement the methods described in [1]
as one of the recommenders for the Netflix problem.

I have worked on the Netflix Prize problem, and hence most of my suggested
algorithms revolve around it. But I am open to other algorithms that might be
out there. Is this a reasonable approach?
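To make concrete what I have in mind, here is a rough single-machine sketch of the incremental SVD approach from [3] (plain stochastic gradient descent over the observed ratings only; the rank, learning rate, regularization constant, and epoch count below are illustrative guesses, not tuned values):

```python
import random

def funk_sgd(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
             epochs=500, seed=0):
    """Sketch of Funk-style incremental SVD: learn user and item factor
    matrices by SGD over the observed (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    # Predicted rating is the dot product of the user and item factors.
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))
```

Parallelizing this on Hadoop would presumably mean partitioning the ratings across map tasks and merging factor updates, which is exactly the part I would like to work out.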


References:

1. A. Paterek. Improving regularized singular value decomposition for
collaborative filtering. In Proc. of KDD Cup Workshop at SIGKDD'07, 13th ACM
Int. Conf. on Knowledge Discovery and Data Mining, pages 39-42, San Jose, CA,
USA, 2007.

2. Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk. Scalable
Collaborative Filtering Approaches for Large Recommender Systems. JMLR
10(Mar):623-656, 2009.

3. http://sifter.org/~simon/journal/20061211.html

4. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large
Clusters. OSDI'04: Sixth Symposium on Operating System Design and
Implementation, San Francisco, CA, December 2004.

5. Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan.
Large-Scale Parallel Collaborative Filtering for the Netflix Prize. LNCS,
Algorithmic Aspects in Information and Management, Volume 5034, 2008. (I
could not find the exact reference for this paper; it can be found at
http://www.springerlink.com/content/j1076u0h14586183/.)

-- 
Regards,
Atul Kulkarni
Teaching Assistant,
Department of Computer Science,
University of Minnesota Duluth
Duluth. 55805.

Re: [gsoc] Collaborative filtering algorithms

Posted by Atul Kulkarni <at...@gmail.com>.
Thanks David, that helped.


On Wed, Apr 1, 2009 at 1:47 AM, David Hall <dl...@cs.stanford.edu> wrote:

> On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni <at...@gmail.com>
> wrote:
> > questions in line.
> >
> > On Wed, Apr 1, 2009 at 1:27 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> Nobody is working on SVD yet, but one GSOC applicant has said that they
> >> would like to work on LDA which is a probabilistic relative of SVD.
> >>
> > I do not understand the relation between LDA and SVD. In my limited
> > understanding, LDA transforms data points into a coordinate system where
> > they can be easily discriminated/classified. SVD, on the other hand, is
> > used for dimensionality reduction. Can you help me bridge the gap by
> > suggesting something to read?
>
> LDA is an overloaded term. To the frequentist, it usually means Linear
> Discriminant Analysis, which is what you're talking about; to the
> bayesian machine learning people, it means Latent Dirichlet
> Allocation, which is a probabilistic dimensionality reduction
> technique for projecting documents in V-dimensional space to the
> K-simplex, with K \ll V.
>
> -- David
>
> >
> >
> >> The approach in your reference (3) is highly amenable to parallel
> >> implementation.
> >
> > Yes, I felt so too, but did not want to comment on it until I had the
> > MapReduce basics down.
> >
> >>
> >>
> >> Large-scale SVD would be a very interesting application for Mahout.
> >>
> >
> >>
> >> On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni <
> atulskulkarni@gmail.com
> >> >wrote:
> >>
> >> > Is there anyone doing the SVD part, or is there any SVD algorithm
> >> > implementation on Hadoop? If there is, then I would like to implement
> >> > the methods described in [1],[2],[3] for matrix factorization
> >> > specifically.
> >> >
> >>
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
> >>
> >
> >
> >
> > --
> > Regards,
> > Atul Kulkarni
> > Teaching Assistant,
> > Department of Computer Science,
> > University of Minnesota Duluth
> > Duluth. 55805.
> > www.d.umn.edu/~kulka053 <http://www.d.umn.edu/%7Ekulka053>
> >
>



-- 
Regards,
Atul Kulkarni
Teaching Assistant,
Department of Computer Science,
University of Minnesota Duluth
Duluth. 55805.
www.d.umn.edu/~kulka053

Re: [gsoc] Collaborative filtering algorithms

Posted by David Hall <dl...@cs.stanford.edu>.
On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni <at...@gmail.com> wrote:
> questions in line.
>
> On Wed, Apr 1, 2009 at 1:27 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> Nobody is working on SVD yet, but one GSOC applicant has said that they
>> would like to work on LDA which is a probabilistic relative of SVD.
>>
> I do not understand the relation between LDA and SVD. In my limited
> understanding, LDA transforms data points into a coordinate system where
> they can be easily discriminated/classified. SVD, on the other hand, is used
> for dimensionality reduction. Can you help me bridge the gap by suggesting
> something to read?

LDA is an overloaded term. To the frequentist, it usually means Linear
Discriminant Analysis, which is what you're talking about; to Bayesian
machine learning people, it means Latent Dirichlet Allocation, which is a
probabilistic dimensionality reduction technique for projecting documents in
V-dimensional space onto the K-simplex, with K \ll V.
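To illustrate the dimensionality-reduction side of the comparison (this toy numpy sketch shows truncated SVD only, not LDA; the matrix is made up for illustration), each document goes from V count coordinates down to K coordinates:

```python
import numpy as np

# Toy term-document count matrix: V = 5 terms, 4 documents.
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
K = 2
# Best rank-K approximation in the least-squares sense (Eckart-Young).
X_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
# Each document represented by K coordinates instead of V counts.
doc_coords = (np.diag(s[:K]) @ Vt[:K, :]).T
```

LDA would instead place each document on the K-simplex, i.e. give it a probability distribution over K topics rather than an unconstrained K-vector.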

-- David

>
>
>> The approach in your reference (3) is highly amenable to parallel
>> implementation.
>
> Yes, I felt so too, but did not want to comment on it until I had the
> MapReduce basics down.
>
>>
>>
>> Large-scale SVD would be a very interesting application for Mahout.
>>
>
>>
>> On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni <atulskulkarni@gmail.com
>> >wrote:
>>
>> > Is there anyone doing the SVD part, or is there any SVD algorithm
>> > implementation on Hadoop? If there is, then I would like to implement the
>> > methods described in [1],[2],[3] for matrix factorization specifically.
>> >
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
>
>
> --
> Regards,
> Atul Kulkarni
> Teaching Assistant,
> Department of Computer Science,
> University of Minnesota Duluth
> Duluth. 55805.
> www.d.umn.edu/~kulka053
>

Re: [gsoc] Collaborative filtering algorithms

Posted by Ted Dunning <te...@gmail.com>.
The machinery of SVD is almost always described in terms of least squares
matrix approximation without mentioning the probabilistic underpinnings of
why least-squares is a good idea.  The connection, however, goes all the way
back to Gauss' reduction of planetary position observations (this is *why*
the normal distribution is often called a Gaussian).  Gauss provided such a
compelling rationale for both the normal distribution (what I called a
Gaussian below) and the resulting least squared error formulation of the
estimation problem that everybody has just assumed that least-squared-error
estimation is the way to go.  Generally this is a pretty good
approximation.  Occasionally it is not at all good.  One place where it is a
really bad approximation is with very sparse count data.  Netflix data is a
great example, text represented as word counts per document is another.

To fill in more detail, here is a relatively jargon-filled explanation of
the connection.  I apologize for not being able to express this more
lucidly.

A more general view of both SVD and LDA is that they find probabilistic
mixture models to describe data.   SVD finds a single mixture of Gaussian
distributions that all have the same variance and uses maximum likelihood to
find this mixture.  LDA finds a multi-level mixture of multinomial models
and gives you a distribution over models that represents the distribution of
possible models given your data and explicit assumptions.

Gaussian distributions and multinomials look quite different, but for
relatively large observed counts their log-likelihood functions become very
similar.  For Gaussians, the log-likelihood is just the sum of squared
deviations from the mean.  For large counts, the log-likelihood for
multinomials approximates squared deviations from the mean.
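A small numerical check of that last claim (the counts and probability vectors are arbitrary, chosen only to show the effect): around the maximum-likelihood estimate, the multinomial log-likelihood ratio is well approximated by a scaled sum of squared deviations once the counts are large:

```python
import math

def multinomial_loglik(counts, p):
    # Multinomial log-likelihood with the constant combinatorial term dropped.
    return sum(n * math.log(pi) for n, pi in zip(counts, p))

counts = [4000, 3500, 2500]              # large observed counts
N = sum(counts)
p_hat = [n / N for n in counts]          # maximum-likelihood estimate
p = [0.41, 0.34, 0.25]                   # a nearby probability vector

exact = multinomial_loglik(counts, p) - multinomial_loglik(counts, p_hat)
# Second-order expansion: a (chi-square style) sum of squared deviations.
approx = -0.5 * N * sum((pi - qi) ** 2 / qi for pi, qi in zip(p, p_hat))
```

With small counts the two quantities diverge, which is the regime where the least-squares view breaks down.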


On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni <at...@gmail.com>wrote:

> I do not understand the relation in LDA and SVD. In my limited
> understanding
> I understand LDA transforms data points in to a coordinate system  where
> they can be easily discriminated/classified. SVD on the other hand is used
> for dimension reduction, can you help me bridge the gap by providing
> something to read on?




-- 
Ted Dunning, CTO
DeepDyve

Re: [gsoc] Collaborative filtering algorithms

Posted by Atul Kulkarni <at...@gmail.com>.
Questions inline.

On Wed, Apr 1, 2009 at 1:27 AM, Ted Dunning <te...@gmail.com> wrote:

> Nobody is working on SVD yet, but one GSOC applicant has said that they
> would like to work on LDA which is a probabilistic relative of SVD.
>
I do not understand the relation between LDA and SVD. In my limited
understanding, LDA transforms data points into a coordinate system where
they can be easily discriminated/classified. SVD, on the other hand, is used
for dimensionality reduction. Can you help me bridge the gap by suggesting
something to read?


> The approach in your reference (3) is highly amenable to parallel
> implementation.

Yes, I felt so too, but did not want to comment on it until I had the
MapReduce basics down.

>
>
> Large-scale SVD would be a very interesting application for Mahout.
>

>
> On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni <atulskulkarni@gmail.com
> >wrote:
>
> > Is there anyone doing the SVD part, or is there any SVD algorithm
> > implementation on Hadoop? If there is, then I would like to implement the
> > methods described in [1],[2],[3] for matrix factorization specifically.
> >
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Regards,
Atul Kulkarni
Teaching Assistant,
Department of Computer Science,
University of Minnesota Duluth
Duluth. 55805.
www.d.umn.edu/~kulka053

Re: [gsoc] Collaborative filtering algorithms

Posted by Ted Dunning <te...@gmail.com>.
Nobody is working on SVD yet, but one GSOC applicant has said that they
would like to work on LDA which is a probabilistic relative of SVD.

The approach in your reference (3) is highly amenable to parallel
implementation.

Large-scale SVD would be a very interesting application for Mahout.

On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni <at...@gmail.com>wrote:

> Is there anyone doing the SVD part, or is there any SVD algorithm
> implementation on Hadoop? If there is, then I would like to implement the
> methods described in [1],[2],[3] for matrix factorization specifically.
>


-- 
Ted Dunning, CTO
DeepDyve

Re: [gsoc] Collaborative filtering algorithms

Posted by Atul Kulkarni <at...@gmail.com>.
On Wed, Apr 1, 2009 at 1:30 AM, Ted Dunning <te...@gmail.com> wrote:

> I would hope that your SVD implementation would not be limited to NetFlix
> like problems, but would be applicable to any reasonably sparse matrix-like
> data.
>
Yes, of course. It would apply to any large, sparse matrix.

>
> Likewise, I would expect a good SVD implementation to be useful for nearest
> neighbor methods or direct prediction by smoothing the history vector.
>
I do not know much about this as of now; I will read up and comment.

>
> On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni <atulskulkarni@gmail.com
> >wrote:
>
> > I have worked with Netflix Prize problem and hence most of my suggested
> > algorithms revolve around that problem. But I am open to other algorithms
> > that might be out there. Is this a good thing to do?
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Regards,
Atul Kulkarni
Teaching Assistant,
Department of Computer Science,
University of Minnesota Duluth
Duluth. 55805.
www.d.umn.edu/~kulka053

Re: [gsoc] Collaborative filtering algorithms

Posted by Ted Dunning <te...@gmail.com>.
I would hope that your SVD implementation would not be limited to
Netflix-like problems, but would be applicable to any reasonably sparse
matrix-like data.

Likewise, I would expect a good SVD implementation to be useful for nearest
neighbor methods or direct prediction by smoothing the history vector.
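As a toy illustration of what I mean by smoothing the history vector (filling missing ratings with zeros here is a simplification; a real implementation would center the data or handle missing entries explicitly):

```python
import numpy as np

# Toy user x item rating matrix; zeros stand in for missing ratings.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 5, 4],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
K = 2
# Rank-K reconstruction: a dense, "smoothed" version of every history vector.
R_smooth = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# The smoothed row for user 0 can feed a cosine-similarity nearest-neighbor
# search, or its entries can be read off directly as predicted ratings.
user0 = R_smooth[0]
```

The point is that the low-rank reconstruction fills in every cell, so similarity computations no longer hinge on which items two users happened to rate in common.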

On Tue, Mar 31, 2009 at 11:09 PM, Atul Kulkarni <at...@gmail.com>wrote:

> I have worked with Netflix Prize problem and hence most of my suggested
> algorithms revolve around that problem. But I am open to other algorithms
> that might be out there. Is this a good thing to do?
>



-- 
Ted Dunning, CTO
DeepDyve