Posted to dev@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2010/10/04 16:03:37 UTC

Some ideas for Mahout 0.5

Hi,

The amount of work currently going into finishing 0.4 is amazing; I can
hardly follow all the mails, very cool to see that. I've had some time
today to write down ideas for features for version 0.5 and want to share
them here for feedback.

First, here are some possible new features for RecommenderJob:

   * add an option that makes the RecommenderJob use the output of the
     related o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob
     instead of computing the similarities again each time; this would
     give users the possibility to choose the interval at which to
     precompute the item similarities

   * add an option to make the RecommenderJob include "recommended
     because of" items with each recommended item (analogous to what is
     already available in
     GenericItemBasedRecommender.recommendedBecause(...), see the sketch
     after this list); showing this to users helps them understand why
     some item was recommended to them
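
For illustration, here is a rough sketch of how the second option's data
is already exposed by the non-distributed API (the file name, IDs and
similarity choice are made up for the example):

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class RecommendedBecauseExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv is a placeholder: one "userID,itemID,preference" line per rating
        DataModel model = new FileDataModel(new File("ratings.csv"));
        GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(
            model, new TanimotoCoefficientSimilarity(model));
        // the 3 items user 42 has rated that contribute most to recommending item 17
        List<RecommendedItem> because = recommender.recommendedBecause(42L, 17L, 3);
        for (RecommendedItem item : because) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

The new option would essentially emit the same kind of explanation list
next to each recommended item in the job's output.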


Second, I'd like Mahout to have a Map/Reduce implementation of the
algorithm described in Y. Zhou et al.: "Large-scale Parallel 
Collaborative Filtering for the Netflix Prize" (http://bit.ly/cUPgqr).

Here R is the matrix of user ratings for movies, and each user and each
movie is projected onto a "feature" space (the number of features is
fixed in advance) so that the product of the resulting matrices U and M
is a low-rank approximation/factorization of R.

Determining U and M is mathematically modelled as an optimization
problem, and some regularization is added to avoid overfitting to the
known entries. The problem is solved with an iterative approach called
alternating least squares (ALS).
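
If I read the paper's notation correctly (k features, I the set of known
(user, movie) pairs, n_{u_i} and n_{m_j} the rating counts of user i and
movie j), the regularized objective is roughly

    \min_{U,M} \sum_{(i,j) \in I} (r_{ij} - u_i^\top m_j)^2
        + \lambda \Big( \sum_i n_{u_i} \|u_i\|^2 + \sum_j n_{m_j} \|m_j\|^2 \Big)

with u_i and m_j the columns of U and M, so that R \approx U^\top M on
the known entries.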

If I understand the paper correctly, this approach is easily
parallelizable. To estimate a user's feature vector you only need access
to all of his/her ratings and the feature vectors of all movies he/she
rated. To estimate a movie's feature vector you need access to all its
ratings and to the feature vectors of the users who rated it.
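
Concretely, my reading is that with M fixed each u_i has a closed-form
least-squares solution that touches only user i's ratings and the
corresponding movie columns (E the k x k identity, I_i the set of movies
user i rated, M_{I_i} the sub-matrix of their feature columns, and
r_{i,I_i} the vector of those ratings):

    u_i = (M_{I_i} M_{I_i}^\top + \lambda n_{u_i} E)^{-1} M_{I_i} r_{i,I_i}

The movie update is symmetric, and since all these solves are
independent of each other, each half-iteration parallelizes naturally
over users or movies.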

An unknown preference can then be predicted by computing the dot product
of the corresponding user and movie feature vectors.
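
In the notation above that is simply \hat{r}_{ij} = u_i^\top m_j.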

It would be very nice if someone who is familiar with the paper, or has
the time for a brief look into it, could validate that, because I don't
fully trust my mathematical analysis.

I have already created a first prototype implementation, but I
definitely need help from someone to check it conceptually, optimize the
math-related parts and test it. Maybe that could be an interesting task
for the upcoming Mahout hackathon in Berlin.

--sebastian

PS: @isabel I won't make it to the dinner today, need to rehearse my 
talk...

Re: Some ideas for Mahout 0.5

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Oct 4, 2010 at 11:44 AM, deneche abdelhakim <ad...@gmail.com>wrote:

> For Decision Forests, my goal for 0.5 is to add a 'full'
> implementation. Meaning, an implementation that can build random
> forests using the whole dataset, even if it's split among many
> machines. I found the following paper to be very interesting:
> http://www.cba.ua.edu/~mhardin/rainforest.pdf
> although the described approach doesn't work as it is for numerical
> attributes.
>

Very cool.

I would love it if DF became a first class Mahout classifier.

As well as scaling up, it would be very nice if there were a model
compression step to help with the deployment of DF models.



>
> The implementation should at least work for the following dataset:
>
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2304&categoryID=248
> it's 50 GB, and a small subset is available in UCI. It contains only
> categorical attributes, and it's big enough to be a good candidate.
>

Which UCI dataset is this?  The income>50k$ one?

Does the AWS dataset have household income?

Re: Some ideas for Mahout 0.5

Posted by deneche abdelhakim <ad...@gmail.com>.
For Decision Forests, my goal for 0.5 is to add a 'full'
implementation. Meaning, an implementation that can build random
forests using the whole dataset, even if it's split among many
machines. I found the following paper to be very interesting:
http://www.cba.ua.edu/~mhardin/rainforest.pdf
although the described approach doesn't work as it is for numerical attributes.

The implementation should at least work for the following dataset:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2304&categoryID=248
it's 50 GB, and a small subset is available in UCI. It contains only
categorical attributes, and it's big enough to be a good candidate.

On another note, my svn password has not been restored yet, so I am
more a contributor than a committer =P

On Mon, Oct 4, 2010 at 3:11 PM, Ted Dunning <te...@gmail.com> wrote:
> [...]

Re: Some ideas for Mahout 0.5

Posted by Ted Dunning <te...@gmail.com>.
My own feeling is that we need to get some sort of recommender that
supports side information, possibly also as a classifier.

As everybody knows, I have lately been quite enamored of Menon and
Elkan's paper on Latent Factor Log-Linear models.  It seems to subsume
most other factorization methods and supports side data very naturally.
Training is reportedly very fast using SGD techniques.

The paper is here: http://arxiv.org/abs/1006.2156

On Mon, Oct 4, 2010 at 7:03 AM, Sebastian Schelter <ss...@apache.org> wrote:

> [...]