Posted to dev@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2010/10/04 16:03:37 UTC
Some ideas for Mahout 0.5
Hi,
The amount of work currently being put into finishing 0.4 is amazing; I
can hardly keep up with all the mails, very cool to see that. I've had some
time today to write down ideas for features I have for version 0.5 and
want to share them here for feedback.
First, I can think of some possible new features for the RecommenderJob:
* add an option that makes the RecommenderJob use the output of the related
  o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob instead of computing
  the similarities again each time; this would give users the possibility to
  choose the interval at which the item similarities are precomputed
* add an option to make the RecommenderJob include "recommended because of"
  items with each recommended item (analogous to what is already available in
  GenericItemBasedRecommender.recommendedBecause(...)); showing these to
  users helps them understand why an item was recommended to them
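To make the second idea concrete, here is a minimal sketch (in Python rather than Mahout's Java, with made-up toy data) of how the explanation items could be picked: the items the user already rated that are most similar to the recommended one, mirroring what GenericItemBasedRecommender.recommendedBecause(...) exposes. The function name and the similarity map are illustrative assumptions, not Mahout's actual API.

```python
def recommended_because(user_rated_items, similarities, recommended_item, how_many=2):
    """Return the user's rated items that contributed most to recommending
    `recommended_item`, ranked by item-item similarity (hypothetical helper)."""
    contributions = [
        (rated, similarities.get(frozenset((rated, recommended_item)), 0.0))
        for rated in user_rated_items
        if rated != recommended_item
    ]
    contributions.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in contributions[:how_many]]

# toy item-item similarities, keyed by unordered pair so sim(A,B) == sim(B,A)
sims = {
    frozenset(("itemA", "itemX")): 0.9,
    frozenset(("itemB", "itemX")): 0.4,
    frozenset(("itemC", "itemX")): 0.7,
}
print(recommended_because({"itemA", "itemB", "itemC"}, sims, "itemX"))
# -> ['itemA', 'itemC']: the strongest reasons for recommending itemX
```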
Second, I'd like Mahout to have a Map/Reduce implementation of the
algorithm described in Y. Zhou et al.: "Large-scale Parallel
Collaborative Filtering for the Netflix Prize" (http://bit.ly/cUPgqr).
Here R is the matrix of ratings of users towards movies, and each user
and each movie is projected onto a "feature" space (the number of features
is fixed beforehand) so that the product of the resulting matrices U and M
is a low-rank approximation/factorization of R.
Determining U and M is modelled mathematically as an optimization
problem, and some regularization is additionally applied to avoid
overfitting to the known entries. The problem is solved with an
iterative approach called alternating least squares (ALS).
If I understand the paper correctly, this approach is easily
parallelizable. To estimate a user's feature vector you only need
access to all of his or her ratings and to the feature vectors of the
movies he or she rated. To estimate a movie's feature vector you need
access to all of its ratings and to the feature vectors of the users who
rated it.
An unknown preference can then be predicted by computing the dot product
of the corresponding user and movie feature vectors.
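The alternating updates described above can be sketched in a few lines. This is a minimal pure-Python illustration (not the planned Map/Reduce prototype), using two latent features so the regularized least-squares step can be solved with Cramer's rule; the toy ratings matrix and all parameter values are made up.

```python
import random

K = 2          # number of latent features
LAMBDA = 0.05  # regularization weight
STEPS = 20     # number of alternating sweeps

# ratings[user][movie]; None marks an unknown preference
R = [
    [5.0, 3.0, None, 1.0],
    [4.0, None, None, 1.0],
    [1.0, 1.0, None, 5.0],
    [None, 1.0, 5.0, 4.0],
]
n_users, n_movies = len(R), len(R[0])

def solve2(a, b, c, d, e, f):
    """Solve the 2x2 system [[a, b], [c, d]] x = [e, f] by Cramer's rule."""
    det = a * d - b * c
    return [(e * d - b * f) / det, (a * f - e * c) / det]

def update(fixed, ratings_of):
    """One half-sweep of ALS: re-estimate each vector from its known ratings
    and the fixed factors of the counterpart items, by solving the
    regularized normal equations (F^T F + lambda * n * I) x = F^T r."""
    new = []
    for ratings in ratings_of:
        a = b = d = y0 = y1 = 0.0
        n = 0
        for j, r in enumerate(ratings):
            if r is None:
                continue
            f0, f1 = fixed[j]
            a += f0 * f0; b += f0 * f1; d += f1 * f1
            y0 += f0 * r; y1 += f1 * r
            n += 1
        a += LAMBDA * n; d += LAMBDA * n
        new.append(solve2(a, b, b, d, y0, y1))
    return new

random.seed(0)
U = [[random.random() for _ in range(K)] for _ in range(n_users)]
M = [[random.random() for _ in range(K)] for _ in range(n_movies)]

cols = [[R[i][j] for i in range(n_users)] for j in range(n_movies)]
for _ in range(STEPS):
    U = update(M, R)     # fix M, solve for each user vector independently
    M = update(U, cols)  # fix U, solve for each movie vector independently

def predict(i, j):
    """Predict an unknown preference as the dot product u_i . m_j."""
    return sum(ui * mj for ui, mj in zip(U[i], M[j]))
```

Note that each user (and each movie) update touches only that row's (column's) ratings plus the fixed factor matrix, which is exactly why the two half-sweeps map naturally onto Map/Reduce.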
It would be very nice if someone who is familiar with the paper, or has
the time for a brief look into it, could validate that, because I don't
fully trust my mathematical analysis.
I have already created a first prototype implementation, but I definitely
need help from someone to check it conceptually, optimize the math-related
parts and help me test it. Maybe that could be an interesting task for
the upcoming Mahout hackathon in Berlin.
--sebastian
PS: @isabel I won't make it to the dinner today, need to rehearse my
talk...
Re: Some ideas for Mahout 0.5
Posted by Ted Dunning <te...@gmail.com>.
On Mon, Oct 4, 2010 at 11:44 AM, deneche abdelhakim <ad...@gmail.com> wrote:
> For Decision Forests, my goal for 0.5 is to add a 'full'
> implementation. Meaning, an implementation that can build random
> forests using the whole dataset, even if its split among many
> machines. I found the following paper to be very interesting:
> http://www.cba.ua.edu/~mhardin/rainforest.pdf
> although the described approach doesn't work as it is for numerical
> attributes.
>
Very cool.
I would love it if DF became a first class Mahout classifier.
As well as scaling up, it would be very nice if there were a model
compression step to help with the deployment of DF models.
>
> The implementation should at least work for the following dataset:
>
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2304&categoryID=248
> it's 50 GB, and a small subset is available in UCI. It contains only
> categorical attributes, and it's big enough to be a good candidate.
>
Which UCI dataset is this? The income>50k$ one?
Does the AWS dataset have household income?
Re: Some ideas for Mahout 0.5
Posted by deneche abdelhakim <ad...@gmail.com>.
For Decision Forests, my goal for 0.5 is to add a 'full'
implementation, meaning an implementation that can build random
forests using the whole dataset, even if it's split among many
machines. I found the following paper to be very interesting:
http://www.cba.ua.edu/~mhardin/rainforest.pdf
although the described approach doesn't work as-is for numerical attributes.
The implementation should at least work for the following dataset:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2304&categoryID=248
It's 50 GB, and a small subset is available at UCI. It contains only
categorical attributes, and it's big enough to be a good candidate.
On another note, my svn password has not been restored yet, so I am
more a contributor than a committer =P
On Mon, Oct 4, 2010 at 3:11 PM, Ted Dunning <te...@gmail.com> wrote:
> My own feeling is that we need to get some sort of recommender that supports
> side information, possibly also as a classifier.
>
> As everybody knows, I have been lately quite enamored of Menon and Elkan's
> paper on Latent Factor Log-Linear models. It seems
> to subsume most other factorization methods and supports side data very
> naturally. Training is reportedly very fast using SGD
> techniques.
>
> The paper is here: http://arxiv.org/abs/1006.2156
>
> On Mon, Oct 4, 2010 at 7:03 AM, Sebastian Schelter <ss...@apache.org> wrote:
>> [quoted message snipped]
Re: Some ideas for Mahout 0.5
Posted by Ted Dunning <te...@gmail.com>.
My own feeling is that we need to get some sort of recommender that supports
side information, possibly also as a classifier.
As everybody knows, I have lately been quite enamored of Menon and Elkan's
paper on Latent Factor Log-Linear models. It seems to subsume most other
factorization methods and supports side data very naturally. Training is
reportedly very fast using SGD techniques.
The paper is here: http://arxiv.org/abs/1006.2156
On Mon, Oct 4, 2010 at 7:03 AM, Sebastian Schelter <ss...@apache.org> wrote:
> [quoted message snipped]