Posted to dev@mahout.apache.org by Richard Simon Just <in...@richardsimonjust.co.uk> on 2010/04/01 19:41:18 UTC

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

Just looking for some clarification. As a GSoC project would the SVD
option mentioned below be a case of integrating the distributed SVD of
MAHOUT-180 with the existing SVDRecommender?

If so, is there still a full GSoC project there? Or would I need to
combine it with, say, making the slope-one recommender fully distributed too?

Many Thanks
Richard

Jake Mannix wrote:
> And to provide some options, I can say that I would certainly help, Sisir,
> with getting the distributed SVD into the recommender framework.  While
> there is not much fundamental "computer science" left to that, there is
> a fair amount of high-performance *engineering* left to do in that
> direction.
>
>   -jake
>
> On Mon, Mar 22, 2010 at 4:09 PM, Sean Owen <sr...@gmail.com> wrote:
>
>   
>> I also second the recommendation to pick one of these ideas and focus
>> on it first, as it will be a lot more work to get it working,
>> documented, and tested!
>>
>> I personally like the idea of getting a distributed SVD-based
>> recommender into the project. The SVD works quite well and all we have
>> now is a non-parallel implementation.
>>
>> However I support Jake's suggestion more that RBM would be most useful
>> at the moment.
>>
>> Pick your area, dig in, and we can help you refine your ideas and begin.
>>
>> Sean
>>
>>     
>
>   

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

Posted by Richard Simon Just <in...@richardsimonjust.co.uk>.
Awesome guys,

Thanks for the quick responses! The details and clarifications are both
helpful and incredibly reassuring. I've never done a proposal before,
but no matter what happens I'm really looking forward to the end of my
exams so I can gear into Mahout properly.

Many thanks
Richard

Sean Owen wrote:
> Your audience is the project committers.
>
> I wouldn't spend much time rehashing the SVD theory. You should name
> your approach and I suppose write enough to make it clear you
> understand the algorithm enough to implement it. In this case you can
> assume we all understand the SVD well enough already.
>
> I think the proposal should rather focus on how you'd structure the
> implementation in the project, what the major classes are, and that
> you've thought reasonably about how long it'll take to build, test,
> and document.
>
> And to reiterate what Jake said, yeah there is already a fairly clear
> structure for Hadoop jobs in the recommender area -- you'd want to
> subclass AbstractJob and run series of Mappers/Reducers. You're going
> to emulate that for sure. You can probably reuse many of the Writable
> representations, in that package and in math or common, to represent
> the intermediate outputs.
>
> What's really key to your proposal is sketching the series of
> map-reduces that you'd drop into this structure to get the job done.
>
>
> On Mon, Apr 5, 2010 at 9:38 PM, Richard Simon Just
> <in...@richardsimonjust.co.uk> wrote:
>   
>> I was wondering when it comes to the proposal what sort of background
>> detail should I be going into? Should I be talking about the use of SVD
>> within a recommender situation for example? Or given that the Mentors
>> already know this should I be discussing purely what sort of SVD-based
>> recommender implementation I'm planning? I guess a question  beside the
>> question is am I aiming the proposal to people who are familiar with
>> Mahout and Machine Learning or to other people as well?
>>     
>
>   

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

Posted by Necati Batur <ne...@gmail.com>.
*IDEA: Create adapters for MySQL and NoSQL (HBase, Cassandra) to access data
for all the algorithms to use*

*Summary*

First of all, I am very excited to join a program like GSoC and, most
importantly, to work for a big open source project like Apache. I am
looking forward to good collaboration and new challenges in software
development. Information management issues in particular sound great to me.
I am confident working with new technologies. I took the Data Structures I
and II courses at university, so I am comfortable with data structures. Most
importantly, I am interested in databases. From my software engineering
courses I know how to work on a project using iterative development and
timelines.

*About Me*

I am a senior student in computer engineering at
iztech <http://english.iyte.edu.tr/main_eng.jsp?pageName=main.htm> in
Turkey. My areas of interest are information management, OOP (Object
Oriented Programming) and currently bioinformatics. I have been working with
an Assistant Professor (Jens Allmer <http://jens.allmer.de/>) in the molecular
biology and genetics department for one year. First, we worked on a protein
database, 2DB <http://www.2db.de.ms/>, and we presented the project at
HIBIT09 <http://hibit09.ii.metu.edu.tr/>. The project
was “Database management system independence by amending 2DB with a
database access layer”. Currently, I am working on another project (Kerb) as
my senior project, which is a general sequential task management system
intended to reduce errors and save time in biological
experiments. We will present this project at
HIBIT2010 <http://hibit2010.ii.metu.edu.tr/> too.

*My Offer for the Project*

The data adapters for the higher-level languages will require a good
command of data structures and some knowledge of finite mathematics, and I
am confident in both. Also, the code in the svn repository seems to need
some improvements and documentation.

Briefly, I would do the following for this project:

   1. Understand the underlying maths for adapters
   2. Determine the data structures that would be used for adapters
   3. Implement the necessary methods to handle creation of these
   structures
   4. Write test cases to check whether our code
   covers everything required by data retrieval operations
   5. Iterate on the code to make the algorithms more robust
   6. Document the overall project to tie our particular project into the
   overall scope

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

Posted by Sean Owen <sr...@gmail.com>.
Your audience is the project committers.

I wouldn't spend much time rehashing the SVD theory. You should name
your approach and I suppose write enough to make it clear you
understand the algorithm enough to implement it. In this case you can
assume we all understand the SVD well enough already.

I think the proposal should rather focus on how you'd structure the
implementation in the project, what the major classes are, and that
you've thought reasonably about how long it'll take to build, test,
and document.

And to reiterate what Jake said, yeah there is already a fairly clear
structure for Hadoop jobs in the recommender area -- you'd want to
subclass AbstractJob and run a series of Mappers/Reducers. You're going
to emulate that for sure. You can probably reuse many of the Writable
representations, in that package and in math or common, to represent
the intermediate outputs.

What's really key to your proposal is sketching the series of
map-reduces that you'd drop into this structure to get the job done.
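
[Editor's illustration, not part of the original message.] As a rough idea
of what "sketching the series of map-reduces" could look like, here is a
minimal in-memory Python simulation. It does not use Hadoop or any Mahout
class; `run_map_reduce` and the mapper/reducer names are hypothetical
helpers invented for this sketch. Phase 1 groups (user, item, rating)
triples into one sparse row per user, and phase 2 computes A^T (A v) for a
given vector v, which is the kind of matrix-vector product an iterative SVD
solver repeats many times.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Simulate one map-reduce pass in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort keys so the output order is deterministic.
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Phase 1: (user, item, rating) triples -> one sparse row per user.
def to_row_mapper(triple):
    user, item, rating = triple
    yield user, (item, rating)


def to_row_reducer(user, item_ratings):
    return user, dict(item_ratings)


# Phase 2: for a fixed vector v, compute A^T (A v) one row at a time.
def make_times_squared_mapper(v):
    def mapper(row_record):
        _user, row = row_record
        # Dot product of this user's sparse row with v.
        dot = sum(rating * v.get(item, 0.0) for item, rating in row.items())
        # Each row contributes (row . v) * row to the result.
        for item, rating in row.items():
            yield item, rating * dot
    return mapper


def sum_reducer(item, partials):
    return item, sum(partials)


ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0),
    ("u2", "i1", 4.0), ("u2", "i3", 1.0),
]
rows = run_map_reduce(ratings, to_row_mapper, to_row_reducer)
v = {"i1": 1.0, "i2": 0.0, "i3": 0.0}
result = dict(run_map_reduce(rows, make_times_squared_mapper(v), sum_reducer))
print(result)
```

In a real job each phase would be a Mapper/Reducer pair with Writable keys
and values, and the intermediate rows would live on HDFS rather than in a
Python list.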


On Mon, Apr 5, 2010 at 9:38 PM, Richard Simon Just
<in...@richardsimonjust.co.uk> wrote:
> I was wondering when it comes to the proposal what sort of background
> detail should I be going into? Should I be talking about the use of SVD
> within a recommender situation for example? Or given that the Mentors
> already know this should I be discussing purely what sort of SVD-based
> recommender implementation I'm planning? I guess a question  beside the
> question is am I aiming the proposal to people who are familiar with
> Mahout and Machine Learning or to other people as well?

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

Posted by Jake Mannix <ja...@gmail.com>.
Hi Richard,

  A few notes about what would be required to get a nice distributed SVD
recommender in Mahout:  if you look at the current distributed recommenders
(in org.apache.mahout.cf.taste.hadoop package and children), you can see
how it works: using HDFS-backed data, a batch of recommendations is
computed for all users at once, with the output being another HDFS file,
which can then be used in a recommender.

  A distributed SVD recommender would work roughly the same way -
there should be a class with a main() method which fires off a Hadoop
job to transform the data into sparse vector form, doing any weighting
or modification to the matrix (such as normalization), then use the
current DistributedLanczosSolver class to compute the SVD, and
use the resultant singular vectors and values to re-fill in the original
matrix, finding items to recommend (see the current in-memory
SVD-recommender to get a feel for how that part is done).
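
[Editor's illustration, not part of the original message.] The steps above
can be sketched in memory with plain Python. This is only a toy under
stated assumptions: it uses power iteration with deflation as a simple
stand-in for Lanczos, it skips the weighting/normalization step, and every
function name here is hypothetical rather than taken from Mahout. Users'
ratings are sparse rows (dicts from item index to rating); the top-k right
singular vectors are estimated, and each row is "re-filled" by projecting
it onto those vectors.

```python
import math


def top_right_singular_vectors(rows, n_items, k, iterations=200):
    """Estimate the top-k right singular vectors of the rating matrix A
    by power iteration with deflation on A^T A (a stand-in for Lanczos)."""
    vectors = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n_items)] * n_items
        for _ in range(iterations):
            # Deflation: remove components along vectors already found.
            for u in vectors:
                proj = sum(a * b for a, b in zip(v, u))
                v = [a - proj * b for a, b in zip(v, u)]
            # w = A^T (A v), accumulated one sparse row at a time.
            w = [0.0] * n_items
            for row in rows:
                dot = sum(r * v[j] for j, r in row.items())
                for j, r in row.items():
                    w[j] += r * dot
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        vectors.append(v)
    return vectors


def predict(row, vectors, n_items):
    """Re-fill one user's row by projecting it onto the top-k subspace."""
    scores = [0.0] * n_items
    for v in vectors:
        coeff = sum(r * v[j] for j, r in row.items())
        for j in range(n_items):
            scores[j] += coeff * v[j]
    return scores


def recommend(row, vectors, n_items, how_many=1):
    """Return the highest-scoring items the user has not rated yet."""
    scores = predict(row, vectors, n_items)
    unseen = [j for j in range(n_items) if j not in row]
    return sorted(unseen, key=lambda j: scores[j], reverse=True)[:how_many]


# Four users, four items; user 0 has not rated items 2 and 3.
rows = [
    {0: 5.0, 1: 5.0},
    {0: 5.0, 1: 5.0, 2: 5.0},
    {3: 5.0},
    {0: 4.0, 2: 4.0},
]
vectors = top_right_singular_vectors(rows, n_items=4, k=2)
recommendation = recommend(rows[0], vectors, n_items=4)
print(recommendation)
```

In the distributed version, the inner loop that accumulates A^T (A v) is the
part that would run as a Hadoop job over HDFS-backed rows, and the
per-user scoring would be a further batch job writing recommendations out.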

  I hope this helps you understand a little of the scope of this part
of the project.

  -jake


On Mon, Apr 5, 2010 at 1:38 PM, Richard Simon Just <
info@richardsimonjust.co.uk> wrote:

> Thanks for the super speedy response!
>
> Going on from what you said I've been reading up on the different SVD
> based variants used throughout the Netflix competition and working on my
> proposal. I'm focussing on what you suggested with aiming purely on the
> SVD-based recommender with the possibility of also optimizing the SVD code.
>
> I was wondering when it comes to the proposal what sort of background
> detail should I be going into? Should I be talking about the use of SVD
> within a recommender situation for example? Or given that the Mentors
> already know this should I be discussing purely what sort of SVD-based
> recommender implementation I'm planning? I guess a question  beside the
> question is am I aiming the proposal to people who are familiar with
> Mahout and Machine Learning or to other people as well?
>
> Many thanks
> Richard
>
> Sean Owen wrote:
> > It'd be a matter of making a brand-new distributed recommender. It
> > need not have anything to do with SVDRecommender, which is a fine but
> > separate non-parallel implementation.
> >
> > Tacking on distributed slope-one is fairly easy, I think. Both
> > together, with testing, documentation, etc. are certainly big enough
> > for a GSoC project, probably a bit too large.
> >
> > I'd be pleased to see someone do a quite thorough job with an
> > SVD-based recommender, and perhaps along the way analyzing and
> > optimizing the SVD impl itself, and documenting and testing well and
> > so on. That's a nice project IMHO.
> >
> > On Thu, Apr 1, 2010 at 6:41 PM, Richard Simon Just
> > <in...@richardsimonjust.co.uk> wrote:
> >
> >> Just looking for some clarification. As a GSoC project would the SVD
> >> option mentioned below be a case of integrating the distributed SVD of
> >> MAHOUT-180 with the existing SVDRecommender?
> >>
> >> If so is there still a full GSoC project there? or  would I need to
> > >> combine it with say making the slope-one recommender fully distributed too?
> >>
> >> Many Thanks
> >> Richard
> >>
> >>
> >
> >
>

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

Posted by Richard Simon Just <in...@richardsimonjust.co.uk>.
Thanks for the super speedy response!

Going on from what you said, I've been reading up on the different
SVD-based variants used throughout the Netflix competition and working on my
proposal. I'm focusing on what you suggested, aiming purely at the
SVD-based recommender, with the possibility of also optimizing the SVD code.

I was wondering when it comes to the proposal what sort of background
detail should I be going into? Should I be talking about the use of SVD
within a recommender situation for example? Or given that the Mentors
already know this should I be discussing purely what sort of SVD-based
recommender implementation I'm planning? I guess a question beside the
question is: am I aiming the proposal at people who are familiar with
Mahout and Machine Learning, or at other people as well?

Many thanks
Richard

Sean Owen wrote:
> It'd be a matter of making a brand-new distributed recommender. It
> need not have anything to do with SVDRecommender, which is a fine but
> separate non-parallel implementation.
>
> Tacking on distributed slope-one is fairly easy, I think. Both
> together, with testing, documentation, etc. are certainly big enough
> for a GSoC project, probably a bit too large.
>
> I'd be pleased to see someone do a quite thorough job with an
> SVD-based recommender, and perhaps along the way analyzing and
> optimizing the SVD impl itself, and documenting and testing well and
> so on. That's a nice project IMHO.
>
> On Thu, Apr 1, 2010 at 6:41 PM, Richard Simon Just
> <in...@richardsimonjust.co.uk> wrote:
>   
>> Just looking for some clarification. As a GSoC project would the SVD
>> option mentioned below be a case of integrating the distributed SVD of
>> MAHOUT-180 with the existing SVDRecommender?
>>
>> If so is there still a full GSoC project there? or  would I need to
>> combine it with say making the slope-one recommender fully distributed too?
>>
>> Many Thanks
>> Richard
>>
>>     
>
>   

Re: Reg. Netflix Prize Apache Mahout GSoC Application (SVD option)

Posted by Sean Owen <sr...@gmail.com>.
It'd be a matter of making a brand-new distributed recommender. It
need not have anything to do with SVDRecommender, which is a fine but
separate non-parallel implementation.

Tacking on distributed slope-one is fairly easy, I think. Both
together, with testing, documentation, etc. are certainly big enough
for a GSoC project, probably a bit too large.

I'd be pleased to see someone do a quite thorough job with an
SVD-based recommender, and perhaps along the way analyzing and
optimizing the SVD impl itself, and documenting and testing well and
so on. That's a nice project IMHO.

On Thu, Apr 1, 2010 at 6:41 PM, Richard Simon Just
<in...@richardsimonjust.co.uk> wrote:
> Just looking for some clarification. As a GSoC project would the SVD
> option mentioned below be a case of integrating the distributed SVD of
> MAHOUT-180 with the existing SVDRecommender?
>
> If so is there still a full GSoC project there? or  would I need to
> combine it with say making the slope-one recommender fully distributed too?
>
> Many Thanks
> Richard
>