Posted to dev@mahout.apache.org by Kasun Lakpriya <ka...@gmail.com> on 2010/12/29 17:51:39 UTC

Regarding Collaborative Filtering.

Hi all,
I am Kasun Lakpriya from the University of Moratuwa, Sri Lanka. I am pursuing
a BSc in Computer Science and Engineering and am now in my final year.

To complete our degree program we need to do a research project approved by
the university. The project I am working on is about "Web Personalization".
The task is to develop a personalization module that can (theoretically) be
plugged into any web application. After a literature survey we found that
there are existing open source tools we can use to implement this module.
Specifically, we are focusing on Collaborative Filtering. I have already
checked out the Mahout trunk, built it successfully, and tried an example I
found on the web [1]. I also went through the wiki page on Algorithms, found
a nice presentation about "Distributed item based collaborative filtering" by
Sebastian Schelter, and read through some of the similarity measure
implementations in Mahout.

What I would like from you all is some guidance and a helping hand to start
improving an algorithm that is already in Mahout, or suggestions for other
areas related to Collaborative Filtering that we could integrate into Mahout.
I couldn't find such a discussion in the recent mail archives. Any further
reading or references would be really appreciated.


Thanks and Regards,
Kasun

[1] http://philippeadjiman.com/blog/2009/11/11/flexible-collaborative-filtering-in-java-with-mahout-taste/

Re: Regarding Collaborative Filtering.

Posted by Lance Norskog <go...@gmail.com>.

Another great contribution would be small or mid-sized datasets and
"gold master" output sets for some of the standard computations. Doing
this properly requires both the gold masters and evaluation algorithms
that measure numerical variation against the masters.

This would be >very< educational about how Recommenders, Matrix
arithmetic, Classifiers etc. work.  Hell, I should do it.
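
A minimal sketch of the kind of "numerical variation against the master"
check described above, written in plain Java. The class and method names
(GoldMasterCheck, closeEnough, mismatches) and the tolerance values are
hypothetical and are not part of Mahout; the sketch only assumes that both
the gold master and the new run can be loaded as arrays of doubles in the
same order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Mahout: compares a run against a stored gold master. */
final class GoldMasterCheck {

    private GoldMasterCheck() {}

    /**
     * True when the two values are equal, or differ by at most absTol,
     * or differ by at most relTol times the larger magnitude.
     */
    static boolean closeEnough(double expected, double actual, double absTol, double relTol) {
        if (Double.isNaN(expected) || Double.isNaN(actual)) {
            return Double.isNaN(expected) && Double.isNaN(actual);
        }
        if (expected == actual) {
            return true; // also covers matching infinities
        }
        double diff = Math.abs(expected - actual);
        double scale = Math.max(Math.abs(expected), Math.abs(actual));
        return diff <= absTol || diff <= relTol * scale;
    }

    /** Returns the indices at which the run disagrees with the gold master beyond tolerance. */
    static List<Integer> mismatches(double[] gold, double[] actual, double absTol, double relTol) {
        if (gold.length != actual.length) {
            throw new IllegalArgumentException("length mismatch: " + gold.length + " vs " + actual.length);
        }
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < gold.length; i++) {
            if (!closeEnough(gold[i], actual[i], absTol, relTol)) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Made-up numbers for illustration only; not real Mahout output.
        double[] gold = {0.9123, 0.5000, 0.0000};
        double[] run = {0.9123000001, 0.5001, 1e-12};
        System.out.println("indices out of tolerance: " + mismatches(gold, run, 1e-9, 1e-6));
    }
}
```

Combining an absolute and a relative tolerance is one common choice: the
absolute bound handles values near zero, and the relative bound handles
large values. Other evaluation schemes (rank-based comparison of
recommendation lists, for instance) may suit recommenders better.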

On Thu, Jan 20, 2011 at 9:58 AM, Kasun Lakpriya <ka...@gmail.com> wrote:
> Thanks Sean and Sebastian.
>
> Yes, it's still far away; I have just finished the documentation stuff.
>
> I will go through this material (thanks for the links, Sebastian) and try
> to get familiar with Mahout. After that I can go into your suggestions one
> by one.

-- 
Lance Norskog
goksron@gmail.com

Re: Regarding Collaborative Filtering.

Posted by Kasun Lakpriya <ka...@gmail.com>.

Thanks Sean and Sebastian.

Yes, it's still far away; I have just finished the documentation stuff.

I will go through this material (thanks for the links, Sebastian) and try
to get familiar with Mahout. After that I can go into your suggestions one
by one.

On Thu, Jan 20, 2011 at 1:46 PM, Sebastian Schelter <ss...@apache.org> wrote:

> I'd be very interested in benchmark data for and/or performance increases
> of RecommenderJob (as well as ItemSimilarityJob and RowSimilarityJob which
> are used internally), if you feel like working on that.
>
> A good starting point to get familiar with the functionality might be
> Sean's talk from Berlin Buzzwords (
> http://berlinbuzzwords.blip.tv/file/3811036/ ) and my slides from Berlin's
> last Hadoop Get Together ( http://www.slideshare.net/sscdotopen/mahoutcf )
>
> --sebastian

Re: Regarding Collaborative Filtering.

Posted by Sebastian Schelter <ss...@apache.org>.

I'd be very interested in benchmark data for and/or performance
increases of RecommenderJob (as well as ItemSimilarityJob and
RowSimilarityJob which are used internally), if you feel like working on
that.

A good starting point to get familiar with the functionality might be
Sean's talk from Berlin Buzzwords
( http://berlinbuzzwords.blip.tv/file/3811036/ ) and my slides from
Berlin's last Hadoop Get Together
( http://www.slideshare.net/sscdotopen/mahoutcf )

--sebastian

On 20.01.2011 09:08, Sean Owen wrote:
> I think it's far from complete or done.
>
> I think it would be interesting to take any of the MapReduce-based jobs, set
> it up, run it, and benchmark/profile it to locate some bottlenecks, then
> propose optimizations. It is a good way to get familiar with the packages.
>
> You might also investigate suggested settings for Hadoop when running these
> jobs.
>
> This is just one way you could contribute. Looking into open
> issues in JIRA, or adding unit tests, would be fine too.

Re: Regarding Collaborative Filtering.

Posted by Sean Owen <sr...@gmail.com>.

I think it's far from complete or done.

I think it would be interesting to take any of the MapReduce-based jobs, set
it up, run it, and benchmark/profile it to locate some bottlenecks, then
propose optimizations. It is a good way to get familiar with the packages.
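
As a rough illustration of the benchmarking step, here is a small
wall-clock timing harness in plain Java. It is only a sketch: JobTimer and
its methods are hypothetical names, the Runnable in main is a stand-in
workload, and wiring in an actual Mahout MapReduce job (and a real
profiler) is left out because the thread does not specify how the job is
launched.

```java
import java.util.Arrays;

/** Hypothetical timing harness, not part of Mahout. */
final class JobTimer {

    private JobTimer() {}

    /**
     * Runs the task `warmups` times without measuring, then `runs` times
     * with measurement. Returns the elapsed nanoseconds of each measured run.
     */
    static long[] time(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    /** Median of the given values; less sensitive to one slow outlier run than the mean. */
    static long median(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no measurements");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the code that launches the job under test.
        Runnable placeholder = () -> {
            double sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += Math.sqrt(i);
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable; keeps the loop result in use");
            }
        };
        long[] nanos = time(placeholder, 2, 5);
        System.out.printf("median %.3f ms over %d runs%n", median(nanos) / 1e6, nanos.length);
    }
}
```

Wall-clock timing like this only shows how long a whole run takes. Locating
bottlenecks inside a job would still need a profiler or the counters and
logs that Hadoop itself reports.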

You might also investigate suggested settings for Hadoop when running these
jobs.

This is just one way you could contribute. Looking into open
issues in JIRA, or adding unit tests, would be fine too.

On Thu, Jan 20, 2011 at 3:36 AM, Kasun Lakpriya <ka...@gmail.com> wrote:

> Hi Sean,
> Thanks for the immediate reply, and sorry for my late response.
>
> The project mentioned above is in progress.
>
> BTW, I have realized that Mahout is quite an interesting and very active
> project, and I am interested in contributing to it. As understanding the
> complete code base is not an easy task, I would like to start from some
> basic point. After getting familiar with the code base I can think about
> your suggestion of "improving its speed or reducing its memory/disk usage".
>
> So what would be a good starting point?
>
> Thank you,
> Kasun

Re: Regarding Collaborative Filtering.

Posted by Kasun Lakpriya <ka...@gmail.com>.

Hi Sean,
Thanks for the immediate reply, and sorry for my late response.

The project mentioned above is in progress.

BTW, I have realized that Mahout is quite an interesting and very active
project, and I am interested in contributing to it. As understanding the
complete code base is not an easy task, I would like to start from some
basic point. After getting familiar with the code base I can think about
your suggestion of "improving its speed or reducing its memory/disk usage".

So what would be a good starting point?

Thank you,
Kasun

On Thu, Dec 30, 2010 at 5:56 PM, Sean Owen <sr...@gmail.com> wrote:

> Hi Kasun,
>
> If you want to get involved, you are free to discuss and propose your own
> changes and algorithms. You can review the list of open issues here:
> https://issues.apache.org/jira/browse/MAHOUT
> It contains some ideas about work that needs to be done.
>
> One interesting project would be to benchmark the existing distributed
> item-based recommender and find ways to improve its speed or reduce its
> memory/disk usage. That's a fairly simple starter project and quite useful.
>
> Sean

Re: Regarding Collaborative Filtering.

Posted by Sean Owen <sr...@gmail.com>.

Hi Kasun,

If you want to get involved, you are free to discuss and propose your own
changes and algorithms. You can review the list of open issues here:
https://issues.apache.org/jira/browse/MAHOUT
It contains some ideas about work that needs to be done.

One interesting project would be to benchmark the existing distributed
item-based recommender and find ways to improve its speed or reduce its
memory/disk usage. That's a fairly simple starter project and quite useful.

Sean
