Posted to dev@mahout.apache.org by Gokhan Capan <gk...@gmail.com> on 2013/07/15 23:57:58 UTC

Regarding Online Recommenders

Based on the conversation in MAHOUT-1274, I put some code here:

https://github.com/gcapan/mahout/tree/onlinerec
I hope that would initiate a discussion on OnlineRecommender approaches.

I think the OnlineRecommender would require (similar to what Sebastian
commented there):

1- A DataModel that allows adding new users/items and performs fast
iteration
2- An online learning interface that allows updating the model with new
feedback and making predictions based on the latest model

The code is a "very early" effort for the latter, and it contains a matrix
factorization-based implementation where training is done by SGD.
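For concreteness, here is a minimal sketch of the per-observation SGD update such a learner performs for regularized matrix factorization. The class name, hyperparameters, and dense-array storage are illustrative assumptions, not the actual code in the branch:

```java
import java.util.Random;

// Illustrative online matrix factorization learner: one SGD step per
// observed (user, item, rating) triple. Not the actual code in the branch.
class OnlineSgdFactorizer {
    private final double[][] userFactors; // one latent vector per user
    private final double[][] itemFactors; // one latent vector per item
    private final double learningRate;
    private final double lambda;          // L2 regularization strength

    OnlineSgdFactorizer(int numUsers, int numItems, int k,
                        double learningRate, double lambda) {
        this.userFactors = new double[numUsers][k];
        this.itemFactors = new double[numItems][k];
        this.learningRate = learningRate;
        this.lambda = lambda;
        Random rnd = new Random(42);
        for (double[] row : userFactors)
            for (int f = 0; f < k; f++) row[f] = rnd.nextGaussian() * 0.1;
        for (double[] row : itemFactors)
            for (int f = 0; f < k; f++) row[f] = rnd.nextGaussian() * 0.1;
    }

    double predict(int u, int i) {
        double dot = 0;
        for (int f = 0; f < userFactors[u].length; f++)
            dot += userFactors[u][f] * itemFactors[i][f];
        return dot;
    }

    // One online update from a single feedback; only the two factor rows
    // involved are touched, so new users/items just mean appending rows.
    void train(int u, int i, double rating) {
        double err = rating - predict(u, i);
        for (int f = 0; f < userFactors[u].length; f++) {
            double uf = userFactors[u][f];
            double vf = itemFactors[i][f];
            userFactors[u][f] += learningRate * (err * vf - lambda * uf);
            itemFactors[i][f] += learningRate * (err * uf - lambda * vf);
        }
    }
}
```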

The model is stored in a DenseMatrix -- it should be replaced with a matrix
that allows adding new rows and doesn't allocate space for empty rows
(please search for DenseRowMatrix and BlockSparseMatrix in the dev list,
and see MAHOUT-1193 for the relevant issue).

I didn't try that on a dataset yet.

The DataModel I imagine would follow the current API, where underlying
preference storage is replaced with a matrix.

A Recommender would then use the DataModel and the OnlineLearner, where
Recommender#setPreference is delegated to DataModel#setPreference (like it
does now), and DataModel#setPreference triggers OnlineLearner#train.
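A hedged sketch of that wiring, with hypothetical interface names (the existing Mahout Taste API differs):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interfaces sketching the proposed wiring; names are
// illustrative, not Mahout's existing org.apache.mahout.cf.taste API.
interface OnlineLearner {
    void train(long userID, long itemID, float value); // update from one feedback
    float predict(long userID, long itemID);           // score under latest model
}

// The DataModel both stores the preference and pushes it to the learner, so
// a Recommender#setPreference that delegates here keeps the model current.
class MatrixBackedDataModel {
    private final OnlineLearner learner;
    private final Map<Long, Map<Long, Float>> prefs = new HashMap<>();

    MatrixBackedDataModel(OnlineLearner learner) {
        this.learner = learner;
    }

    void setPreference(long userID, long itemID, float value) {
        prefs.computeIfAbsent(userID, u -> new HashMap<>()).put(itemID, value);
        learner.train(userID, itemID, value); // setPreference triggers training
    }
}
```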

Gokhan

Re: Regarding Online Recommenders

Posted by Gokhan Capan <gk...@gmail.com>.
It is two SparseRowMatrices, Peng. But I don't want to comment on it before
actually trying it. This is essentially a first step for me to choose my
side in the DataModel implementation discussion :)

Gokhan

On Fri, Jul 19, 2013 at 2:25 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

> Wow, that's lightning fast.
>
> Is it a SparseMatrix or DenseMatrix?
>
>
> On 13-07-18 07:23 PM, Gokhan Capan wrote:
>
>> I just started to implement a Matrix backed data model and pushed it, to
>> check the performance and memory considerations.
>>
>> I believe I can try it on some data tomorrow.
>>
>> Best
>>
>> Gokhan
>>
>>
>> On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng <pc...@uowmail.edu.au>
>> wrote:
>>
>>  I see, sorry I was too presumptuous. I only recently worked with and
>>> tested SVDRecommender, so I could not have known the efficiency of an
>>> item-based recommender. Maybe there is space for algorithmic optimization.
>>>
>>> The online recommender Gokhan is working on is also an SVDRecommender. An
>>> online user-based or item-based recommender based on a clustering technique
>>> would definitely be critical, but we need an expert to volunteer :)
>>>
>>> Perhaps Dr. Dunning could say a few words? He announced the online
>>> clustering component.
>>>
>>> Yours Peng
>>>
>>>
>>> On 13-07-18 03:54 PM, Pat Ferrel wrote:
>>>
>>>>  No, it was CPU-bound, not memory. I gave it something like a 14G heap.
>>>>> It was running, just too slow to be of any real use. We switched to the
>>>>> hadoop version and stored precalculated recs in a db for every user.
>>>>
>>>> On Jul 18, 2013, at 12:06 PM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>>>>
>>>> Strange, it's just a little larger than the libimseti dataset (17M
>>>> ratings). Did you encounter an OutOfMemory or GC-overhead-limit
>>>> exception? Allocating more heap space usually helps.
>>>>
>>>> Yours Peng
>>>>
>>>> On 13-07-18 02:27 PM, Pat Ferrel wrote:
>>>>
>>>>>  It was about 2.5M users and 500K items, with 25M actions over 6 months
>>>>>> of data.
>>>>>
>>>>> On Jul 18, 2013, at 10:15 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>>>>>
>>>>> If I remember right, a highlight of the 0.8 release is an online
>>>>> clustering algorithm. I'm not sure if it can be used in an item-based
>>>>> recommender, but this is definitely something I would like to pursue.
>>>>> It's probably the only advantage a non-hadoop implementation can offer
>>>>> in the future.
>>>>>
>>>>> Many non-hadoop recommenders are pretty fast. But the existing in-memory
>>>>> GenericDataModel and FileDataModel are largely implemented for sandboxes;
>>>>> IMHO they are the culprit of the scalability problem.
>>>>>
>>>>> May I ask about the scale of your dataset? How many ratings does it
>>>>> have?
>>>>>
>>>>> Yours Peng
>>>>>
>>>>> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>>>>>
>>>>>>  Well, with item-based the only problem is new items. New users can
>>>>>> immediately be served by the model (although this is not well supported
>>>>>> by the API in Mahout). For the majority of use cases I saw, it is
>>>>>> perfectly fine to have a short delay until new items "enter" the
>>>>>> recommender; usually this happens after a retraining in batch. You have
>>>>>> to handle cold-start and collect some interactions anyway.
>>>>>>
>>>>>>
>>>>>> 2013/7/18 Pat Ferrel <pa...@gmail.com>
>>>>>>
>>>>>>   Yes, what Myrrix does is good.
>>>>>>
>>>>>>> My last aside was a wish for an item-based online recommender not
>>>>>>> only
>>>>>>> factorized. Ted talks about using Solr for this, which we're
>>>>>>> experimenting
>>>>>>> with alongside Myrrix. I suspect Solr works but it does require a bit
>>>>>>> of
>>>>>>> tinkering and doesn't have quite the same set of options--no llr
>>>>>>> similarity
>>>>>>> for instance.
>>>>>>>
>>>>>>> On the same subject I recently attended a workshop in Seattle for
>>>>>>> UAI2013
>>>>>>> where Walmart reported similar results using a factorized
>>>>>>> recommender.
>>>>>>> They
>>>>>>> had to increase the factor number past where it would perform well.
>>>>>>> Along
>>>>>>> the way they saw increasing performance measuring precision offline.
>>>>>>> They
>>>>>>> eventually gave up on a factorized solution. This decision seems odd
>>>>>>> but
>>>>>>> anyway… In the case of Walmart and our data set they are quite
>>>>>>> diverse. The
>>>>>>> best idea is probably to create different recommenders for separate
>>>>>>> parts
>>>>>>> of the catalog but if you create one model on all items our intuition
>>>>>>> is
>>>>>>> that item-based works better than factorized. Again caveat--no A/B
>>>>>>> tests to
>>>>>>> support this yet.
>>>>>>>
>>>>>>> Doing an online item-based recommender would quickly run into scaling
>>>>>>> problems, no? We put together the simple Mahout in-memory version and
>>>>>>> it
>>>>>>> could not really handle more than a down-sampled few months of our
>>>>>>> data.
>>>>>>> Down-sampling lost us 20% of our precision scores so we moved to the
>>>>>>> hadoop
>>>>>>> version. Now we have use-cases for an online recommender that handles
>>>>>>> anonymous new users and that takes the story full circle.
>>>>>>>
>>>>>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <ss...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Pat
>>>>>>>
>>>>>>> I think we should provide simple support for recommending to anonymous
>>>>>>> users. We should have a method recommendToAnonymous() that takes a
>>>>>>> PreferenceArray as its argument. For item-based recommenders, it's
>>>>>>> straightforward to compute recommendations; for user-based, you have to
>>>>>>> search through all users once; for latent factor models, you have to
>>>>>>> fold the user vector into the low-dimensional space.
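The fold-in mentioned above can be sketched as a small ridge regression of the anonymous user's ratings against the fixed item factors. This is an illustration under assumed conventions, not Mahout code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of folding an anonymous user into a fixed latent space: solve the
// small ridge regression (V_S^T V_S + lambda*I) u = V_S^T r over only the
// items the user has rated. Illustrative only, not Mahout code.
class FoldIn {
    // itemFactors[i] is item i's latent vector; prefs maps item index -> rating.
    static double[] foldIn(double[][] itemFactors, Map<Integer, Double> prefs,
                           double lambda) {
        int k = itemFactors[0].length;
        double[][] a = new double[k][k]; // accumulates V_S^T V_S + lambda*I
        double[] b = new double[k];      // accumulates V_S^T r
        for (Map.Entry<Integer, Double> e : prefs.entrySet()) {
            double[] v = itemFactors[e.getKey()];
            for (int f = 0; f < k; f++) {
                b[f] += v[f] * e.getValue();
                for (int g = 0; g < k; g++) a[f][g] += v[f] * v[g];
            }
        }
        for (int f = 0; f < k; f++) a[f][f] += lambda;
        return solve(a, b); // k x k system; k is small
    }

    // Gaussian elimination with partial pivoting; adequate for small k.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmpVal = b[col]; b[col] = b[pivot]; b[pivot] = tmpVal;
            for (int r = col + 1; r < n; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= factor * a[col][c];
                b[r] -= factor * b[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * x[c];
            x[r] = s / a[r][r];
        }
        return x;
    }
}
```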
>>>>>>>
>>>>>>> I think Sean already added this method in myrrix and I have some code
>>>>>>> for
>>>>>>> my kornakapi project (a simple weblayer for mahout).
>>>>>>>
>>>>>>> Would such a method fit your needs?
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2013/7/17 Pat Ferrel <pa...@gmail.com>
>>>>>>>
>>>>>>>> May I ask how you plan to support model updates and 'anonymous'
>>>>>>>> users?
>>>>>>>>
>>>>>>>> I assume the latent factors model is still calculated offline in
>>>>>>>> batch mode, and then there are periodic updates? How are the updates
>>>>>>>> handled? Do you plan to require batch model refactorization for any
>>>>>>>> update? Or perform some partial update, maybe by just transforming
>>>>>>>> new data into the LF space already in place, then doing a full
>>>>>>>> refactorization every so often in batch mode?
>>>>>>>>
>>>>>>>> By 'anonymous users' I mean users with some history that is not yet
>>>>>>>> incorporated in the LF model. This could be history from a new user
>>>>>>>> asked to pick a few items to start the rec process, or an old user
>>>>>>>> with some new action history not yet in the model. Are you going to
>>>>>>>> allow for passing the entire history vector, or userID + incremental
>>>>>>>> new history, to the recommender? I hope so.
>>>>>>>> For what it's worth, we did a comparison of Mahout item-based CF to
>>>>>>>> Mahout ALS-WR CF on 2.5M users and 500K items, with many millions of
>>>>>>>> actions over 6 months of data. The data was purchase data from a
>>>>>>>> diverse ecom source with a large variety of products, from
>>>>>>>> electronics to clothes. We found item-based CF did far better than
>>>>>>>> ALS. As we increased the number of latent factors the results got
>>>>>>>> better, but were never within 10% of item-based (we used MAP as the
>>>>>>>> offline metric). Not sure why, but maybe it has to do with the
>>>>>>>> diversity of the item types.
>>>>>>>>
>>>>>>>> I understand that a full item-based online recommender has very
>>>>>>>> different tradeoffs, and anyway others may not have seen this
>>>>>>>> disparity of results. Furthermore, we don't have A/B test results yet
>>>>>>>> to validate the offline metric.
>>>>>>>>
>>>>>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Peng,
>>>>>>>>
>>>>>>>> This is the reason I separated out the DataModel and only put the
>>>>>>>> learner stuff there. The learner I mentioned yesterday just stores
>>>>>>>> the parameters, (noOfUsers+noOfItems)*noOfLatentFactors of them, and
>>>>>>>> does not care where preferences are stored.
>>>>>>>>
>>>>>>>> I kind of agree with the multi-level DataModel approach: one for
>>>>>>>> iterating over "all" preferences, and one for when you want to deploy
>>>>>>>> a recommender and perform a lot of top-N recommendation tasks.
>>>>>>>>
>>>>>>>> (Or one DataModel with a strategy that might reduce the existing
>>>>>>>> memory consumption while still providing fast access; I am not sure.
>>>>>>>> Let me try a matrix-backed DataModel approach.)
>>>>>>>>
>>>>>>>> Gokhan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ssc@apache.org
>>>>>>>> >
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>>>>>> representation; 12x more memory is a no-go. The techniques used in
>>>>>>>>> FactorizablePreferences allow a much more memory-efficient
>>>>>>>>> representation, tested on the KDD Music dataset, which is approx 2.5
>>>>>>>>> times Netflix and fits into 3GB with that approach.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>>>>>>>
>>>>>>>>>   Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>>>>>
>>>>>>>>>> Note also that this is before you have done any work.
>>>>>>>>>>
>>>>>>>>>> Ideally, 100 million observations should take << 1GB.
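As a back-of-envelope check of that figure: a CSR-style layout with a 4-byte item index and a 4-byte float rating per observation, plus one row offset per user, lands well under 1GB (the user count here is assumed from the dataset Pat describes):

```java
// Back-of-envelope memory estimate for ~100M observations in a CSR-style
// layout: 4-byte item index + 4-byte float rating per entry, plus one
// 8-byte row offset per user. The user count is an assumed example.
class PrefMemoryEstimate {
    static long csrBytes(long observations, long users) {
        long bytesPerObservation = 4 /* item index (int) */ + 4 /* rating (float) */;
        long rowOffsets = (users + 1) * 8; // one long offset per user row
        return observations * bytesPerObservation + rowOffsets;
    }

    public static void main(String[] args) {
        long total = csrBytes(100_000_000L, 2_500_000L);
        // ~0.76 GB, versus ~12x that for per-entry hash map storage
        System.out.printf("%.2f GB%n", total / (1024.0 * 1024 * 1024));
    }
}
```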
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc175@uowmail.edu.au
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>> The second idea is indeed splendid; we should separate the
>>>>>>>>>> time-complexity-first and space-complexity-first implementations.
>>>>>>>>>> What I'm not quite sure of is whether we really need to create two
>>>>>>>>>> interfaces instead of one.
>>>>>>>>>>
>>>>>>>>>> Personally, I think 12G of heap space is not that high, right? Most
>>>>>>>>>> new laptops can already handle that (emphasis on laptop). And if we
>>>>>>>>>> replace the hash map (the culprit of the high memory consumption)
>>>>>>>>>> with a list/linked list, it would simply degrade the time complexity
>>>>>>>>>> of a linear search to O(n), not too bad either. The current
>>>>>>>>>> DataModel is the result of careful thought and has undergone
>>>>>>>>>> extensive testing; it is easier to expand on top of it than to
>>>>>>>>>> subvert it.
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>>>
>>>
>
>

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
Wow, that's lightning fast.

Is it a SparseMatrix or DenseMatrix?




Re: Regarding Online Recommenders

Posted by Gokhan Capan <gk...@gmail.com>.
I just started to implement a Matrix backed data model and pushed it, to
check the performance and memory considerations.

I believe I can try it on some data tomorrow.

Best

Gokhan



Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
I see, sorry, I was too presumptuous. I have only recently worked with 
and tested SVDRecommender, so I could not have judged its efficiency 
against an item-based recommender. Maybe there is room for algorithmic 
optimization.

The online recommender Gokhan is working on is also an SVDRecommender. 
An online user-based or item-based recommender based on a clustering 
technique would definitely be critical, but we need an expert to 
volunteer :)

Perhaps Dr. Dunning can say a few words? He announced the online 
clustering component.

Yours Peng
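[For concreteness, the per-observation learner Gokhan describes -- one SGD step per incoming (user, item, rating) triple against (noOfUsers+noOfItems)*noOfLatentFactors parameters -- could be sketched roughly as below. All class and method names here are illustrative assumptions, not Mahout's API, and the hash-map-backed factor storage is only a stand-in for the matrix-backed DataModel under discussion.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Illustrative sketch of an online SGD matrix-factorization learner.
 *  Not Mahout's API; names and storage are assumptions for exposition. */
public class OnlineMFSketch {
    static final int K = 10;           // number of latent factors
    static final double LR = 0.01;     // SGD learning rate
    static final double LAMBDA = 0.05; // L2 regularization

    // Factors grow lazily as new users/items arrive -- the "online" part.
    static final Map<Long, double[]> userFactors = new HashMap<>();
    static final Map<Long, double[]> itemFactors = new HashMap<>();
    static final Random rnd = new Random(42);

    static double[] factors(Map<Long, double[]> side, long id) {
        return side.computeIfAbsent(id, k -> {
            double[] f = new double[K];
            for (int i = 0; i < K; i++) f[i] = rnd.nextGaussian() * 0.01;
            return f;
        });
    }

    /** Predict a preference from the current model; unseen ids get fresh factors. */
    static double predict(long userId, long itemId) {
        double[] u = factors(userFactors, userId), v = factors(itemFactors, itemId);
        double dot = 0;
        for (int i = 0; i < K; i++) dot += u[i] * v[i];
        return dot;
    }

    /** One SGD step on a single (user, item, rating) observation. */
    static void train(long userId, long itemId, double rating) {
        double[] u = factors(userFactors, userId), v = factors(itemFactors, itemId);
        double err = rating - predict(userId, itemId);
        for (int i = 0; i < K; i++) {
            double ui = u[i];
            u[i] += LR * (err * v[i] - LAMBDA * ui);
            v[i] += LR * (err * ui - LAMBDA * v[i]);
        }
    }

    public static void main(String[] args) {
        for (int pass = 0; pass < 500; pass++) train(1L, 1L, 4.0);
        // Prediction converges near the observed rating (shrunk slightly by LAMBDA).
        System.out.println(predict(1L, 1L));
    }
}
```

[A Recommender#setPreference would then just delegate to train(), which is the DataModel#setPreference-triggers-OnlineLearner#train flow proposed at the start of the thread.]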

On 13-07-18 03:54 PM, Pat Ferrel wrote:
> No, it was CPU-bound, not memory. I gave it something like a 14G heap. It was running, just too slow to be of any real use. We switched to the Hadoop version and stored precalculated recs in a db for every user.
>
> On Jul 18, 2013, at 12:06 PM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>
> Strange, it's just a little bit larger than the Libimseti dataset (17M ratings). Did you encounter an OutOfMemoryError or a GC timeout? Allocating more heap space usually helps.
>
> Yours Peng
>
> On 13-07-18 02:27 PM, Pat Ferrel wrote:
>> It was about 2.5M users and 500K items with 25M actions over 6 months of data.
>>
>> On Jul 18, 2013, at 10:15 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>>
> If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm not sure if it can be used in an item-based recommender, but this is definitely something I would like to pursue. It's probably the only advantage a non-Hadoop implementation can offer in the future.
>>
> Many non-Hadoop recommenders are pretty fast. But the existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO they are the culprit of the scalability problem.
>>
> May I ask about the scale of your dataset? How many ratings does it have?
>>
>> Yours Peng
>>
>> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>>> Well, with item-based the only problem is new items. New users can
>>> immediately be served by the model (although this is not well supported by
>>> the API in Mahout). For the majority of use cases I have seen, it is perfectly
>>> fine to have a short delay until new items "enter" the recommender, usually
>>> this happens after a retraining in batch. You have to care for cold-start
>>> and collect some interactions anyway.
>>>
>>>
>>> 2013/7/18 Pat Ferrel <pa...@gmail.com>
>>>
>>>> Yes, what Myrrix does is good.
>>>>
>>>> My last aside was a wish for an item-based online recommender not only
>>>> factorized. Ted talks about using Solr for this, which we're experimenting
>>>> with alongside Myrrix. I suspect Solr works but it does require a bit of
>>>> tinkering and doesn't have quite the same set of options--no llr similarity
>>>> for instance.
>>>>
>>>> On the same subject I recently attended a workshop in Seattle for UAI2013
>>>> where Walmart reported similar results using a factorized recommender. They
>>>> had to increase the number of factors past the point where it would perform
>>>> well. Along the way they saw offline precision keep increasing. They
>>>> eventually gave up on a factorized solution. This decision seems odd but
>>>> anyway… In the case of Walmart and our data set they are quite diverse. The
>>>> best idea is probably to create different recommenders for separate parts
>>>> of the catalog but if you create one model on all items our intuition is
>>>> that item-based works better than factorized. Again caveat--no A/B tests to
>>>> support this yet.
>>>>
>>>> Doing an online item-based recommender would quickly run into scaling
>>>> problems, no? We put together the simple Mahout in-memory version and it
>>>> could not really handle more than a down-sampled few months of our data.
>>>> Down-sampling lost us 20% of our precision scores so we moved to the hadoop
>>>> version. Now we have use-cases for an online recommender that handles
>>>> anonymous new users and that takes the story full circle.
>>>>
>>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>>
>>>> Hi Pat
>>>>
>>>> I think we should provide simple support for recommending to anonymous
>>>> users. We should have a method recommendToAnonymous() that takes a
>>>> PreferenceArray as an argument. For item-based recommenders, it's
>>>> straightforward to compute recommendations; for user-based, you have to
>>>> search through all users once; for latent factor models, you have to fold
>>>> the user vector into the low-dimensional space.
>>>>
>>>> I think Sean already added this method in myrrix and I have some code for
>>>> my kornakapi project (a simple weblayer for mahout).
>>>>
>>>> Would such a method fit your needs?
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>>
>>>> 2013/7/17 Pat Ferrel <pa...@gmail.com>
>>>>
>>>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>>>
>>>>> I assume the latent factors model is calculated offline still in batch
>>>>> mode, then there are periodic updates? How are the updates handled? Do
>>>> you
>>>>> plan to require batch model refactorization for any update? Or perform
>>>> some
>>>>> partial update by maybe just transforming new data into the LF space
>>>>> already in place then doing full refactorization every so often in batch
>>>>> mode?
>>>>>
>>>>> By 'anonymous users' I mean users with some history that is not yet
>>>>> incorporated in the LF model. This could be history from a new user asked
>>>>> to pick a few items to start the rec process, or an old user with some
>>>> new
>>>>> action history not yet in the model. Are you going to allow for passing
>>>> the
>>>>> entire history vector or userID+incremental new history to the
>>>> recommender?
>>>>> I hope so.
>>>>>
>>>>> For what it's worth we did a comparison of Mahout Item based CF to Mahout
>>>>> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months
>>>> of
>>>>> data. The data was purchase data from a diverse ecom source with a large
>>>>> variety of products from electronics to clothes. We found Item based CF
>>>> did
>>>>> far better than ALS. As we increased the number of latent factors the
>>>>> results got better but were never within 10% of item based (we used MAP
>>>> as
>>>>> the offline metric). Not sure why but maybe it has to do with the
>>>> diversity
>>>>> of the item types.
>>>>>
>>>>> I understand that a full item based online recommender has very different
>>>>> tradeoffs and anyway others may not have seen this disparity of results.
>>>>> Furthermore we don't have A/B test results yet to validate the offline
>>>>> metric.
>>>>>
>>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>>>>
>>>>> Peng,
>>>>>
>>>>> This is the reason I separated out the DataModel, and only put the
>>>> learner
>>>>> stuff there. The learner I mentioned yesterday just stores the
>>>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>>>>> where preferences are stored.
>>>>>
>>>>> I, kind of, agree with the multi-level DataModel approach:
>>>>> One for iterating over "all" preferences, one for when one wants to deploy
>>>> a
>>>>> recommender and perform a lot of top-N recommendation tasks.
>>>>>
>>>>> (Or one DataModel with a strategy that might reduce existing memory
>>>>> consumption, while still providing fast access, I am not sure. Let me
>>>> try a
>>>>> matrix-backed DataModel approach)
>>>>>
>>>>> Gokhan
>>>>>
>>>>>
>>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>>> representation; 12x more memory is a no-go. The techniques used in
>>>>>> FactorizablePreferences allow a much more memory efficient
>>>>> representation,
>>>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>>>> into
>>>>>> 3GB with that approach.
>>>>>>
>>>>>>
>>>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>>>>
>>>>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>>>
>>>>>>> Note also that this is before you have done any work.
>>>>>>>
>>>>>>> Ideally, 100 million observations should take << 1 GB.
>>>>>>>
>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>>> wrote:
>>>>>>>> The second idea is indeed splendid; we should separate the time-complexity-
>>>>>>>> first and space-complexity-first implementations. What I'm not quite sure
>>>>>>>> about is whether we really need to create two interfaces instead of one.
>>>>>>>> Personally, I think 12G of heap space is not that high, right? Most new
>>>>>>>> laptops can already handle that (emphasis on laptop). And if we replace the
>>>>>>>> hash map (the culprit of the high memory consumption) with a list/linked
>>>>>>>> list, it would simply degrade the time complexity of a linear search to
>>>>>>>> O(n), not too bad either. The current DataModel is the result of careful
>>>>>>>> thought and has undergone extensive testing; it is easier to expand on top
>>>>>>>> of it instead of subverting it.
>>
>>
>
>
>
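[For reference, the "fold-in" Sebastian mentions for anonymous users -- solving for a user vector against fixed item factors learned offline, then scoring items with it -- can be sketched as below. This is only an illustration under assumed names; it is not Mahout's or Myrrix's recommendToAnonymous(), and the toy factor matrix stands in for output of a batch ALS/SVD run.]

```java
import java.util.Arrays;

/** Sketch of folding an anonymous user's history into a latent factor space.
 *  Solves u = (Y_s^T Y_s + lambda*I)^-1 Y_s^T r over the rated items' rows of Y. */
public class FoldInSketch {
    /** Solve A x = b for a small symmetric positive-definite A (Gaussian elimination). */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) {
            m[i] = Arrays.copyOf(a[i], n + 1);
            m[i][n] = b[i];
        }
        for (int p = 0; p < n; p++)
            for (int r = p + 1; r < n; r++) {
                double f = m[r][p] / m[p][p];
                for (int c = p; c <= n; c++) m[r][c] -= f * m[p][c];
            }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = m[i][n];
            for (int j = i + 1; j < n; j++) s -= m[i][j] * x[j];
            x[i] = s / m[i][i];
        }
        return x;
    }

    /** Ridge-regress the anonymous user's ratings onto the item factors. */
    static double[] foldIn(double[][] itemFactors, int[] ratedItems, double[] ratings,
                           double lambda) {
        int k = itemFactors[0].length;
        double[][] ata = new double[k][k]; // Y_s^T Y_s
        double[] atb = new double[k];      // Y_s^T r
        for (int n = 0; n < ratedItems.length; n++) {
            double[] y = itemFactors[ratedItems[n]];
            for (int i = 0; i < k; i++) {
                atb[i] += y[i] * ratings[n];
                for (int j = 0; j < k; j++) ata[i][j] += y[i] * y[j];
            }
        }
        for (int i = 0; i < k; i++) ata[i][i] += lambda;
        return solve(ata, atb);
    }

    /** Score an item for the folded-in user: plain dot product. */
    static double score(double[] u, double[] y) {
        double s = 0;
        for (int i = 0; i < u.length; i++) s += u[i] * y[i];
        return s;
    }

    public static void main(String[] args) {
        // Toy item-factor matrix (4 items, 2 latent factors), e.g. from a batch run.
        double[][] y = { {1.0, 0.0}, {0.9, 0.1}, {0.0, 1.0}, {0.1, 0.9} };
        // An anonymous user who liked items 0 and 1 should score that "cluster" high.
        double[] u = foldIn(y, new int[] {0, 1}, new double[] {5.0, 4.5}, 0.01);
        System.out.println(score(u, y[0]) + " vs " + score(u, y[2]));
    }
}
```

[This also covers Pat's "old user with new history" case: pass the full PreferenceArray, fold in, and score, without waiting for a batch refactorization.]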



Re: Regarding Online Recommenders

Posted by Pat Ferrel <pa...@gmail.com>.
No, it was CPU-bound, not memory. I gave it something like a 14G heap. It was running, just too slow to be of any real use. We switched to the Hadoop version and stored precalculated recs in a db for every user.

On Jul 18, 2013, at 12:06 PM, Peng Cheng <pc...@uowmail.edu.au> wrote:

Strange, it's just a little bit larger than the Libimseti dataset (17M ratings). Did you encounter an OutOfMemoryError or a GC timeout? Allocating more heap space usually helps.

Yours Peng





Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
Strange, it's just a little bit larger than the Libimseti dataset (17M 
ratings). Did you encounter an OutOfMemoryError or a GC timeout? 
Allocating more heap space usually helps.

Yours Peng
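[To make the memory argument from earlier in the thread concrete -- Ted's "100 million observations should take << 1 GB" versus the 12G heap complaint -- here is a back-of-envelope comparison of a CSR-style primitive layout against one-object-per-preference storage. The per-object JVM overhead figures are rough assumptions for a typical 64-bit HotSpot, not measurements.]

```java
/** Back-of-envelope memory estimates for preference storage.
 *  Overhead constants are approximate assumptions, not measured values. */
public class MemoryEstimate {
    /** CSR-style layout: per preference an int item index + float rating,
     *  plus one int row offset per user. */
    static long csrBytes(long numPrefs, long numUsers) {
        return numPrefs * (4 + 4) + (numUsers + 1) * 4;
    }

    /** One boxed Preference object per rating inside a HashMap: rough per-entry
     *  cost of object header + fields + map entry + boxed key. */
    static long objectLayoutBytes(long numPrefs) {
        long perPref = 16 /* object header */ + 8 + 8 + 4 + 4 /* ids, rating, pad */
                     + 32 /* HashMap entry */ + 16 /* boxed key */;
        return numPrefs * perPref;
    }

    public static void main(String[] args) {
        long prefs = 100_000_000L; // ~Netflix scale, per Ted's figure
        long users = 480_000L;
        // Primitive layout lands well under 1 GB; object layout lands near the
        // multi-gigabyte heaps reported in the thread.
        System.out.printf("csr:     %d MB%n", csrBytes(prefs, users) >> 20);
        System.out.printf("objects: %d MB%n", objectLayoutBytes(prefs) >> 20);
    }
}
```

[Under these assumptions the ratio is roughly 10x, which is consistent with Sebastian's "12x more memory is a no-go" remark.]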

On 13-07-18 02:27 PM, Pat Ferrel wrote:
> It was about 2.5M users and 500K items with 25M actions over 6 months of data.



Re: Regarding Online Recommenders

Posted by Pat Ferrel <pa...@gmail.com>.
It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm not sure if it can be used in an item-based recommender, but this is definitely something I would like to pursue. It's probably the only advantage a non-Hadoop implementation can offer in the future.

Many non-Hadoop recommenders are pretty fast. But the existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO they are the culprit of the scalability problem.

May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:
> Well, with itembased the only problem is new items. New users can
> immediately be served by the model (although this is not well supported by
> the API in Mahout). For the majority of usecases I saw, it is perfectly
> fine to have a short delay until new items "enter" the recommender, usually
> this happens after a retraining in batch. You have to care for cold-start
> and collect some interactions anyway.
> 
> 
> 2013/7/18 Pat Ferrel <pa...@gmail.com>
> 
>> Yes, what Myrrix does is good.
>> 
>> My last aside was a wish for an item-based online recommender not only
>> factorized. Ted talks about using Solr for this, which we're experimenting
>> with alongside Myrrix. I suspect Solr works but it does require a bit of
>> tinkering and doesn't have quite the same set of options--no llr similarity
>> for instance.
>> 
>> On the same subject I recently attended a workshop in Seattle for UAI2013
>> where Walmart reported similar results using a factorized recommender. They
>> had to increase the factor number past where it would perform well. Along
>> the way they saw increasing performance measuring precision offline. They
>> eventually gave up on a factorized solution. This decision seems odd but
>> anyway… In the case of Walmart and our data set they are quite diverse. The
>> best idea is probably to create different recommenders for separate parts
>> of the catalog but if you create one model on all items our intuition is
>> that item-based works better than factorized. Again caveat--no A/B tests to
>> support this yet.
>> 
>> Doing an online item-based recommender would quickly run into scaling
>> problems, no? We put together the simple Mahout in-memory version and it
>> could not really handle more than a down-sampled few months of our data.
>> Down-sampling lost us 20% of our precision scores so we moved to the hadoop
>> version. Now we have use-cases for an online recommender that handles
>> anonymous new users and that takes the story full circle.
>> 
>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <ss...@apache.org> wrote:
>> 
>> Hi Pat
>> 
>> I think we should provide a simple support for recommending to anonymous
>> users. We should have a method recommendToAnonymous() that takes a
>> PreferenceArray as argument. For itembased recommenders, its
>> straightforward to compute recommendations, for userbased you have to
>> search through all users once, for latent factor models, you have to fold
>> the user vector into the low dimensional space.
>> 
>> I think Sean already added this method in myrrix and I have some code for
>> my kornakapi project (a simple weblayer for mahout).
>> 
>> Would such a method fit your needs?
>> 
>> Best,
>> Sebastian
>> 
>> 
>> 
>> 2013/7/17 Pat Ferrel <pa...@gmail.com>
>> 
>>> May I ask how you plan to support model updates and 'anonymous' users?
>>> 
>>> I assume the latent factors model is calculated offline still in batch
>>> mode, then there are periodic updates? How are the updates handled? Do
>> you
>>> plan to require batch model refactorization for any update? Or perform
>> some
>>> partial update by maybe just transforming new data into the LF space
>>> already in place then doing full refactorization every so often in batch
>>> mode?
>>> 
>>> By 'anonymous users' I mean users with some history that is not yet
>>> incorporated in the LF model. This could be history from a new user asked
>>> to pick a few items to start the rec process, or an old user with some
>> new
>>> action history not yet in the model. Are you going to allow for passing
>> the
>>> entire history vector or userID+incremental new history to the
>> recommender?
>>> I hope so.
>>> 
>>> For what it's worth we did a comparison of Mahout Item based CF to Mahout
>>> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months
>> of
>>> data. The data was purchase data from a diverse ecom source with a large
>>> variety of products from electronics to clothes. We found Item based CF
>> did
>>> far better than ALS. As we increased the number of latent factors the
>>> results got better but were never within 10% of item based (we used MAP
>> as
>>> the offline metric). Not sure why but maybe it has to do with the
>> diversity
>>> of the item types.
>>> 
>>> I understand that a full item based online recommender has very different
>>> tradeoffs and anyway others may not have seen this disparity of results.
>>> Furthermore we don't have A/B test results yet to validate the offline
>>> metric.
>>> 
>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>> 
>>> Peng,
>>> 
>>> This is the reason I separated out the DataModel, and only put the
>> learner
>>> stuff there. The learner I mentioned yesterday just stores the
>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>>> where preferences are stored.
>>> 
>>> I, kind of, agree with the multi-level DataModel approach:
>>> One for iterating over "all" preferences, one for if one wants to deploy
>> a
>>> recommender and perform a lot of top-N recommendation tasks.
>>> 
>>> (Or one DataModel with a strategy that might reduce existing memory
>>> consumption, while still providing fast access, I am not sure. Let me
>> try a
>>> matrix-backed DataModel approach)
>>> 
>>> Gokhan
>>> 
>>> 
>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
>>> wrote:
>>> 
>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>> representation, 12x more memory is a no-go. The techniques used in
>>>> FactorizablePreferences allow a much more memory efficient
>>> representation,
>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>> into
>>>> 3GB with that approach.
>>>> 
>>>> 
>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>> 
>>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>> 
>>>>> Note also that this is before you have done any work.
>>>>> 
>>>>> Ideally, 100million observations should take << 1GB.
>>>>> 
>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>> wrote:
>>>>>> The second idea is indeed splendid; we should separate the
>>>>>> time-complexity-first and space-complexity-first implementations.
>>>>>> What I'm not quite sure about is whether we really need to create
>>>>>> two interfaces instead of one. Personally, I think 12G heap space is
>>>>>> not that high, right? Most new laptops can already handle that
>>>>>> (emphasis on laptop). And if we replace the hash map (the culprit of
>>>>>> the high memory consumption) with a list/linked list, it would
>>>>>> simply degrade lookup time complexity to an O(n) linear search, not
>>>>>> too bad either. The current DataModel is a result of careful thought
>>>>>> and has undergone extensive testing; it is easier to expand on top
>>>>>> of it instead of subverting it.
>>> 
>> 




Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
If I remember right, a highlight of the 0.8 release is an online clustering
algorithm. I'm not sure if it can be used in an item-based recommender, but
this is definitely something I would like to pursue. It's probably the only
advantage a non-Hadoop implementation can offer in the future.

Many non-Hadoop recommenders are pretty fast. But the existing in-memory
GenericDataModel and FileDataModel are largely implemented for sandboxes;
IMHO they are the culprit of the scalability problem.

May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng
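As a rough illustration of the memory question raised in this thread (Netflix in under a gigabyte, 100M observations in << 1GB), here is a hedged sketch of the kind of flat, primitive-array preference store those numbers imply, as opposed to a HashMap-heavy DataModel. The class and method names are hypothetical, not Mahout API:

```java
// Hypothetical sketch: a flat, primitive-array preference store.
// At roughly 12 bytes per observation (two ints and a float), 100M
// observations fit in about 1.2 GB of raw data, the same ballpark as
// the figures quoted in this thread -- versus the per-entry object and
// boxing overhead of a HashMap-based GenericDataModel.
public class FlatPreferences {
  private final int[] userIdxs;   // dense user indices, not raw IDs
  private final int[] itemIdxs;   // dense item indices
  private final float[] ratings;
  private int size = 0;

  public FlatPreferences(int capacity) {
    userIdxs = new int[capacity];
    itemIdxs = new int[capacity];
    ratings = new float[capacity];
  }

  public void set(int userIdx, int itemIdx, float rating) {
    userIdxs[size] = userIdx;
    itemIdxs[size] = itemIdx;
    ratings[size] = rating;
    size++;
  }

  public int size() { return size; }

  /** Approximate heap footprint of the raw data, in bytes. */
  public long approxBytes() { return 12L * size; }
}
```

Iteration over such parallel arrays is also cache-friendly, which matters for the "fast iteration" requirement mentioned at the top of the thread.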

On 13-07-18 12:14 PM, Sebastian Schelter wrote:
> Well, with itembased the only problem is new items. New users can
> immediately be served by the model (although this is not well supported by
> the API in Mahout). For the majority of use cases I saw, it is perfectly
> fine to have a short delay until new items "enter" the recommender, usually
> this happens after a retraining in batch. You have to care for cold-start
> and collect some interactions anyway.
>
>
> 2013/7/18 Pat Ferrel <pa...@gmail.com>
>
>> Yes, what Myrrix does is good.
>>
>> My last aside was a wish for an item-based online recommender not only
>> factorized. Ted talks about using Solr for this, which we're experimenting
>> with alongside Myrrix. I suspect Solr works but it does require a bit of
>> tinkering and doesn't have quite the same set of options--no llr similarity
>> for instance.
>>
>> On the same subject I recently attended a workshop in Seattle for UAI2013
>> where Walmart reported similar results using a factorized recommender. They
>> had to increase the factor number past where it would perform well. Along
>> the way they saw increasing performance measuring precision offline. They
>> eventually gave up on a factorized solution. This decision seems odd but
>> anyway… In the case of Walmart and our data set they are quite diverse. The
>> best idea is probably to create different recommenders for separate parts
>> of the catalog but if you create one model on all items our intuition is
>> that item-based works better than factorized. Again caveat--no A/B tests to
>> support this yet.
>>
>> Doing an online item-based recommender would quickly run into scaling
>> problems, no? We put together the simple Mahout in-memory version and it
>> could not really handle more than a down-sampled few months of our data.
>> Down-sampling lost us 20% of our precision scores so we moved to the hadoop
>> version. Now we have use-cases for an online recommender that handles
>> anonymous new users and that takes the story full circle.
>>
>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>
>> Hi Pat
>>
>> I think we should provide simple support for recommending to anonymous
>> users. We should have a method recommendToAnonymous() that takes a
>> PreferenceArray as argument. For itembased recommenders, it's
>> straightforward to compute recommendations, for userbased you have to
>> search through all users once, for latent factor models, you have to fold
>> the user vector into the low dimensional space.
>>
>> I think Sean already added this method in myrrix and I have some code for
>> my kornakapi project (a simple weblayer for mahout).
>>
>> Would such a method fit your needs?
>>
>> Best,
>> Sebastian
>>
>>
>>
>> 2013/7/17 Pat Ferrel <pa...@gmail.com>
>>
>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>
>>> I assume the latent factors model is calculated offline still in batch
>>> mode, then there are periodic updates? How are the updates handled? Do
>> you
>>> plan to require batch model refactorization for any update? Or perform
>> some
>>> partial update by maybe just transforming new data into the LF space
>>> already in place then doing full refactorization every so often in batch
>>> mode?
>>>
>>> By 'anonymous users' I mean users with some history that is not yet
>>> incorporated in the LF model. This could be history from a new user asked
>>> to pick a few items to start the rec process, or an old user with some
>> new
>>> action history not yet in the model. Are you going to allow for passing
>> the
>>> entire history vector or userID+incremental new history to the
>> recommender?
>>> I hope so.
>>>
>>> For what it's worth we did a comparison of Mahout Item based CF to Mahout
>>> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months
>> of
>>> data. The data was purchase data from a diverse ecom source with a large
>>> variety of products from electronics to clothes. We found Item based CF
>> did
>>> far better than ALS. As we increased the number of latent factors the
>>> results got better but were never within 10% of item based (we used MAP
>> as
>>> the offline metric). Not sure why but maybe it has to do with the
>> diversity
>>> of the item types.
>>>
>>> I understand that a full item based online recommender has very different
>>> tradeoffs and anyway others may not have seen this disparity of results.
>>> Furthermore we don't have A/B test results yet to validate the offline
>>> metric.
>>>
>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>>
>>> Peng,
>>>
>>> This is the reason I separated out the DataModel, and only put the
>> learner
>>> stuff there. The learner I mentioned yesterday just stores the
>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>>> where preferences are stored.
>>>
>>> I, kind of, agree with the multi-level DataModel approach:
>>> One for iterating over "all" preferences, one for if one wants to deploy
>> a
>>> recommender and perform a lot of top-N recommendation tasks.
>>>
>>> (Or one DataModel with a strategy that might reduce existing memory
>>> consumption, while still providing fast access, I am not sure. Let me
>> try a
>>> matrix-backed DataModel approach)
>>>
>>> Gokhan
>>>
>>>
>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
>>> wrote:
>>>
>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>> representation, 12x more memory is a no-go. The techniques used in
>>>> FactorizablePreferences allow a much more memory efficient
>>> representation,
>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>> into
>>>> 3GB with that approach.
>>>>
>>>>
>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>>
>>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>
>>>>> Note also that this is before you have done any work.
>>>>>
>>>>> Ideally, 100million observations should take << 1GB.
>>>>>
>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>> wrote:
>>>>>> The second idea is indeed splendid; we should separate the
>>>>>> time-complexity-first and space-complexity-first implementations.
>>>>>> What I'm not quite sure about is whether we really need to create
>>>>>> two interfaces instead of one. Personally, I think 12G heap space is
>>>>>> not that high, right? Most new laptops can already handle that
>>>>>> (emphasis on laptop). And if we replace the hash map (the culprit of
>>>>>> the high memory consumption) with a list/linked list, it would
>>>>>> simply degrade lookup time complexity to an O(n) linear search, not
>>>>>> too bad either. The current DataModel is a result of careful thought
>>>>>> and has undergone extensive testing; it is easier to expand on top
>>>>>> of it instead of subverting it.
>>>
>>



Re: Regarding Online Recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of use cases I saw, it is perfectly
fine to have a short delay until new items "enter" the recommender, usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel <pa...@gmail.com>

> Yes, what Myrrix does is good.
>
> My last aside was a wish for an item-based online recommender not only
> factorized. Ted talks about using Solr for this, which we're experimenting
> with alongside Myrrix. I suspect Solr works but it does require a bit of
> tinkering and doesn't have quite the same set of options--no llr similarity
> for instance.
>
> On the same subject I recently attended a workshop in Seattle for UAI2013
> where Walmart reported similar results using a factorized recommender. They
> had to increase the factor number past where it would perform well. Along
> the way they saw increasing performance measuring precision offline. They
> eventually gave up on a factorized solution. This decision seems odd but
> anyway… In the case of Walmart and our data set they are quite diverse. The
> best idea is probably to create different recommenders for separate parts
> of the catalog but if you create one model on all items our intuition is
> that item-based works better than factorized. Again caveat--no A/B tests to
> support this yet.
>
> Doing an online item-based recommender would quickly run into scaling
> problems, no? We put together the simple Mahout in-memory version and it
> could not really handle more than a down-sampled few months of our data.
> Down-sampling lost us 20% of our precision scores so we moved to the hadoop
> version. Now we have use-cases for an online recommender that handles
> anonymous new users and that takes the story full circle.
>
> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <ss...@apache.org> wrote:
>
> Hi Pat
>
> I think we should provide simple support for recommending to anonymous
> users. We should have a method recommendToAnonymous() that takes a
> PreferenceArray as argument. For itembased recommenders, it's
> straightforward to compute recommendations, for userbased you have to
> search through all users once, for latent factor models, you have to fold
> the user vector into the low dimensional space.
>
> I think Sean already added this method in myrrix and I have some code for
> my kornakapi project (a simple weblayer for mahout).
>
> Would such a method fit your needs?
>
> Best,
> Sebastian
>
>
>
> 2013/7/17 Pat Ferrel <pa...@gmail.com>
>
> > May I ask how you plan to support model updates and 'anonymous' users?
> >
> > I assume the latent factors model is calculated offline still in batch
> > mode, then there are periodic updates? How are the updates handled? Do
> you
> > plan to require batch model refactorization for any update? Or perform
> some
> > partial update by maybe just transforming new data into the LF space
> > already in place then doing full refactorization every so often in batch
> > mode?
> >
> > By 'anonymous users' I mean users with some history that is not yet
> > incorporated in the LF model. This could be history from a new user asked
> > to pick a few items to start the rec process, or an old user with some
> new
> > action history not yet in the model. Are you going to allow for passing
> the
> > entire history vector or userID+incremental new history to the
> recommender?
> > I hope so.
> >
> > For what it's worth we did a comparison of Mahout Item based CF to Mahout
> > ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months
> of
> > data. The data was purchase data from a diverse ecom source with a large
> > variety of products from electronics to clothes. We found Item based CF
> did
> > far better than ALS. As we increased the number of latent factors the
> > results got better but were never within 10% of item based (we used MAP
> as
> > the offline metric). Not sure why but maybe it has to do with the
> diversity
> > of the item types.
> >
> > I understand that a full item based online recommender has very different
> > tradeoffs and anyway others may not have seen this disparity of results.
> > Furthermore we don't have A/B test results yet to validate the offline
> > metric.
> >
> > On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:
> >
> > Peng,
> >
> > This is the reason I separated out the DataModel, and only put the
> learner
> > stuff there. The learner I mentioned yesterday just stores the
> > parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
> > where preferences are stored.
> >
> > I, kind of, agree with the multi-level DataModel approach:
> > One for iterating over "all" preferences, one for if one wants to deploy
> a
> > recommender and perform a lot of top-N recommendation tasks.
> >
> > (Or one DataModel with a strategy that might reduce existing memory
> > consumption, while still providing fast access, I am not sure. Let me
> try a
> > matrix-backed DataModel approach)
> >
> > Gokhan
> >
> >
> > On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
> > wrote:
> >
> >> I completely agree, Netflix is less than one gigabyte in a smart
> >> representation, 12x more memory is a no-go. The techniques used in
> >> FactorizablePreferences allow a much more memory efficient
> > representation,
> >> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
> > into
> >> 3GB with that approach.
> >>
> >>
> >> 2013/7/16 Ted Dunning <te...@gmail.com>
> >>
> >>> Netflix is a small dataset.  12G for that seems quite excessive.
> >>>
> >>> Note also that this is before you have done any work.
> >>>
> >>> Ideally, 100million observations should take << 1GB.
> >>>
> >>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
> >> wrote:
> >>>
> >>>> The second idea is indeed splendid; we should separate the
> >>>> time-complexity-first and space-complexity-first implementations.
> >>>> What I'm not quite sure about is whether we really need to create
> >>>> two interfaces instead of one. Personally, I think 12G heap space is
> >>>> not that high, right? Most new laptops can already handle that
> >>>> (emphasis on laptop). And if we replace the hash map (the culprit of
> >>>> the high memory consumption) with a list/linked list, it would
> >>>> simply degrade lookup time complexity to an O(n) linear search, not
> >>>> too bad either. The current DataModel is a result of careful thought
> >>>> and has undergone extensive testing; it is easier to expand on top
> >>>> of it instead of subverting it.
> >>>
> >>
> >
> >
>
>

Re: Regarding Online Recommenders

Posted by Pat Ferrel <pa...@gmail.com>.
Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only factorized. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works but it does require a bit of tinkering and doesn't have quite the same set of options--no llr similarity for instance. 

On the same subject, I recently attended a workshop in Seattle for UAI 2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor count past where it would perform well. Along the way they saw increasing performance when measuring precision offline. They eventually gave up on a factorized solution. This decision seems odd but anyway… Both Walmart's data and our data set are quite diverse. The best idea is probably to create different recommenders for separate parts of the catalog, but if you create one model over all items, our intuition is that item-based works better than factorized. Again, the caveat--no A/B tests to support this yet.

Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores so we moved to the hadoop version. Now we have use-cases for an online recommender that handles anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <ss...@apache.org> wrote:

Hi Pat

I think we should provide simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, it's
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel <pa...@gmail.com>

> May I ask how you plan to support model updates and 'anonymous' users?
> 
> I assume the latent factors model is calculated offline still in batch
> mode, then there are periodic updates? How are the updates handled? Do you
> plan to require batch model refactorization for any update? Or perform some
> partial update by maybe just transforming new data into the LF space
> already in place then doing full refactorization every so often in batch
> mode?
> 
> By 'anonymous users' I mean users with some history that is not yet
> incorporated in the LF model. This could be history from a new user asked
> to pick a few items to start the rec process, or an old user with some new
> action history not yet in the model. Are you going to allow for passing the
> entire history vector or userID+incremental new history to the recommender?
> I hope so.
> 
> For what it's worth we did a comparison of Mahout Item based CF to Mahout
> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of
> data. The data was purchase data from a diverse ecom source with a large
> variety of products from electronics to clothes. We found Item based CF did
> far better than ALS. As we increased the number of latent factors the
> results got better but were never within 10% of item based (we used MAP as
> the offline metric). Not sure why but maybe it has to do with the diversity
> of the item types.
> 
> I understand that a full item based online recommender has very different
> tradeoffs and anyway others may not have seen this disparity of results.
> Furthermore we don't have A/B test results yet to validate the offline
> metric.
> 
> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:
> 
> Peng,
> 
> This is the reason I separated out the DataModel, and only put the learner
> stuff there. The learner I mentioned yesterday just stores the
> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
> where preferences are stored.
> 
> I, kind of, agree with the multi-level DataModel approach:
> One for iterating over "all" preferences, one for if one wants to deploy a
> recommender and perform a lot of top-N recommendation tasks.
> 
> (Or one DataModel with a strategy that might reduce existing memory
> consumption, while still providing fast access, I am not sure. Let me try a
> matrix-backed DataModel approach)
> 
> Gokhan
> 
> 
> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
> 
>> I completely agree, Netflix is less than one gigabyte in a smart
>> representation, 12x more memory is a no-go. The techniques used in
>> FactorizablePreferences allow a much more memory efficient
> representation,
>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
> into
>> 3GB with that approach.
>> 
>> 
>> 2013/7/16 Ted Dunning <te...@gmail.com>
>> 
>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>> 
>>> Note also that this is before you have done any work.
>>> 
>>> Ideally, 100million observations should take << 1GB.
>>> 
>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>> wrote:
>>> 
>>>> The second idea is indeed splendid; we should separate the
>>>> time-complexity-first and space-complexity-first implementations.
>>>> What I'm not quite sure about is whether we really need to create
>>>> two interfaces instead of one. Personally, I think 12G heap space is
>>>> not that high, right? Most new laptops can already handle that
>>>> (emphasis on laptop). And if we replace the hash map (the culprit of
>>>> the high memory consumption) with a list/linked list, it would
>>>> simply degrade lookup time complexity to an O(n) linear search, not
>>>> too bad either. The current DataModel is a result of careful thought
>>>> and has undergone extensive testing; it is easier to expand on top
>>>> of it instead of subverting it.
>>> 
>> 
> 
> 


Re: Regarding Online Recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat

I think we should provide simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, it's
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian
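For the item-based case called "straightforward" above, a recommendToAnonymous() could look roughly like this: score each candidate item by its summed similarity to the items in the anonymous user's history. This is a hedged sketch; the class name and the in-memory similarity map are hypothetical stand-ins for a precomputed ItemSimilarity, not Mahout's actual API:

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of recommendToAnonymous() for an item-based
// recommender. No user ID is needed: only the anonymous history and a
// precomputed item-item similarity structure.
public class AnonymousItemBased {
  // similarity.get(a).get(b) = sim(a, b); assumed precomputed offline
  private final Map<Long, Map<Long, Double>> similarity;

  public AnonymousItemBased(Map<Long, Map<Long, Double>> similarity) {
    this.similarity = similarity;
  }

  public List<Long> recommendToAnonymous(Set<Long> history, int howMany) {
    Map<Long, Double> scores = new HashMap<>();
    for (long seen : history) {
      Map<Long, Double> sims =
          similarity.getOrDefault(seen, Collections.emptyMap());
      for (Map.Entry<Long, Double> e : sims.entrySet()) {
        // accumulate similarity mass on items not already in the history
        if (!history.contains(e.getKey())) {
          scores.merge(e.getKey(), e.getValue(), Double::sum);
        }
      }
    }
    return scores.entrySet().stream()
        .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
        .limit(howMany)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
```

For latent factor models, the analogous step is folding the history into the low-dimensional space first, then scoring items by dot product against the folded-in user vector.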



2013/7/17 Pat Ferrel <pa...@gmail.com>

> May I ask how you plan to support model updates and 'anonymous' users?
>
> I assume the latent factors model is calculated offline still in batch
> mode, then there are periodic updates? How are the updates handled? Do you
> plan to require batch model refactorization for any update? Or perform some
> partial update by maybe just transforming new data into the LF space
> already in place then doing full refactorization every so often in batch
> mode?
>
> By 'anonymous users' I mean users with some history that is not yet
> incorporated in the LF model. This could be history from a new user asked
> to pick a few items to start the rec process, or an old user with some new
> action history not yet in the model. Are you going to allow for passing the
> entire history vector or userID+incremental new history to the recommender?
> I hope so.
>
> For what it's worth we did a comparison of Mahout Item based CF to Mahout
> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of
> data. The data was purchase data from a diverse ecom source with a large
> variety of products from electronics to clothes. We found Item based CF did
> far better than ALS. As we increased the number of latent factors the
> results got better but were never within 10% of item based (we used MAP as
> the offline metric). Not sure why but maybe it has to do with the diversity
> of the item types.
>
> I understand that a full item based online recommender has very different
> tradeoffs and anyway others may not have seen this disparity of results.
> Furthermore we don't have A/B test results yet to validate the offline
> metric.
>
> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:
>
> Peng,
>
> This is the reason I separated out the DataModel, and only put the learner
> stuff there. The learner I mentioned yesterday just stores the
> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
> where preferences are stored.
>
> I, kind of, agree with the multi-level DataModel approach:
> One for iterating over "all" preferences, one for if one wants to deploy a
> recommender and perform a lot of top-N recommendation tasks.
>
> (Or one DataModel with a strategy that might reduce existing memory
> consumption, while still providing fast access, I am not sure. Let me try a
> matrix-backed DataModel approach)
>
> Gokhan
>
>
> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > I completely agree, Netflix is less than one gigabyte in a smart
> > representation, 12x more memory is a no-go. The techniques used in
> > FactorizablePreferences allow a much more memory efficient
> representation,
> > tested on KDD Music dataset which is approx 2.5 times Netflix and fits
> into
> > 3GB with that approach.
> >
> >
> > 2013/7/16 Ted Dunning <te...@gmail.com>
> >
> >> Netflix is a small dataset.  12G for that seems quite excessive.
> >>
> >> Note also that this is before you have done any work.
> >>
> >> Ideally, 100million observations should take << 1GB.
> >>
> >> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
> > wrote:
> >>
> >>> The second idea is indeed splendid; we should separate the
> >>> time-complexity-first and space-complexity-first implementations.
> >>> What I'm not quite sure about is whether we really need to create
> >>> two interfaces instead of one. Personally, I think 12G heap space is
> >>> not that high, right? Most new laptops can already handle that
> >>> (emphasis on laptop). And if we replace the hash map (the culprit of
> >>> the high memory consumption) with a list/linked list, it would
> >>> simply degrade lookup time complexity to an O(n) linear search, not
> >>> too bad either. The current DataModel is a result of careful thought
> >>> and has undergone extensive testing; it is easier to expand on top
> >>> of it instead of subverting it.
> >>
> >
>
>

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
For a low-rank matrix factorization based recommender, a new preference is
not added to the model as-is; a prediction is just the dot product of two
vectors in the low-dimensional space, so the preference itself needs no
projection. The user and item vectors, however, may need to be projected
into a lower-dimensional space, if and only if you want to reduce the rank
of the preference matrix. The refactorization step in SGD is super
fast--that's the charm of SGD. So, yes, we will refactorize on every update.

Yours Peng

On 13-07-18 11:34 AM, Pat Ferrel wrote:
>> On Jul 17, 2013, at 1:19 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>> Hi Pat, please see my response inline.
>>
>> Best,
>> Gokhan
>>
>>
>> On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>
>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>
>>> I assume the latent factors model is calculated offline still in batch
>>> mode, then there are periodic updates? How are the updates handled?
>>
>> If you are referring to the recommender under discussion here, no, updating
>> the model can be done with a single preference, using stochastic gradient
>> descent, by updating the particular user and item factors simultaneously.
>>
> Aren't there two different things needed to truly update the model: 1) add the new preference to the lower dimensional space, and 2) refactorize all the preferences. #2 only needs to be done periodically--afaik. #1 would be super fast and could be done at runtime. Am I wrong, or are you planning to incrementally refactorize the entire preference array with every new preference?
>



Re: Regarding Online Recommenders

Posted by Pat Ferrel <pa...@gmail.com>.
> 
> On Jul 17, 2013, at 1:19 PM, Gokhan Capan <gk...@gmail.com> wrote:
> 
> Hi Pat, please see my response inline.
> 
> Best,
> Gokhan
> 
> 
> On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel <pa...@gmail.com> wrote:
> 
>> May I ask how you plan to support model updates and 'anonymous' users?
>> 
>> I assume the latent factors model is calculated offline still in batch
>> mode, then there are periodic updates? How are the updates handled?
> 
> 
> If you are referring to the recommender under discussion here, no, updating
> the model can be done with a single preference, using stochastic gradient
> descent, by updating the particular user and item factors simultaneously.
> 

Aren't there two different things needed to truly update the model: 1) add the new preference to the lower dimensional space, and 2) refactorize all the preferences. #2 only needs to be done periodically--afaik. #1 would be super fast and could be done at runtime. Am I wrong, or are you planning to incrementally refactorize the entire preference array with every new preference?

Re: Regarding Online Recommenders

Posted by Gokhan Capan <gk...@gmail.com>.
Hi Pat, please see my response inline.

Best,
Gokhan


On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel <pa...@gmail.com> wrote:

> May I ask how you plan to support model updates and 'anonymous' users?
>
> I assume the latent factors model is calculated offline still in batch
> mode, then there are periodic updates? How are the updates handled?


If you are referring to the recommender under discussion here, no, updating
the model can be done with a single preference, using stochastic gradient
descent, by updating the particular user and item factors simultaneously.
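The single-preference update described above can be sketched as follows. This is a hedged illustration, not Mahout's implementation: the class name is hypothetical, and the learning rate and regularization constants are illustrative placeholders.

```java
// Hypothetical sketch of an online SGD step for matrix factorization:
// predict via a dot product, then update the matching user and item
// factor vectors simultaneously. Each train() call touches only 2*k
// parameters, which is why a per-preference update is cheap.
public class OnlineSgdFactorizer {
  private final double[][] userFactors; // noOfUsers x k
  private final double[][] itemFactors; // noOfItems x k
  private final double learningRate = 0.01; // illustrative value
  private final double lambda = 0.05;       // illustrative regularization

  public OnlineSgdFactorizer(double[][] userFactors, double[][] itemFactors) {
    this.userFactors = userFactors;
    this.itemFactors = itemFactors;
  }

  /** Predicted preference: dot product of the two factor vectors. */
  public double predict(int u, int i) {
    double dot = 0.0;
    for (int f = 0; f < userFactors[u].length; f++) {
      dot += userFactors[u][f] * itemFactors[i][f];
    }
    return dot;
  }

  /** One O(k) training step on a single (user, item, rating) observation. */
  public void train(int u, int i, double rating) {
    double err = rating - predict(u, i);
    for (int f = 0; f < userFactors[u].length; f++) {
      double uf = userFactors[u][f];
      double vf = itemFactors[i][f];
      userFactors[u][f] = uf + learningRate * (err * vf - lambda * uf);
      itemFactors[i][f] = vf + learningRate * (err * uf - lambda * vf);
    }
  }
}
```

A DataModel#setPreference could delegate to train() exactly as described at the top of the thread, with new user or item rows appended to the factor matrices as they appear.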

Do you plan to require batch model refactorization for any update? Or
> perform some partial update by maybe just transforming new data into the LF
> space already in place then doing full refactorization every so often in
> batch mode?
>
> By 'anonymous users' I mean users with some history that is not yet
> incorporated in the LF model. This could be history from a new user asked
> to pick a few items to start the rec process, or an old user with some new
> action history not yet in the model. Are you going to allow for passing the
> entire history vector or userID+incremental new history to the recommender?
> I hope so.


> For what it's worth we did a comparison of Mahout Item based CF to Mahout
> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of
> data. The data was purchase data from a diverse ecom source with a large
> variety of products from electronics to clothes. We found Item based CF did
> far better than ALS. As we increased the number of latent factors the
> results got better but were never within 10% of item based (we used MAP as
> the offline metric). Not sure why but maybe it has to do with the diversity
> of the item types.
>

My first question: are those actions only positive, like "purchase", as
you mentioned?


> I understand that a full item based online recommender has very different
> tradeoffs and anyway others may not have seen this disparity of results.
> Furthermore we don't have A/B test results yet to validate the offline
> metric.


I personally think an A/B test is the best way to evaluate a recommender,
and if you are able to share yours, I look forward to seeing the
results. I believe that would be a great contribution to some future
decisions.


> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:
>
> Peng,
>
> This is the reason I separated out the DataModel, and only put the learner
> stuff there. The learner I mentioned yesterday just stores the
> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
> where preferences are stored.
>
> I, kind of, agree with the multi-level DataModel approach:
> One for iterating over "all" preferences, one for if one wants to deploy a
> recommender and perform a lot of top-N recommendation tasks.
>
> (Or one DataModel with a strategy that might reduce existing memory
> consumption, while still providing fast access, I am not sure. Let me try a
> matrix-backed DataModel approach)
>
> Gokhan
>
>
> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > I completely agree, Netflix is less than one gigabyte in a smart
> > representation, 12x more memory is a nogo. The techniques used in
> > FactorizablePreferences allow a much more memory efficient
> representation,
> > tested on KDD Music dataset which is approx 2.5 times Netflix and fits
> into
> > 3GB with that approach.
> >
> >
> > 2013/7/16 Ted Dunning <te...@gmail.com>
> >
> >> Netflix is a small dataset.  12G for that seems quite excessive.
> >>
> >> Note also that this is before you have done any work.
> >>
> >> Ideally, 100million observations should take << 1GB.
> >>
> >> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
> > wrote:
> >>
> >>> The second idea is indeed splendid, we should separate time-complexity
> >>> first and space-complexity first implementation. What I'm not quite
> > sure,
> >>> is that if we really need to create two interfaces instead of one.
> >>> Personally, I think 12G heap space is not that high right? Most new
> >> laptop
> >>> can already handle that (emphasis on laptop). And if we replace hash
> > map
> >>> (the culprit of high memory consumption) with list/linkedList, it would
> >>> simply degrade time complexity for a linear search to O(n), not too bad
> >>> either. The current DataModel is a result of careful thoughts and has
> >>> underwent extensive test, it is easier to expand on top of it instead
> > of
> >>> subverting it.
> >>
> >
>
>

Re: Regarding Online Recommenders

Posted by Pat Ferrel <pa...@gmail.com>.
May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update by maybe just transforming new data into the LF space already in place then doing full refactorization every so often in batch mode?

By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow for passing the entire history vector or userID+incremental new history to the recommender? I hope so.
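For concreteness, the "transform new data into the LF space already in place" option might be sketched like this -- fold_in_user is a hypothetical helper in Python, not an existing Mahout API; it solves a small ridge regression against the fixed item factors:

```python
import numpy as np

def fold_in_user(V, item_ids, ratings, reg=0.05):
    """Estimate latent factors for a user (or fresh history) absent from
    the model, using only the fixed item-factor matrix V."""
    Vi = V[item_ids]                        # factors of the items in the history
    k = V.shape[1]
    A = Vi.T @ Vi + reg * np.eye(k)         # small k x k ridge system
    b = Vi.T @ np.asarray(ratings, dtype=float)
    return np.linalg.solve(A, b)

# Toy item-factor matrix: 3 items, 2 latent factors
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
u = fold_in_user(V, [0, 2], [5.0, 3.0])     # anonymous user rated items 0 and 2
scores = V @ u                              # score every item for top-N ranking
```

This handles both a brand-new user picking a few items and an old user with new actions, without refactorizing until the next batch run.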

For what it's worth we did a comparison of Mahout Item based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of data. The data was purchase data from a diverse ecom source with a large variety of products from electronics to clothes. We found Item based CF did far better than ALS. As we increased the number of latent factors the results got better but were never within 10% of item based (we used MAP as the offline metric). Not sure why but maybe it has to do with the diversity of the item types.

I understand that a full item based online recommender has very different tradeoffs and anyway others may not have seen this disparity of results. Furthermore we don't have A/B test results yet to validate the offline metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gk...@gmail.com> wrote:

Peng,

This is the reason I separated out the DataModel, and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
where preferences are stored.

I, kind of, agree with the multi-level DataModel approach:
One for iterating over "all" preferences, one for if one wants to deploy a
recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption, while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach)

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org> wrote:

> I completely agree, Netflix is less than one gigabyte in a smart
> representation, 12x more memory is a nogo. The techniques used in
> FactorizablePreferences allow a much more memory efficient representation,
> tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
> 3GB with that approach.
> 
> 
> 2013/7/16 Ted Dunning <te...@gmail.com>
> 
>> Netflix is a small dataset.  12G for that seems quite excessive.
>> 
>> Note also that this is before you have done any work.
>> 
>> Ideally, 100million observations should take << 1GB.
>> 
>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
> wrote:
>> 
>>> The second idea is indeed splendid, we should separate time-complexity
>>> first and space-complexity first implementation. What I'm not quite
> sure,
>>> is that if we really need to create two interfaces instead of one.
>>> Personally, I think 12G heap space is not that high right? Most new
>> laptop
>>> can already handle that (emphasis on laptop). And if we replace hash
> map
>>> (the culprit of high memory consumption) with list/linkedList, it would
>>> simply degrade time complexity for a linear search to O(n), not too bad
>>> either. The current DataModel is a result of careful thoughts and has
>>> underwent extensive test, it is easier to expand on top of it instead
> of
>>> subverting it.
>> 
> 


Re: Regarding Online Recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Abhishek,

Great to hear that you're willing to put some work into this! Have you
ever worked with Mahout's recommenders before? If not, then a good first
step would be to get familiar with them and code up a few examples.

Best,
Sebastian

On 17.07.2013 07:29, Abhishek Sharma wrote:
> Sorry to interrupt guys, but I just wanted to bring it to your notice that
> I am also interested in contributing to this idea. I am planning to
> participate in ASF-ICFOSS mentor-ship
> programme<https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme>.
> (this is very similar to GSOC)
> 
> I do have strong concepts in machine learning (have done the ML course by
> Andrew NG on coursera) also, I am good in programming (have 2.5 yrs of work
> experience). I am not really sure of how can I approach this problem (but I
> do have a strong interest to work on this problem) hence would like to pair
> up on this. I am currently working as a research intern at Indian Institute
> of Science (IISc), Bangalore India and can put up 15-20 hrs per week.
> 
> Please let me know your thoughts if I can be a part of this.
> 
> Thanks & Regards,
> Abhishek Sharma
> http://www.linkedin.com/in/abhi21
> https://github.com/abhi21
> 
> 
> On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan <gk...@gmail.com> wrote:
> 
>> Peng,
>>
>> This is the reason I separated out the DataModel, and only put the learner
>> stuff there. The learner I mentioned yesterday just stores the
>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>> where preferences are stored.
>>
>> I, kind of, agree with the multi-level DataModel approach:
>> One for iterating over "all" preferences, one for if one wants to deploy a
>> recommender and perform a lot of top-N recommendation tasks.
>>
>> (Or one DataModel with a strategy that might reduce existing memory
>> consumption, while still providing fast access, I am not sure. Let me try a
>> matrix-backed DataModel approach)
>>
>> Gokhan
>>
>>
>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>>> I completely agree, Netflix is less than one gigabyte in a smart
>>> representation, 12x more memory is a nogo. The techniques used in
>>> FactorizablePreferences allow a much more memory efficient
>> representation,
>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>> into
>>> 3GB with that approach.
>>>
>>>
>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>
>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>
>>>> Note also that this is before you have done any work.
>>>>
>>>> Ideally, 100million observations should take << 1GB.
>>>>
>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>> wrote:
>>>>
>>>>> The second idea is indeed splendid, we should separate
>> time-complexity
>>>>> first and space-complexity first implementation. What I'm not quite
>>> sure,
>>>>> is that if we really need to create two interfaces instead of one.
>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>> laptop
>>>>> can already handle that (emphasis on laptop). And if we replace hash
>>> map
>>>>> (the culprit of high memory consumption) with list/linkedList, it
>> would
>>>>> simply degrade time complexity for a linear search to O(n), not too
>> bad
>>>>> either. The current DataModel is a result of careful thoughts and has
>>>>> underwent extensive test, it is easier to expand on top of it instead
>>> of
>>>>> subverting it.
>>>>
>>>
>>
> 
> 
> 


Re: Regarding Online Recommenders

Posted by Abhishek Sharma <jk...@gmail.com>.
Thanks for the encouragement, guys, and my apologies for deviating from the
discussion again.

As suggested by Sebastian, I looked into Mahout's recommenders and
implemented a user-based recommender as well. IMHO, I now have a fair
understanding of the recommenders.

In order to be a part of the ASF-ICFOSS mentorship
programme<https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme>,
I am required to submit a proposal discussing my plan for working on
this idea. My understanding, based on the discussion in this thread, so
far is that we are still in the process of finalizing an approach and
have yet to start working on the idea.

I need your help with the following:
1) Could you please suggest the sub-features that I would be able to
contribute to, so that I can mention them under expected deliverables?
2) Also, I would be really grateful if one of you could be my
mentor<http://community.apache.org/mentoringprogramme.html> as part of
this programme (as a mentee I would need only some
pointers/directions to proceed further). Please let me know.

I am really excited and am looking forward to a positive response, so that I
can start working on this feature.

Thanks & Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21

On Thu, Jul 18, 2013 at 2:39 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

> Awesome! your reinforcements are highly appreciated.
>
>
> On 13-07-17 01:29 AM, Abhishek Sharma wrote:
>
>> Sorry to interrupt guys, but I just wanted to bring it to your notice that
>> I am also interested in contributing to this idea. I am planning to
>> participate in ASF-ICFOSS mentor-ship
>> programme<https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme>.
>>
>> (this is very similar to GSOC)
>>
>> I do have strong concepts in machine learning (have done the ML course by
>> Andrew NG on coursera) also, I am good in programming (have 2.5 yrs of
>> work
>> experience). I am not really sure of how can I approach this problem (but
>> I
>> do have a strong interest to work on this problem) hence would like to
>> pair
>> up on this. I am currently working as a research intern at Indian
>> Institute
>> of Science (IISc), Bangalore India and can put up 15-20 hrs per week.
>>
>> Please let me know your thoughts if I can be a part of this.
>>
>> Thanks & Regards,
>> Abhishek Sharma
>> http://www.linkedin.com/in/abhi21
>> https://github.com/abhi21
>>
>>
>> On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>>  Peng,
>>>
>>> This is the reason I separated out the DataModel, and only put the
>>> learner
>>> stuff there. The learner I mentioned yesterday just stores the
>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>>> where preferences are stored.
>>>
>>> I, kind of, agree with the multi-level DataModel approach:
>>> One for iterating over "all" preferences, one for if one wants to deploy
>>> a
>>> recommender and perform a lot of top-N recommendation tasks.
>>>
>>> (Or one DataModel with a strategy that might reduce existing memory
>>> consumption, while still providing fast access, I am not sure. Let me
>>> try a
>>> matrix-backed DataModel approach)
>>>
>>> Gokhan
>>>
>>>
>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
>>> wrote:
>>>
>>>  I completely agree, Netflix is less than one gigabyte in a smart
>>>> representation, 12x more memory is a nogo. The techniques used in
>>>> FactorizablePreferences allow a much more memory efficient
>>>>
>>> representation,
>>>
>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>>>
>>> into
>>>
>>>> 3GB with that approach.
>>>>
>>>>
>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>>
>>>>  Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>
>>>>> Note also that this is before you have done any work.
>>>>>
>>>>> Ideally, 100million observations should take << 1GB.
>>>>>
>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>>
>>>> wrote:
>>>>
>>>>> The second idea is indeed splendid, we should separate
>>>>>>
>>>>> time-complexity
>>>
>>>> first and space-complexity first implementation. What I'm not quite
>>>>>>
>>>>> sure,
>>>>
>>>>> is that if we really need to create two interfaces instead of one.
>>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>>>
>>>>> laptop
>>>>>
>>>>>> can already handle that (emphasis on laptop). And if we replace hash
>>>>>>
>>>>> map
>>>>
>>>>> (the culprit of high memory consumption) with list/linkedList, it
>>>>>>
>>>>> would
>>>
>>>> simply degrade time complexity for a linear search to O(n), not too
>>>>>>
>>>>> bad
>>>
>>>> either. The current DataModel is a result of careful thoughts and has
>>>>>> underwent extensive test, it is easier to expand on top of it instead
>>>>>>
>>>>> of
>>>>
>>>>> subverting it.
>>>>>>
>>>>>
>>
>>
>
>


-- 
--
Abhishek Sharma
ThoughtWorks

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
Awesome! Your reinforcements are highly appreciated.

On 13-07-17 01:29 AM, Abhishek Sharma wrote:
> Sorry to interrupt guys, but I just wanted to bring it to your notice that
> I am also interested in contributing to this idea. I am planning to
> participate in ASF-ICFOSS mentor-ship
> programme<https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme>.
> (this is very similar to GSOC)
>
> I do have strong concepts in machine learning (have done the ML course by
> Andrew NG on coursera) also, I am good in programming (have 2.5 yrs of work
> experience). I am not really sure of how can I approach this problem (but I
> do have a strong interest to work on this problem) hence would like to pair
> up on this. I am currently working as a research intern at Indian Institute
> of Science (IISc), Bangalore India and can put up 15-20 hrs per week.
>
> Please let me know your thoughts if I can be a part of this.
>
> Thanks & Regards,
> Abhishek Sharma
> http://www.linkedin.com/in/abhi21
> https://github.com/abhi21
>
>
> On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan <gk...@gmail.com> wrote:
>
>> Peng,
>>
>> This is the reason I separated out the DataModel, and only put the learner
>> stuff there. The learner I mentioned yesterday just stores the
>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>> where preferences are stored.
>>
>> I, kind of, agree with the multi-level DataModel approach:
>> One for iterating over "all" preferences, one for if one wants to deploy a
>> recommender and perform a lot of top-N recommendation tasks.
>>
>> (Or one DataModel with a strategy that might reduce existing memory
>> consumption, while still providing fast access, I am not sure. Let me try a
>> matrix-backed DataModel approach)
>>
>> Gokhan
>>
>>
>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>>> I completely agree, Netflix is less than one gigabyte in a smart
>>> representation, 12x more memory is a nogo. The techniques used in
>>> FactorizablePreferences allow a much more memory efficient
>> representation,
>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>> into
>>> 3GB with that approach.
>>>
>>>
>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>
>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>
>>>> Note also that this is before you have done any work.
>>>>
>>>> Ideally, 100million observations should take << 1GB.
>>>>
>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>> wrote:
>>>>> The second idea is indeed splendid, we should separate
>> time-complexity
>>>>> first and space-complexity first implementation. What I'm not quite
>>> sure,
>>>>> is that if we really need to create two interfaces instead of one.
>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>> laptop
>>>>> can already handle that (emphasis on laptop). And if we replace hash
>>> map
>>>>> (the culprit of high memory consumption) with list/linkedList, it
>> would
>>>>> simply degrade time complexity for a linear search to O(n), not too
>> bad
>>>>> either. The current DataModel is a result of careful thoughts and has
>>>>> underwent extensive test, it is easier to expand on top of it instead
>>> of
>>>>> subverting it.
>
>



Re: Regarding Online Recommenders

Posted by Abhishek Sharma <jk...@gmail.com>.
Sorry to interrupt, guys, but I just wanted to bring to your notice that
I am also interested in contributing to this idea. I am planning to
participate in the ASF-ICFOSS mentorship
programme<https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme>.
(This is very similar to GSoC.)

I do have strong concepts in machine learning (I have done the ML course by
Andrew Ng on Coursera), and I am good at programming (I have 2.5 yrs of work
experience). I am not really sure how I can approach this problem (but I
do have a strong interest in working on it), hence I would like to pair
up on this. I am currently working as a research intern at the Indian Institute
of Science (IISc), Bangalore, India, and can put in 15-20 hrs per week.

Please let me know your thoughts if I can be a part of this.

Thanks & Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21


On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan <gk...@gmail.com> wrote:

> Peng,
>
> This is the reason I separated out the DataModel, and only put the learner
> stuff there. The learner I mentioned yesterday just stores the
> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
> where preferences are stored.
>
> I, kind of, agree with the multi-level DataModel approach:
> One for iterating over "all" preferences, one for if one wants to deploy a
> recommender and perform a lot of top-N recommendation tasks.
>
> (Or one DataModel with a strategy that might reduce existing memory
> consumption, while still providing fast access, I am not sure. Let me try a
> matrix-backed DataModel approach)
>
> Gokhan
>
>
> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > I completely agree, Netflix is less than one gigabyte in a smart
> > representation, 12x more memory is a nogo. The techniques used in
> > FactorizablePreferences allow a much more memory efficient
> representation,
> > tested on KDD Music dataset which is approx 2.5 times Netflix and fits
> into
> > 3GB with that approach.
> >
> >
> > 2013/7/16 Ted Dunning <te...@gmail.com>
> >
> > > Netflix is a small dataset.  12G for that seems quite excessive.
> > >
> > > Note also that this is before you have done any work.
> > >
> > > Ideally, 100million observations should take << 1GB.
> > >
> > > On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
> > wrote:
> > >
> > > > The second idea is indeed splendid, we should separate
> time-complexity
> > > > first and space-complexity first implementation. What I'm not quite
> > sure,
> > > > is that if we really need to create two interfaces instead of one.
> > > > Personally, I think 12G heap space is not that high right? Most new
> > > laptop
> > > > can already handle that (emphasis on laptop). And if we replace hash
> > map
> > > > (the culprit of high memory consumption) with list/linkedList, it
> would
> > > > simply degrade time complexity for a linear search to O(n), not too
> bad
> > > > either. The current DataModel is a result of careful thoughts and has
> > > > underwent extensive test, it is easier to expand on top of it instead
> > of
> > > > subverting it.
> > >
> >
>



-- 
--
Abhishek Sharma
ThoughtWorks

Re: Regarding Online Recommenders

Posted by Gokhan Capan <gk...@gmail.com>.
Peng,

This is the reason I separated out the DataModel, and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
where preferences are stored.
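A sketch of what such a parameters-only learner state might look like (Python for brevity, class and method names hypothetical; a real Mahout implementation would be a Java Matrix that can grow rows, per the DenseRowMatrix/BlockSparseMatrix discussion):

```python
import numpy as np

class FactorStore:
    """Holds only the model parameters -- (noOfUsers + noOfItems) * k
    floats -- independent of where the raw preferences are stored.
    Rows are created lazily, so new users and items can arrive online."""
    def __init__(self, k, seed=0):
        self.k = k
        self.rng = np.random.default_rng(seed)
        self.user_factors = np.empty((0, k))
        self.item_factors = np.empty((0, k))

    def _grown(self, m, idx):
        # Append small random rows until row `idx` exists.
        while m.shape[0] <= idx:
            m = np.vstack([m, self.rng.normal(scale=0.1, size=(1, self.k))])
        return m

    def user(self, u):
        self.user_factors = self._grown(self.user_factors, u)
        return self.user_factors[u]

    def item(self, i):
        self.item_factors = self._grown(self.item_factors, i)
        return self.item_factors[i]

store = FactorStore(k=8)
p = store.user(3).dot(store.item(5))   # rows for users 0..3, items 0..5 created on demand
```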

I, kind of, agree with the multi-level DataModel approach:
One for iterating over "all" preferences, one for if one wants to deploy a
recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption, while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach)

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <ss...@apache.org> wrote:

> I completely agree, Netflix is less than one gigabyte in a smart
> representation, 12x more memory is a nogo. The techniques used in
> FactorizablePreferences allow a much more memory efficient representation,
> tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
> 3GB with that approach.
>
>
> 2013/7/16 Ted Dunning <te...@gmail.com>
>
> > Netflix is a small dataset.  12G for that seems quite excessive.
> >
> > Note also that this is before you have done any work.
> >
> > Ideally, 100million observations should take << 1GB.
> >
> > On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
> wrote:
> >
> > > The second idea is indeed splendid, we should separate time-complexity
> > > first and space-complexity first implementation. What I'm not quite
> sure,
> > > is that if we really need to create two interfaces instead of one.
> > > Personally, I think 12G heap space is not that high right? Most new
> > laptop
> > > can already handle that (emphasis on laptop). And if we replace hash
> map
> > > (the culprit of high memory consumption) with list/linkedList, it would
> > > simply degrade time complexity for a linear search to O(n), not too bad
> > > either. The current DataModel is a result of careful thoughts and has
> > > underwent extensive test, it is easier to expand on top of it instead
> of
> > > subverting it.
> >
>

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
Mmm, you are right, the simplest solution is usually the best. I'm
creating a new JIRA ticket.

Yours Peng

On 13-07-17 04:58 PM, Sebastian Schelter wrote:
> Hi Peng,
>
> I never wanted to discard the old interface, I just wanted to split it up.
> I want to have a simple interface that only supports sequential access (and
> allows for very memory efficient implementions, e.g. by the use of
> primitive arrays). DataModel should *extend* this interface and provide
> sequential and random access (basically what is already does).
>
> Then a recommender such as SGD could state that it only needs sequential
> access to the preferences and you can either feed it a DataModel (so we
> don't break backwards compatibility) or a memory efficient sequential
> access thingy.
>
> Does that make sense for you?
>
>
> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>
>> I see, OK so we shouldn't use the old implementation. But I mean, the old
>> interface doesn't have to be discarded. The discrepancy between your
>> FactorizablePreferences and DataModel is that, your model supports
>> getPreferences(), which returns all preferences as an iterator, and
>> DataModel supports a few old functions that returns preferences for an
>> individual user or item.
>>
>> My point is that, it is not hard for each of them to implement what they
>> lack of: old DataModel can implement getPreferences() just by a a loop in
>> abstract class. Your new FactorizablePreferences can implement those old
>> functions by a binary search that takes O(log n) time, or an interpolation
>> search that takes O(log log n) time in average. So does the online update.
>> It will just be a matter of different speed and space, but not different
>> interface standard, we can use old unit tests, old examples, old
>> everything. And we will be more flexible in writing ensemble recommender.
>>
>> Just a few thoughts, I'll have to validate the idea first before creating
>> a new JIRA ticket.
>>
>> Yours Peng
>>
>>
>>
>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>
>>> I completely agree, Netflix is less than one gigabyte in a smart
>>> representation, 12x more memory is a nogo. The techniques used in
>>> FactorizablePreferences allow a much more memory efficient representation,
>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>> into
>>> 3GB with that approach.
>>>
>>>
>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>
>>>   Netflix is a small dataset.  12G for that seems quite excessive.
>>>> Note also that this is before you have done any work.
>>>>
>>>> Ideally, 100million observations should take << 1GB.
>>>>
>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>> wrote:
>>>>
>>>>   The second idea is indeed splendid, we should separate time-complexity
>>>>> first and space-complexity first implementation. What I'm not quite
>>>>> sure,
>>>>> is that if we really need to create two interfaces instead of one.
>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>>
>>>> laptop
>>>>
>>>>> can already handle that (emphasis on laptop). And if we replace hash map
>>>>> (the culprit of high memory consumption) with list/linkedList, it would
>>>>> simply degrade time complexity for a linear search to O(n), not too bad
>>>>> either. The current DataModel is a result of careful thoughts and has
>>>>> underwent extensive test, it is easier to expand on top of it instead of
>>>>> subverting it.
>>>>>
>>



Re: Regarding Online Recommenders

Posted by Gokhan Capan <gk...@gmail.com>.
Ok, I tested the MatrixBackedDataModel, and the heap size is reduced to 7G
for the Netflix data, which is still large.

The same history is encoded in 2 SparseRowMatrices, one row-indexed by
users and the other by items.

It has serious concurrency issues in several places, though (sets and
removes need to be thread-safe).
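A minimal sketch of that thread-safety concern and one way to address it (Python, with dicts standing in for the two SparseRowMatrices; class and method names are illustrative, not the actual implementation):

```python
import threading
from collections import defaultdict

class DualIndexDataModel:
    """Keeps each preference twice -- once indexed by user, once by item,
    mirroring the two row-indexed matrices. A single lock keeps the two
    indexes consistent under concurrent sets and removes."""
    def __init__(self):
        self.by_user = defaultdict(dict)   # user -> {item: rating}
        self.by_item = defaultdict(dict)   # item -> {user: rating}
        self._lock = threading.Lock()

    def set_preference(self, user, item, value):
        with self._lock:                   # both indexes must change atomically
            self.by_user[user][item] = value
            self.by_item[item][user] = value

    def remove_preference(self, user, item):
        with self._lock:
            self.by_user[user].pop(item, None)
            self.by_item[item].pop(user, None)

m = DualIndexDataModel()
m.set_preference(1, 42, 5.0)
```

Without the lock, a set and a remove interleaving across the two indexes could leave a preference visible by user but not by item (or vice versa).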

Best

Gokhan


On Sat, Jul 20, 2013 at 12:15 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

> Hi,
>
> Just one simple question: Is the org.apache.mahout.math.BinarySearch.binarySearch()
> function an optimized version of Arrays.binarySearch()? If it is not, why
> implement it again?
>
> Yours Peng
>
>
> On 13-07-17 06:31 PM, Sebastian Schelter wrote:
>
>> You are completely right, the simple interface would only be usable for
>> readonly / batch-updatable recommenders. Online recommenders might need
>> something different. I tried to widen the discussion here to discuss all
>> kinds of API changes in the recommenders that would be necessary in the
>> future.
>>
>>
>>
>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>
>>  One thing that suddenly comes to my mind is that, for a simple interface
>>> like FactorizablePreferences, maybe sequential READ in real time is
>>> possible, but sequential WRITE in O(1) time is Utopia. Because you need
>>> to
>>> flush out old preference with same user and item ID (in worst case it
>>> could
>>> be an interpolation search), otherwise you are permitting a user rating
>>> an
>>> item twice with different values. Considering how FileDataModel suppose
>>> to
>>> work (new files flush old files), maybe using the simple interface has
>>> less
>>> advantages than we used to believe.
>>>
>>>
>>> On 13-07-17 04:58 PM, Sebastian Schelter wrote:
>>>
>>>  Hi Peng,
>>>>
>>>> I never wanted to discard the old interface, I just wanted to split it
>>>> up.
>>>> I want to have a simple interface that only supports sequential access
>>>> (and
>>>> allows for very memory efficient implementions, e.g. by the use of
>>>> primitive arrays). DataModel should *extend* this interface and provide
>>>> sequential and random access (basically what is already does).
>>>>
>>>> Then a recommender such as SGD could state that it only needs sequential
>>>> access to the preferences and you can either feed it a DataModel (so we
>>>> don't break backwards compatibility) or a memory efficient sequential
>>>> access thingy.
>>>>
>>>> Does that make sense for you?
>>>>
>>>>
>>>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>>>
>>>>   I see, OK so we shouldn't use the old implementation. But I mean, the
>>>> old
>>>>
>>>>> interface doesn't have to be discarded. The discrepancy between your
>>>>> FactorizablePreferences and DataModel is that, your model supports
>>>>> getPreferences(), which returns all preferences as an iterator, and
>>>>> DataModel supports a few old functions that returns preferences for an
>>>>> individual user or item.
>>>>>
>>>>> My point is that, it is not hard for each of them to implement what
>>>>> they
>>>>> lack of: old DataModel can implement getPreferences() just by a a loop
>>>>> in
>>>>> abstract class. Your new FactorizablePreferences can implement those
>>>>> old
>>>>> functions by a binary search that takes O(log n) time, or an
>>>>> interpolation
>>>>> search that takes O(log log n) time in average. So does the online
>>>>> update.
>>>>> It will just be a matter of different speed and space, but not
>>>>> different
>>>>> interface standard, we can use old unit tests, old examples, old
>>>>> everything. And we will be more flexible in writing ensemble
>>>>> recommender.
>>>>>
>>>>> Just a few thoughts, I'll have to validate the idea first before
>>>>> creating
>>>>> a new JIRA ticket.
>>>>>
>>>>> Yours Peng
>>>>>
>>>>>
>>>>>
>>>>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>>>>
>>>>>   I completely agree, Netflix is less than one gigabye in a smart
>>>>>
>>>>>> representation, 12x more memory is a nogo. The techniques used in
>>>>>> FactorizablePreferences allow a much more memory efficient
>>>>>> representation,
>>>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>>>>> into
>>>>>> 3GB with that approach.
>>>>>>
>>>>>>
>>>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>>>>
>>>>>>    Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>>
>>>>>>  Note also that this is before you have done any work.
>>>>>>>
>>>>>>> Ideally, 100million observations should take << 1GB.
>>>>>>>
>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>>>> wrote:
>>>>>>>
>>>>>>>    The second idea is indeed splendid, we should separate
>>>>>>> time-complexity
>>>>>>>
>>>>>>>  first and space-complexity first implementation. What I'm not quite
>>>>>>>> sure,
>>>>>>>> is that if we really need to create two interfaces instead of one.
>>>>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>>>>>
>>>>>>>>   laptop
>>>>>>>>
>>>>>>>   can already handle that (emphasis on laptop). And if we replace
>>>>>>> hash
>>>>>>>
>>>>>>>> map
>>>>>>>> (the culprit of high memory consumption) with list/linkedList, it
>>>>>>>> would
>>>>>>>> simply degrade time complexity for a linear search to O(n), not too
>>>>>>>> bad
>>>>>>>> either. The current DataModel is a result of careful thoughts and
>>>>>>>> has
>>>>>>>> underwent extensive test, it is easier to expand on top of it
>>>>>>>> instead
>>>>>>>> of
>>>>>>>> subverting it.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
>
>

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
Hi,

Just one simple question: Is the 
org.apache.mahout.math.BinarySearch.binarySearch() function an optimized 
version of Arrays.binarySearch()? If it is not, why implement it again?

Yours Peng

On 13-07-17 06:31 PM, Sebastian Schelter wrote:
> You are completely right, the simple interface would only be usable for
> readonly / batch-updatable recommenders. Online recommenders might need
> something different. I tried to widen the discussion here to discuss all
> kinds of API changes in the recommenders that would be necessary in the
> future.
>
>
>
> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>
>> One thing that suddenly comes to my mind is that, for a simple interface
>> like FactorizablePreferences, maybe sequential READ in real time is
>> possible, but sequential WRITE in O(1) time is Utopia. Because you need to
>> flush out old preference with same user and item ID (in worst case it could
>> be an interpolation search), otherwise you are permitting a user rating an
>> item twice with different values. Considering how FileDataModel suppose to
>> work (new files flush old files), maybe using the simple interface has less
>> advantages than we used to believe.
>>
>>
>> On 13-07-17 04:58 PM, Sebastian Schelter wrote:
>>
>>> Hi Peng,
>>>
>>> I never wanted to discard the old interface, I just wanted to split it up.
>>> I want to have a simple interface that only supports sequential access
>>> (and
>>> allows for very memory efficient implementions, e.g. by the use of
>>> primitive arrays). DataModel should *extend* this interface and provide
>>> sequential and random access (basically what is already does).
>>>
>>> Than a recommender such as SGD could state that it only needs sequential
>>> access to the preferences and you can either feed it a DataModel (so we
>>> don"t break backwards compatibility) or a memory efficient sequential
>>> access thingy.
>>>
>>> Does that make sense for you?
>>>
>>>
>>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>>
>>>   I see, OK so we shouldn't use the old implementation. But I mean, the old
>>>> interface doesn't have to be discarded. The discrepancy between your
>>>> FactorizablePreferences and DataModel is that, your model supports
>>>> getPreferences(), which returns all preferences as an iterator, and
>>>> DataModel supports a few old functions that returns preferences for an
>>>> individual user or item.
>>>>
>>>> My point is that, it is not hard for each of them to implement what they
>>>> lack of: old DataModel can implement getPreferences() just by a a loop in
>>>> abstract class. Your new FactorizablePreferences can implement those old
>>>> functions by a binary search that takes O(log n) time, or an
>>>> interpolation
>>>> search that takes O(log log n) time in average. So does the online
>>>> update.
>>>> It will just be a matter of different speed and space, but not different
>>>> interface standard, we can use old unit tests, old examples, old
>>>> everything. And we will be more flexible in writing ensemble recommender.
>>>>
>>>> Just a few thoughts, I'll have to validate the idea first before creating
>>>> a new JIRA ticket.
>>>>
>>>> Yours Peng
>>>>
>>>>
>>>>
>>>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>>>
>>>>   I completely agree, Netflix is less than one gigabye in a smart
>>>>> representation, 12x more memory is a nogo. The techniques used in
>>>>> FactorizablePreferences allow a much more memory efficient
>>>>> representation,
>>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>>>> into
>>>>> 3GB with that approach.
>>>>>
>>>>>
>>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>>>
>>>>>    Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>
>>>>>> Note also that this is before you have done any work.
>>>>>>
>>>>>> Ideally, 100million observations should take << 1GB.
>>>>>>
>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>>> wrote:
>>>>>>
>>>>>>    The second idea is indeed splendid, we should separate
>>>>>> time-complexity
>>>>>>
>>>>>>> first and space-complexity first implementation. What I'm not quite
>>>>>>> sure,
>>>>>>> is that if we really need to create two interfaces instead of one.
>>>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>>>>
>>>>>>>   laptop
>>>>>>   can already handle that (emphasis on laptop). And if we replace hash
>>>>>>> map
>>>>>>> (the culprit of high memory consumption) with list/linkedList, it
>>>>>>> would
>>>>>>> simply degrade time complexity for a linear search to O(n), not too
>>>>>>> bad
>>>>>>> either. The current DataModel is a result of careful thoughts and has
>>>>>>> underwent extensive test, it is easier to expand on top of it instead
>>>>>>> of
>>>>>>> subverting it.
>>>>>>>
>>>>>>>
>>



Re: Regarding Online Recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
You are completely right, the simple interface would only be usable for
read-only / batch-updatable recommenders. Online recommenders might need
something different. I tried to widen the discussion here to cover all
kinds of API changes in the recommenders that will be necessary in the
future.



2013/7/17 Peng Cheng <pc...@uowmail.edu.au>

> One thing that suddenly comes to my mind is that, for a simple interface
> like FactorizablePreferences, maybe sequential READ in real time is
> possible, but sequential WRITE in O(1) time is Utopia. Because you need to
> flush out old preference with same user and item ID (in worst case it could
> be an interpolation search), otherwise you are permitting a user rating an
> item twice with different values. Considering how FileDataModel suppose to
> work (new files flush old files), maybe using the simple interface has less
> advantages than we used to believe.
>
>
> On 13-07-17 04:58 PM, Sebastian Schelter wrote:
>
>> Hi Peng,
>>
>> I never wanted to discard the old interface, I just wanted to split it up.
>> I want to have a simple interface that only supports sequential access
>> (and
>> allows for very memory efficient implementions, e.g. by the use of
>> primitive arrays). DataModel should *extend* this interface and provide
>> sequential and random access (basically what is already does).
>>
>> Than a recommender such as SGD could state that it only needs sequential
>> access to the preferences and you can either feed it a DataModel (so we
>> don"t break backwards compatibility) or a memory efficient sequential
>> access thingy.
>>
>> Does that make sense for you?
>>
>>
>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>
>>  I see, OK so we shouldn't use the old implementation. But I mean, the old
>>> interface doesn't have to be discarded. The discrepancy between your
>>> FactorizablePreferences and DataModel is that, your model supports
>>> getPreferences(), which returns all preferences as an iterator, and
>>> DataModel supports a few old functions that returns preferences for an
>>> individual user or item.
>>>
>>> My point is that, it is not hard for each of them to implement what they
>>> lack of: old DataModel can implement getPreferences() just by a a loop in
>>> abstract class. Your new FactorizablePreferences can implement those old
>>> functions by a binary search that takes O(log n) time, or an
>>> interpolation
>>> search that takes O(log log n) time in average. So does the online
>>> update.
>>> It will just be a matter of different speed and space, but not different
>>> interface standard, we can use old unit tests, old examples, old
>>> everything. And we will be more flexible in writing ensemble recommender.
>>>
>>> Just a few thoughts, I'll have to validate the idea first before creating
>>> a new JIRA ticket.
>>>
>>> Yours Peng
>>>
>>>
>>>
>>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>>
>>>  I completely agree, Netflix is less than one gigabye in a smart
>>>> representation, 12x more memory is a nogo. The techniques used in
>>>> FactorizablePreferences allow a much more memory efficient
>>>> representation,
>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>>> into
>>>> 3GB with that approach.
>>>>
>>>>
>>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>>
>>>>   Netflix is a small dataset.  12G for that seems quite excessive.
>>>>
>>>>> Note also that this is before you have done any work.
>>>>>
>>>>> Ideally, 100million observations should take << 1GB.
>>>>>
>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>> wrote:
>>>>>
>>>>>   The second idea is indeed splendid, we should separate
>>>>> time-complexity
>>>>>
>>>>>> first and space-complexity first implementation. What I'm not quite
>>>>>> sure,
>>>>>> is that if we really need to create two interfaces instead of one.
>>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>>>
>>>>>>  laptop
>>>>>
>>>>>  can already handle that (emphasis on laptop). And if we replace hash
>>>>>> map
>>>>>> (the culprit of high memory consumption) with list/linkedList, it
>>>>>> would
>>>>>> simply degrade time complexity for a linear search to O(n), not too
>>>>>> bad
>>>>>> either. The current DataModel is a result of careful thoughts and has
>>>>>> underwent extensive test, it is easier to expand on top of it instead
>>>>>> of
>>>>>> subverting it.
>>>>>>
>>>>>>
>>>
>
>

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
One thing that suddenly comes to my mind is that, for a simple interface 
like FactorizablePreferences, maybe sequential READ in real time is 
possible, but sequential WRITE in O(1) time is Utopia. That's because you 
need to flush out the old preference with the same user and item IDs (in 
the worst case it could be an interpolation search), otherwise you are 
permitting a user to rate an item twice with different values. Considering 
how FileDataModel is supposed to work (new files flush old files), maybe 
using the simple interface has fewer advantages than we used to believe.
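To make the write cost concrete: if preferences are kept sorted by a packed (user, item) key, setPreference() must first search for an existing entry before overwriting or inserting, so it cannot be O(1). A hypothetical sketch (my own class, not Mahout code):

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not Mahout's API: (user, item, value) triples kept
 * sorted by a packed key. An update must binary-search for an existing
 * (user, item) pair to avoid duplicate ratings, so a write is O(log n) at
 * best (and O(n) here when the array must shift), never O(1).
 */
class SortedPreferenceStore {
  private long[] keys = new long[0];    // (userID << 32) | itemID
  private float[] values = new float[0];

  private static long key(int userID, int itemID) {
    return ((long) userID << 32) | (itemID & 0xFFFFFFFFL);
  }

  /** Overwrites an existing rating instead of duplicating it. */
  public void setPreference(int userID, int itemID, float value) {
    long k = key(userID, itemID);
    int pos = Arrays.binarySearch(keys, k);
    if (pos >= 0) {
      values[pos] = value;       // existing (user, item): flush the old value
    } else {
      int insert = -(pos + 1);   // insertion point per binarySearch contract
      keys = insertLong(keys, insert, k);
      values = insertFloat(values, insert, value);
    }
  }

  public Float getPreference(int userID, int itemID) {
    int pos = Arrays.binarySearch(keys, key(userID, itemID));
    return pos >= 0 ? values[pos] : null;
  }

  public int size() { return keys.length; }

  private static long[] insertLong(long[] a, int i, long v) {
    long[] b = Arrays.copyOf(a, a.length + 1);
    System.arraycopy(a, i, b, i + 1, a.length - i);
    b[i] = v;
    return b;
  }

  private static float[] insertFloat(float[] a, int i, float v) {
    float[] b = Arrays.copyOf(a, a.length + 1);
    System.arraycopy(a, i, b, i + 1, a.length - i);
    b[i] = v;
    return b;
  }
}
```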

On 13-07-17 04:58 PM, Sebastian Schelter wrote:
> Hi Peng,
>
> I never wanted to discard the old interface, I just wanted to split it up.
> I want to have a simple interface that only supports sequential access (and
> allows for very memory efficient implementions, e.g. by the use of
> primitive arrays). DataModel should *extend* this interface and provide
> sequential and random access (basically what is already does).
>
> Than a recommender such as SGD could state that it only needs sequential
> access to the preferences and you can either feed it a DataModel (so we
> don"t break backwards compatibility) or a memory efficient sequential
> access thingy.
>
> Does that make sense for you?
>
>
> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>
>> I see, OK so we shouldn't use the old implementation. But I mean, the old
>> interface doesn't have to be discarded. The discrepancy between your
>> FactorizablePreferences and DataModel is that, your model supports
>> getPreferences(), which returns all preferences as an iterator, and
>> DataModel supports a few old functions that returns preferences for an
>> individual user or item.
>>
>> My point is that, it is not hard for each of them to implement what they
>> lack of: old DataModel can implement getPreferences() just by a a loop in
>> abstract class. Your new FactorizablePreferences can implement those old
>> functions by a binary search that takes O(log n) time, or an interpolation
>> search that takes O(log log n) time in average. So does the online update.
>> It will just be a matter of different speed and space, but not different
>> interface standard, we can use old unit tests, old examples, old
>> everything. And we will be more flexible in writing ensemble recommender.
>>
>> Just a few thoughts, I'll have to validate the idea first before creating
>> a new JIRA ticket.
>>
>> Yours Peng
>>
>>
>>
>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>
>>> I completely agree, Netflix is less than one gigabye in a smart
>>> representation, 12x more memory is a nogo. The techniques used in
>>> FactorizablePreferences allow a much more memory efficient representation,
>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>> into
>>> 3GB with that approach.
>>>
>>>
>>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>>
>>>   Netflix is a small dataset.  12G for that seems quite excessive.
>>>> Note also that this is before you have done any work.
>>>>
>>>> Ideally, 100million observations should take << 1GB.
>>>>
>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>> wrote:
>>>>
>>>>   The second idea is indeed splendid, we should separate time-complexity
>>>>> first and space-complexity first implementation. What I'm not quite
>>>>> sure,
>>>>> is that if we really need to create two interfaces instead of one.
>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>>
>>>> laptop
>>>>
>>>>> can already handle that (emphasis on laptop). And if we replace hash map
>>>>> (the culprit of high memory consumption) with list/linkedList, it would
>>>>> simply degrade time complexity for a linear search to O(n), not too bad
>>>>> either. The current DataModel is a result of careful thoughts and has
>>>>> underwent extensive test, it is easier to expand on top of it instead of
>>>>> subverting it.
>>>>>
>>



Re: Regarding Online Recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory-efficient implementations, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what it already does).

Then a recommender such as SGD could state that it only needs sequential
access to the preferences, and you can either feed it a DataModel (so we
don't break backwards compatibility) or a memory-efficient sequential
access thingy.

Does that make sense to you?
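To illustrate why sequential access suffices for SGD, here is a rough sketch (a hypothetical class and names, not actual Mahout code) of a matrix factorizer whose train() only ever needs one preference triple at a time, so one sequential pass per epoch is enough:

```java
import java.util.Random;

/**
 * Hypothetical sketch: an SGD matrix factorizer that needs only
 * sequential passes over (user, item, rating) triples, never random
 * access to the preference data.
 */
class SequentialSGDFactorizer {
  final double[][] userFactors;
  final double[][] itemFactors;
  final int k;                   // number of latent features
  double learningRate = 0.01;
  double lambda = 0.05;          // regularization

  SequentialSGDFactorizer(int numUsers, int numItems, int k, long seed) {
    this.k = k;
    Random rnd = new Random(seed);
    userFactors = new double[numUsers][k];
    itemFactors = new double[numItems][k];
    // small random init so gradients have something to work with
    for (double[] row : userFactors)
      for (int f = 0; f < k; f++) row[f] = rnd.nextGaussian() * 0.1;
    for (double[] row : itemFactors)
      for (int f = 0; f < k; f++) row[f] = rnd.nextGaussian() * 0.1;
  }

  double predict(int u, int i) {
    double dot = 0;
    for (int f = 0; f < k; f++) dot += userFactors[u][f] * itemFactors[i][f];
    return dot;
  }

  /** One SGD step; also usable as an OnlineLearner-style update. */
  void train(int u, int i, double rating) {
    double err = rating - predict(u, i);
    for (int f = 0; f < k; f++) {
      double uf = userFactors[u][f];
      double itf = itemFactors[i][f];
      userFactors[u][f] += learningRate * (err * itf - lambda * uf);
      itemFactors[i][f] += learningRate * (err * uf - lambda * itf);
    }
  }
}
```

The same train() method serves both batch passes over a sequential-access store and single online updates, which is the point of the split.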


2013/7/17 Peng Cheng <pc...@uowmail.edu.au>

> I see, OK so we shouldn't use the old implementation. But I mean, the old
> interface doesn't have to be discarded. The discrepancy between your
> FactorizablePreferences and DataModel is that, your model supports
> getPreferences(), which returns all preferences as an iterator, and
> DataModel supports a few old functions that returns preferences for an
> individual user or item.
>
> My point is that, it is not hard for each of them to implement what they
> lack of: old DataModel can implement getPreferences() just by a a loop in
> abstract class. Your new FactorizablePreferences can implement those old
> functions by a binary search that takes O(log n) time, or an interpolation
> search that takes O(log log n) time in average. So does the online update.
> It will just be a matter of different speed and space, but not different
> interface standard, we can use old unit tests, old examples, old
> everything. And we will be more flexible in writing ensemble recommender.
>
> Just a few thoughts, I'll have to validate the idea first before creating
> a new JIRA ticket.
>
> Yours Peng
>
>
>
> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>
>> I completely agree, Netflix is less than one gigabye in a smart
>> representation, 12x more memory is a nogo. The techniques used in
>> FactorizablePreferences allow a much more memory efficient representation,
>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>> into
>> 3GB with that approach.
>>
>>
>> 2013/7/16 Ted Dunning <te...@gmail.com>
>>
>>  Netflix is a small dataset.  12G for that seems quite excessive.
>>>
>>> Note also that this is before you have done any work.
>>>
>>> Ideally, 100million observations should take << 1GB.
>>>
>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>> wrote:
>>>
>>>  The second idea is indeed splendid, we should separate time-complexity
>>>> first and space-complexity first implementation. What I'm not quite
>>>> sure,
>>>> is that if we really need to create two interfaces instead of one.
>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>
>>> laptop
>>>
>>>> can already handle that (emphasis on laptop). And if we replace hash map
>>>> (the culprit of high memory consumption) with list/linkedList, it would
>>>> simply degrade time complexity for a linear search to O(n), not too bad
>>>> either. The current DataModel is a result of careful thoughts and has
>>>> underwent extensive test, it is easier to expand on top of it instead of
>>>> subverting it.
>>>>
>>>
>
>

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
I see, OK, so we shouldn't use the old implementation. But I mean, the 
old interface doesn't have to be discarded. The discrepancy between your 
FactorizablePreferences and DataModel is that your model supports 
getPreferences(), which returns all preferences as an iterator, while 
DataModel supports a few old functions that return preferences for an 
individual user or item.

My point is that it is not hard for each of them to implement what the 
other has: the old DataModel can implement getPreferences() just by a loop 
in an abstract class. Your new FactorizablePreferences can implement those 
old functions by a binary search that takes O(log n) time, or an 
interpolation search that takes O(log log n) time on average. The same goes 
for the online update. It will just be a matter of different speed and 
space, but not a different interface standard; we can use old unit tests, 
old examples, old everything. And we will be more flexible in writing an 
ensemble recommender.
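For reference, the interpolation search mentioned above could be sketched like this (a hypothetical helper over a sorted primitive array, not existing Mahout code), reaching O(log log n) on average for roughly uniform keys:

```java
/**
 * Hypothetical sketch: interpolation search over a sorted long[] of IDs.
 * Instead of always probing the midpoint, it probes where the key would
 * fall if the values were uniformly distributed between the bounds.
 */
class InterpolationSearch {
  /** Returns the index of key in sorted array a, or -1 if absent. */
  static int search(long[] a, long key) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
      if (a[hi] == a[lo]) {
        return a[lo] == key ? lo : -1;   // all remaining values equal
      }
      // probe proportional to where key sits between a[lo] and a[hi];
      // double arithmetic avoids long overflow for wide key ranges
      int pos = lo + (int) (((double) (key - a[lo]) / (a[hi] - a[lo])) * (hi - lo));
      if (a[pos] == key) return pos;
      if (a[pos] < key) lo = pos + 1; else hi = pos - 1;
    }
    return -1;
  }
}
```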

Just a few thoughts, I'll have to validate the idea first before 
creating a new JIRA ticket.

Yours Peng


On 13-07-16 02:51 PM, Sebastian Schelter wrote:
> I completely agree, Netflix is less than one gigabye in a smart
> representation, 12x more memory is a nogo. The techniques used in
> FactorizablePreferences allow a much more memory efficient representation,
> tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
> 3GB with that approach.
>
>
> 2013/7/16 Ted Dunning <te...@gmail.com>
>
>> Netflix is a small dataset.  12G for that seems quite excessive.
>>
>> Note also that this is before you have done any work.
>>
>> Ideally, 100million observations should take << 1GB.
>>
>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>>
>>> The second idea is indeed splendid, we should separate time-complexity
>>> first and space-complexity first implementation. What I'm not quite sure,
>>> is that if we really need to create two interfaces instead of one.
>>> Personally, I think 12G heap space is not that high right? Most new
>> laptop
>>> can already handle that (emphasis on laptop). And if we replace hash map
>>> (the culprit of high memory consumption) with list/linkedList, it would
>>> simply degrade time complexity for a linear search to O(n), not too bad
>>> either. The current DataModel is a result of careful thoughts and has
>>> underwent extensive test, it is easier to expand on top of it instead of
>>> subverting it.



Re: Regarding Online Recommenders

Posted by Sebastian Schelter <ss...@apache.org>.
I completely agree, Netflix is less than one gigabyte in a smart
representation; 12x more memory is a no-go. The techniques used in
FactorizablePreferences allow a much more memory-efficient representation,
tested on the KDD Music dataset, which is approx. 2.5 times Netflix and fits
into 3GB with that approach.


2013/7/16 Ted Dunning <te...@gmail.com>

> Netflix is a small dataset.  12G for that seems quite excessive.
>
> Note also that this is before you have done any work.
>
> Ideally, 100million observations should take << 1GB.
>
> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>
> > The second idea is indeed splendid, we should separate time-complexity
> > first and space-complexity first implementation. What I'm not quite sure,
> > is that if we really need to create two interfaces instead of one.
> > Personally, I think 12G heap space is not that high right? Most new
> laptop
> > can already handle that (emphasis on laptop). And if we replace hash map
> > (the culprit of high memory consumption) with list/linkedList, it would
> > simply degrade time complexity for a linear search to O(n), not too bad
> > either. The current DataModel is a result of careful thoughts and has
> > underwent extensive test, it is easier to expand on top of it instead of
> > subverting it.
>

Re: Regarding Online Recommenders

Posted by Ted Dunning <te...@gmail.com>.
Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100 million observations should take << 1GB.
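Some back-of-the-envelope arithmetic behind that estimate (my own assumed numbers, using a CSR-like layout with one int offset per user plus one int item ID and one float rating per observation):

```java
/**
 * Rough memory estimate for 100M observations in a CSR-like layout:
 * user IDs are implicit in the row offsets, so each observation costs
 * one item ID plus one rating value.
 */
class MemoryEstimate {
  static long csrBytes(long numUsers, long numObservations) {
    long offsets = numUsers * Integer.BYTES;          // row offsets, one per user
    long itemIds = numObservations * Integer.BYTES;   // column indices
    long ratings = numObservations * Float.BYTES;     // values
    return offsets + itemIds + ratings;
  }

  public static void main(String[] args) {
    long bytes = csrBytes(500_000L, 100_000_000L);
    System.out.printf("%.2f GB%n", bytes / 1e9);      // ~0.8 GB, far below 12 GB
  }
}
```

At 8 bytes per observation plus a small per-user overhead, 100M ratings land around 0.8 GB, which is the order of magnitude Ted is pointing at.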

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

> The second idea is indeed splendid, we should separate time-complexity
> first and space-complexity first implementation. What I'm not quite sure,
> is that if we really need to create two interfaces instead of one.
> Personally, I think 12G heap space is not that high right? Most new laptop
> can already handle that (emphasis on laptop). And if we replace hash map
> (the culprit of high memory consumption) with list/linkedList, it would
> simply degrade time complexity for a linear search to O(n), not too bad
> either. The current DataModel is a result of careful thoughts and has
> underwent extensive test, it is easier to expand on top of it instead of
> subverting it.

Re: Regarding Online Recommenders

Posted by Peng Cheng <pc...@uowmail.edu.au>.
Yeah, setPreference() and removePreference() shouldn't be there, but 
injecting the Recommender back into the DataModel is a rather strong 
dependency, which may intermingle components with different concerns. Maybe 
we can do something with the RefreshHelper class? E.g. push something into 
a swap field so that the downstream of a refreshable chain can read it out. 
I have read Gokhan's UpdateAwareDataModel, and I feel that it's probably 
too heavyweight for a model selector, since every time the algorithm 
changes it has to be re-registered.

The second idea is indeed splendid: we should separate time-complexity-first 
and space-complexity-first implementations. What I'm not quite sure about 
is whether we really need to create two interfaces instead of one. 
Personally, I think 12G of heap space is not that high, right? Most new 
laptops can already handle that (emphasis on laptop). And if we replace the 
hash map (the culprit of the high memory consumption) with a list/linked 
list, it would simply degrade the time complexity of a lookup to a linear 
search's O(n), not too bad either. The current DataModel is the result of 
careful thought and has undergone extensive testing; it is easier to expand 
on top of it than to subvert it.

All the best,
Yours Peng

On 13-07-16 01:05 AM, Sebastian Schelter wrote:
> Hi Gokhan,
>
> I like your proposals and I think this is an important discussion. Peng
> is also interested in working on online recommenders, so we should try
> to team up our efforts. I'd like to extend the discussion a little to
> related API changes, that I think are necessary.
>
> What do you think about completely removing the setPreference() and
> removePreference() methods from Recommender? I think they don't belong
> there for two reasons: First,  they duplicate functionality from
> DataModel and second, a lot of recommenders are read-only/train-once and
> cannot handle single preference updates anyway.
>
> I think we should have a DataModel implementation that can be updated
> and an online learning recommender should be able to register to be
> notified with updates.
>
> We should further more split up the DataModel interface into a hierarchy
> of three parts:
>
> First, a simple readonly interface that allows sequential access to the
> data (similar to FactorizablePreferences). This allows us to create
> memory efficient implementations. E.g. Cheng reported in MAHOUT-1272
> that the current DataModel needs 12GB heap for the Netflix dataset (100M
> ratings) which is unacceptable. I was able to fit the KDD Music dataset
> (250M ratings) into 3GB with FactorizablePreferences.
>
> The second interface would extend the readonly interface and should
> resemble what DataModel is today: An easy-to-use in-memory
> implementation that trades high memory consumption for convenient random
> access.
>
> And finally the third interface would extend the second and provide
> tooling for online updates of the data.
>
> What do you think of that? Does it sound reasonable?
>
> --sebastian
>
>
>> The DataModel I imagine would follow the current API, where underlying
>> preference storage is replaced with a matrix.
>>
>> A Recommender would then use the DataModel and the OnlineLearner, where
>> Recommender#setPreference is delegated to DataModel#setPreference (like it
>> does now), and DataModel#setPreference triggers OnlineLearner#train.
>
>
>



Re: Regarding Online Recommenders

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Gokhan,

I like your proposals and I think this is an important discussion. Peng
is also interested in working on online recommenders, so we should try
to team up our efforts. I'd like to extend the discussion a little to
related API changes, that I think are necessary.

What do you think about completely removing the setPreference() and
removePreference() methods from Recommender? I think they don't belong
there for two reasons: first, they duplicate functionality from
DataModel, and second, a lot of recommenders are read-only/train-once and
cannot handle single preference updates anyway.

I think we should have a DataModel implementation that can be updated,
and an online learning recommender should be able to register to be
notified of updates.

We should furthermore split up the DataModel interface into a hierarchy
of three parts:

First, a simple read-only interface that allows sequential access to the
data (similar to FactorizablePreferences). This allows us to create
memory-efficient implementations. E.g. Cheng reported in MAHOUT-1272
that the current DataModel needs 12GB of heap for the Netflix dataset (100M
ratings), which is unacceptable. I was able to fit the KDD Music dataset
(250M ratings) into 3GB with FactorizablePreferences.

The second interface would extend the read-only interface and should
resemble what DataModel is today: an easy-to-use in-memory
implementation that trades high memory consumption for convenient random
access.

And finally, the third interface would extend the second and provide
tooling for online updates of the data.

What do you think of that? Does it sound reasonable?

--sebastian


> The DataModel I imagine would follow the current API, where underlying
> preference storage is replaced with a matrix.
> 
> A Recommender would then use the DataModel and the OnlineLearner, where
> Recommender#setPreference is delegated to DataModel#setPreference (like it
> does now), and DataModel#setPreference triggers OnlineLearner#train.
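The three-level hierarchy discussed in this thread might look roughly like the following (hypothetical names and signatures, sketched for discussion only, not actual Mahout code), including the listener hook through which a setPreference() on the data could trigger an online learner's train():

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed three-level hierarchy. */
class Pref {
  final long userID, itemID; final float value;
  Pref(long u, long i, float v) { userID = u; itemID = i; value = v; }
}

/** Level 1: read-only, sequential access; memory-efficient impls possible. */
interface SequentialPreferenceData {
  long numPreferences();
  Iterable<Pref> getPreferences();   // one sequential pass, enough for SGD
}

/** Level 2: adds convenient random access, like today's DataModel. */
interface RandomAccessPreferenceData extends SequentialPreferenceData {
  List<Pref> preferencesFromUser(long userID);
}

/** Level 3: adds online updates; learners register to be notified. */
interface UpdatablePreferenceData extends RandomAccessPreferenceData {
  void setPreference(long userID, long itemID, float value);
  void addListener(PreferenceListener l);
}

/** Callback an online learner would implement, e.g. to run one SGD step. */
interface PreferenceListener {
  void onPreference(long userID, long itemID, float value);
}

/** Minimal in-memory implementation, just to show the wiring. */
class InMemoryPreferenceData implements UpdatablePreferenceData {
  private final List<Pref> prefs = new ArrayList<>();
  private final Map<Long, List<Pref>> byUser = new HashMap<>();
  private final List<PreferenceListener> listeners = new ArrayList<>();

  public long numPreferences() { return prefs.size(); }
  public Iterable<Pref> getPreferences() { return prefs; }
  public List<Pref> preferencesFromUser(long userID) {
    return byUser.getOrDefault(userID, new ArrayList<>());
  }
  public void setPreference(long userID, long itemID, float value) {
    Pref p = new Pref(userID, itemID, value);
    prefs.add(p);
    byUser.computeIfAbsent(userID, k -> new ArrayList<>()).add(p);
    for (PreferenceListener l : listeners) l.onPreference(userID, itemID, value);
  }
  public void addListener(PreferenceListener l) { listeners.add(l); }
}
```

With this wiring, a Recommender delegates setPreference() to the data model, and any registered online learner is trained as a side effect, matching the flow Gokhan describes above.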