Posted to user@mahout.apache.org by Young <wo...@126.com> on 2010/07/19 10:19:48 UTC

How to combine boolean datamodel with datamodel

Hi,
I have three tables: one with preference values and two without. Does Mahout have an algorithm to integrate these tables into one DataModel?
 
Thank you

Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Hi sebastian
Thank you for your reply.
I do not mean precomputing the candidate items for each user, but precomputing each item's neighbors. I do not know whether this is an effective approach. On the 10M dataset from GroupLens, this method can take 8 seconds, so I need to improve the performance.






Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Young,

I don't know of a good way to precompute all candidate items for a user,
and I don't think it makes sense, as the data might become too large. But
the performance of getAllOtherItems(...) depends on the implementation
of DataModel you use. If all your preference data fits into memory (a
single preference needs 28 bytes according to Sean, AFAIK), you could
think about exporting the preferences to a file and using the
FileDataModel implementation, which loads all data into memory.

--sebastian
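
To check whether the data fits into memory, the 28-bytes-per-preference figure above can be turned into a quick estimate. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and the 28 bytes is the rule of thumb quoted above, not a measured value.

```java
// Back-of-the-envelope heap estimate for an in-memory preference model.
// Hypothetical helper, not part of Mahout. The 28 bytes per preference is
// the rough figure quoted in this thread, not a measured constant.
final class PreferenceMemoryEstimate {

  static final long BYTES_PER_PREFERENCE = 28L;

  /** Estimated heap needed for the given number of preferences, in bytes. */
  static long estimateBytes(long numPreferences) {
    return numPreferences * BYTES_PER_PREFERENCE;
  }

  /** Same estimate in mebibytes, rounded up. */
  static long estimateMiB(long numPreferences) {
    long mib = 1024L * 1024L;
    return (estimateBytes(numPreferences) + mib - 1) / mib;
  }

  public static void main(String[] args) {
    // Roughly the 1M and 10M GroupLens datasets mentioned in the thread.
    System.out.println(estimateMiB(1_000_209L) + " MiB");
    System.out.println(estimateMiB(10_000_000L) + " MiB");
  }
}
```

Under that assumption the preferences alone of a 10M dataset should fit in a 1 GB heap, although the data model's own index structures add overhead on top.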

On 20.07.2010 17:38, Young wrote:
> Hi again, 
> When I do item-based recommendation, I see some latency in getAllOtherItems(long userID), because it calculates the items' neighbors and merges them. So I am thinking: if I precompute each item's neighbors and store them in the database, then getAllOtherItems() could merge these neighbors directly. Would this reduce the latency?
> Or is there another way to make online recommendation much faster?
> Thank you.
>
>
>
>
>   
>> Yes you probably want a new, separate table. You have an extra step of
>> computing some notion of similarity anyway, and you probably want to
>> separate this table from your main data table anyhow for reasons of
>> performance and business logic separation.
>>
>> 2010/7/19 Young <wo...@126.com>:
>>     
>>> So my problem is that I want to build a DataModel based on what users have bought, added to their favorites, or rated.
>>> You mean I need a table that describes all of this user behavior? For example, if a user buys an item, I guess the preference is 4 and add that to the table?
>>>
>>>
>>>
>>>
>>>       
>>>> No, you need one table (or view if you like) containing all data. If
>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>> that can query multiple tables, or, that changes its SQL queries to
>>>> use UNION statements. I imagine it will slow down a lot.
>>>>
>>>> If you mean, can you use a table with preferences with a model that
>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>
>>>> 2010/7/19 Young <wo...@126.com>:
>>>>         
>>>>> Hi,
>>>>> I have three tables: one with preference values and two without. Does Mahout have an algorithm to integrate these tables into one DataModel?
>>>>>
>>>>> Thank you
>>>>>           
>>>       
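
The "one table containing all data" advice quoted above can be sketched in plain Java as a merge step that runs before the data is loaded. Everything here is an assumption made for illustration: the class and method names are hypothetical, the implied value of 4 for a purchase comes from the question above, and the value of 3 used for a favorite in the usage note is an arbitrary placeholder. An explicit rating overrides any implied value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: merges one table with ratings and
// two tables without ratings into a single (user, item) -> value map.
final class MergedPreferences {

  /** Builds the map key for one (user, item) pair. */
  static String key(long userID, long itemID) {
    return userID + "," + itemID;
  }

  /**
   * Each element of purchases and favorites is a {userID, itemID} pair.
   * Implied values are assumptions chosen by the caller.
   */
  static Map<String, Float> merge(Map<String, Float> ratings,
                                  List<long[]> purchases, float purchaseValue,
                                  List<long[]> favorites, float favoriteValue) {
    Map<String, Float> merged = new LinkedHashMap<>();
    for (long[] f : favorites) {
      merged.merge(key(f[0], f[1]), favoriteValue, Float::max);
    }
    for (long[] p : purchases) {
      // If a pair is both a favorite and a purchase, keep the larger value.
      merged.merge(key(p[0], p[1]), purchaseValue, Float::max);
    }
    // Explicit ratings win over implied values.
    merged.putAll(ratings);
    return merged;
  }

  /** Renders the merged data as "userID,itemID,value" lines. */
  static List<String> toCsvLines(Map<String, Float> merged) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, Float> e : merged.entrySet()) {
      lines.add(e.getKey() + "," + e.getValue());
    }
    return lines;
  }
}
```

For example, merge(ratings, purchases, 4.0f, favorites, 3.0f) yields one value per (user, item) pair, and the lines from toCsvLines could be written to a file or loaded into a single table. Whether 4 and 3 are sensible implied values depends on your data.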


Re:Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Thanks Ted, too. :)






Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Hi sebastian,
 
That makes sense. Thank you very much.
 
--- Young





>I did some inspection on the grouplens dataset and it turned out that
>Ted was absolutely right.
>
>I picked some random users and checked how many items
>getAllOtherItemIDs(...) returns for them. Actually the whole dataset is
>the result in most of the cases.
>
>So IMHO it's correct that the current implementation of
>getAllOtherItemIDs(...) is not suited for this specific dataset.
>
>Young, you should do some tests with your own data. If I remember correctly,
>you said it's purchases from an online shop; these should result in a
>much sparser user-item matrix and therefore much faster computations.
>
>--sebastian
>
>
>Anatomy of the data:
>
>number of preferences:                      1000209
>number of items:                               3706
>number of users:                               6040
>number of users with more than 500 prefs:       396
>number of items with more than 100 prefs:      2006
>
>
>Random userID, number of candidate items:
>
>1950,    3567
>2010,    2973
>4193,    3444
>734,    3658
>4655,    3364
>1569,    3611
>3717,    3407
>4313,    3608
>195,    2884
>3827,    3516
>3803,    3671
>3476,    3001
>1912,    2759
>1354,    3022
>3961,    3475
>2963,    3661
>3381,    3661
>5137,    3583
>3870,    3675
>2269,    3671
>1843,    3586
>5905,    3553
>2067,    3506
>456, 3548
>477, 3495
>
>
>
>On 21.07.2010 16:17, Ted Dunning wrote:
>> Is it possible that there are some items that all users see/rate/interact
>> with?
>>
>> That can cause problems like this because all users are then somewhat
>> similar and you wind up inspecting the entire rating matrix.
>>
>> Any such items should be added to a kill list.
>>
>> 2010/7/21 Sebastian Schelter <ss...@googlemail.com>
>>
>>   
>>> Well,
>>>
>>> there must be something wrong here, I've seen systems in production with
>>> similar sized data that responded clearly below 1 second all the time. I
>>> really can't imagine
>>> that FastIDSet would be the cause for this.
>>>
>>> Are you sure the JVM has all the machine for itself? No email
>>> application in the background checking mails, no memory swapping of the OS?
>>>
>>> If you want you can make your data and test code available and I can
>>> check it on my notebook.
>>>
>>> --sebastian
>>>
>>>
>>>
>>>
>>> On 21.07.2010 11:23, Young wrote:
>>>> Hi again,
>>>> I used a Java profiler and found that the latency comes from
>>>> FastIDSet.addAll() and FastIDSet.add().
>>>> Below is the source code.
>>>>
>>>>   protected FastIDSet getAllOtherItems(long theUserID) throws TasteException {
>>>>     ......
>>>>       for (int j = 0; j < size2; j++) {
>>>>         possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(prefs2.getUserID(j)));
>>>>       }
>>>>     ......
>>>>   }
>>>>
>>>>   public FastIDSet getItemIDsFromUser(long userID) throws TasteException {
>>>>     PreferenceArray prefs = getPreferencesFromUser(userID);
>>>>     int size = prefs.length();
>>>>     FastIDSet result = new FastIDSet(size);
>>>>     for (int i = 0; i < size; i++) {
>>>>       result.add(prefs.getItemID(i));
>>>>     }
>>>>     return result;
>>>>   }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>       
>>>>> Yes, I use the GenericDataModel, which is in-memory.
>>>>>
>>>>>> Is only the similarity matrix in-memory? The crucial thing here is the
>>>>>> data model, not the similarity matrix. Are you using an in-memory data
>>>>>> model?
>>>>>>
>>>>>> On 21.07.2010 08:33, Young wrote:
>>>>>>> Yes, I am pretty sure. I have stored the similarity matrix in memory,
>>>>>>> and I print out the time spent in getAllOtherItems(); it is the only
>>>>>>> time-consuming method in the recommendation. My laptop CPU is an Intel
>>>>>>> P8600 at 2.4 GHz, and the memory used for the JVM is 1 GB.
>>>     
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>             
>>>>>>>> Hi Young,
>>>>>>>>
>>>>>>>> I would disagree that a response time of 6 seconds is OK for online
>>>>>>>> recommendations, the time should be something like < 100ms.
>>>>>>>> I'm really surprised that you would see such response times with an
>>>>>>>> in-memory data model, I have experience with in-memory models of
>>>>>>>> roughly the same size and usually the computations are blazingly fast.
>>>>>>>>
>>>>>>>> Are you absolutely sure that the time is spent in this method and not
>>>>>>>> later in the similarity computation?
>>>>>>>>
>>>>>>>> --sebastian
>>>>>>>>
>>>>>>>> On 21.07.2010 07:54, Young wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>               
>>>>>>>>> So based on the 1M dataset, the time spent in
>>>>>>>>> getAllOtherItems(userID) is between 2 and 10 seconds.
>>>>>>>>> For example, if one user rates 200 items and, for each item, the time
>>>>>>>>> spent calculating the neighbors is expected to be 30 ms, that makes
>>>>>>>>> 6 seconds. It is generally okay. But if the dataset is expanded to a
>>>>>>>>> 100M dataset, I think 30 ms may grow to 30 * 100 ms, and that will be
>>>>>>>>> a long time.
>>>     
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                 
>>>>>>>>>> It still seems strange to observe such a bottleneck, I'm not sure
>>>>>>>>>> what's going on.
>>>>>>>>>> You are using an in-memory model like GenericDataModel?
>>>>>>>>>> We could look at ways to optimize that method, though it looks
>>>>>>>>>> reasonably tight.
>>>>>>>>>> Where within that method do you see time spent?
>
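
The effect described in the quoted discussion (one item that nearly everyone has rated pulls in almost the whole catalogue) can be reproduced with a small plain-Java sketch of co-occurrence candidate collection. This is not Mahout's code: the class, its maps, and the kill-list handling are assumptions for illustration. Here a kill-listed item is simply not used as a starting point for expansion, which is one possible reading of the kill-list suggestion.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Mahout's implementation: collects candidate items
// by visiting every user who shares an item with the target user.
final class CooccurrenceCandidates {

  private final Map<Long, Set<Long>> itemsByUser = new HashMap<>();
  private final Map<Long, Set<Long>> usersByItem = new HashMap<>();

  void addPreference(long userID, long itemID) {
    itemsByUser.computeIfAbsent(userID, k -> new HashSet<>()).add(itemID);
    usersByItem.computeIfAbsent(itemID, k -> new HashSet<>()).add(userID);
  }

  /**
   * Items that co-occur with the user's own items, excluding the own items.
   * Items on the kill list are not expanded, so a single very popular item
   * no longer drags in every other user's items.
   */
  Set<Long> candidates(long userID, Set<Long> killList) {
    Set<Long> own = itemsByUser.getOrDefault(userID, Set.of());
    Set<Long> result = new HashSet<>();
    for (Long itemID : own) {
      if (killList.contains(itemID)) {
        continue;
      }
      for (Long otherUser : usersByItem.get(itemID)) {
        result.addAll(itemsByUser.get(otherUser));
      }
    }
    result.removeAll(own);
    return result;
  }
}
```

With a dense dataset the nested loops touch most of the rating matrix for every request, which matches the profiler finding that the time goes into addAll() and add().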

Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Thank you very much.






Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
It's attached here: https://issues.apache.org/jira/browse/MAHOUT-445

If you want to use the test code you sent yesterday with the patch, you
would need to change the way the recommender is created to:

new GenericItemBasedRecommender(model, itemSimilarity, new AllUnknownItemsCandidateItemsStrategy())

--sebastian
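
For readers without the patch at hand, the idea of injecting the candidate-fetching logic can be sketched in plain Java. The names below (CandidateStrategySketch, AllUnknownItemsSketch, RecommenderSketch) are made up for this illustration and are not the classes from the patch; the sketch only shows the pattern of passing a strategy into the constructor instead of subclassing.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical interface, loosely modelled on the idea described above.
interface CandidateStrategySketch {
  Set<Long> candidateItems(Set<Long> itemsOfUser, Set<Long> allItems);
}

// Returns every item the user has not rated yet.
final class AllUnknownItemsSketch implements CandidateStrategySketch {
  @Override
  public Set<Long> candidateItems(Set<Long> itemsOfUser, Set<Long> allItems) {
    Set<Long> result = new HashSet<>(allItems);
    result.removeAll(itemsOfUser);
    return result;
  }
}

// Stand-in for a recommender that receives its strategy from the caller.
final class RecommenderSketch {
  private final Set<Long> allItems;
  private final CandidateStrategySketch strategy;

  RecommenderSketch(Set<Long> allItems, CandidateStrategySketch strategy) {
    this.allItems = allItems;
    this.strategy = strategy;
  }

  Set<Long> candidatesFor(Set<Long> itemsOfUser) {
    return strategy.candidateItems(itemsOfUser, allItems);
  }
}
```

A use-case specific strategy can then be supplied as another implementation or a lambda, without touching the recommender class itself.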

On 22.07.2010 15:15, Young wrote:
> Hi Sebastian,
> Thank you. Where can we download the patch?
>  
> ---Young
>
>
>
>
>
>   
>> Hi all,
>>
>> I did a little refactoring today to be able to inject customized ways of
>> fetching the candidate items. I wrote another implementation that just
>> returns all items not yet rated by the user. This won't be suitable for
>> large datasets but it did quite well for the grouplens dataset (some
>> testing results attached). I'm gonna create a patch so you can have a
>> look at the refactoring and if you decide to commit it, it could be a
>> suitable starting point for implementing Ted's proposed way of candidate
>> item fetching.
>>
>> Another advantage of that patch is that users could supply use-case
>> specific implementations of candidate item fetching without having to
>> subclass the recommender of their choice.
>>
>> --sebastian
>>
>> Tests for random users with different candidate item fetching strategies
>> (grouplens dataset)
>>
>> User 1063
>> found 3605 items in 2376ms (current approach)
>> found 3606 items in 1ms (all unknown items)
>>
>> User 3596
>> found 3575 items in 1889ms (current approach)
>> found 3578 items in 2ms (all unknown items)
>>
>> User 3300
>> found 3343 items in 6603ms (current approach)
>> found 3344 items in 0ms (all unknown items)
>>
>> User 924
>> found 3507 items in 4173ms (current approach)
>> found 3507 items in 4ms (all unknown items)
>>
>> User 4505
>> found 3427 items in 4774ms (current approach)
>> found 3427 items in 1ms (all unknown items)
>>
>> User 3378
>> found 3471 items in 4225ms (current approach)
>> found 3471 items in 0ms (all unknown items)
>>
>> User 246
>> found 3673 items in 730ms (current approach)
>> found 3677 items in 0ms (all unknown items)
>>
>>
>> On 22.07.2010 02:00, Ted Dunning wrote:
>>     
>>> This is a ubiquitous problem with cooccurrence algorithms, since they scale with
>>> the square of the number of occurrences of the most popular item.
>>>
>>> The good news is that you learn everything there is to learn about that item
>>> if you look at just a sampling of the occurrences so sampling is your
>>> friend.  If there is temporal structure, I tend to bias the sample toward
>>> recent items.
>>>
>>> Regarding the size, I have generally had an arbitrary cutoff attached to a
>>> configuration knob in my production systems.  It is probably reasonable to
>>> set this limit to something like max(100, 20*log(max(N_users, N_items))).
>>>  This isn't really any less arbitrary, but it will probably never need
>>> tweaking in normal use.
>>>   
>>>       
>>     
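
The cutoff formula quoted above, max(100, 20*log(max(N_users, N_items))), together with sampling, could look roughly like this in Java. Two assumptions are made here: the logarithm is taken as the natural log (the message does not name a base), and the sample is uniform, so the bias toward recent occurrences mentioned above is not implemented. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the sampling idea from the quoted message.
final class OccurrenceSampling {

  /** max(100, 20 * ln(max(numUsers, numItems))); natural log is an assumption. */
  static int cutoff(long numUsers, long numItems) {
    double scaled = 20.0 * Math.log(Math.max(numUsers, numItems));
    return (int) Math.max(100.0, scaled);
  }

  /** Uniform random sample of at most 'limit' occurrences; input is not modified. */
  static <T> List<T> sample(List<T> occurrences, int limit, Random random) {
    List<T> copy = new ArrayList<>(occurrences);
    if (copy.size() <= limit) {
      return copy;
    }
    Collections.shuffle(copy, random);
    return new ArrayList<>(copy.subList(0, limit));
  }
}
```

For the dataset discussed in this thread (6040 users, 3706 items) this gives a cutoff of 174 under the natural-log assumption, so even the most popular item would contribute at most 174 sampled occurrences.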


Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Hi Sebastian,
Thank you. Where can we download the patch?
 
---Young





>Hi all,
>
>I did a little refactoring today to be able to inject customized ways of
>fetching the candidate items. I wrote another implementation that just
>returns all items not yet rated by the user. This won't be suitable for
>large datasets but it did quite well for the grouplens dataset (some
>testing results attached). I'm gonna create a patch so you can have a
>look at the refactoring and if you decide to commit it, it could be a
>suitable starting point for implementing Ted's proposed way of candidate
>item fetching.
>
>Another advantage of that patch is that users could supply use-case
>specific implementations of candidate item fetching without having to
>subclass the recommender of their choice.
>
>--sebastian
>
>Tests for random users with different candidate item fetching strategies
>(grouplens dataset)
>
>User 1063
>found 3605 items in 2376ms (current approach)
>found 3606 items in 1ms (all unknown items)
>
>User 3596
>found 3575 items in 1889ms (current approach)
>found 3578 items in 2ms (all unknown items)
>
>User 3300
>found 3343 items in 6603ms (current approach)
>found 3344 items in 0ms (all unknown items)
>
>User 924
>found 3507 items in 4173ms (current approach)
>found 3507 items in 4ms (all unknown items)
>
>User 4505
>found 3427 items in 4774ms (current approach)
>found 3427 items in 1ms (all unknown items)
>
>User 3378
>found 3471 items in 4225ms (current approach)
>found 3471 items in 0ms (all unknown items)
>
>User 246
>found 3673 items in 730ms (current approach)
>found 3677 items in 0ms (all unknown items)
>
>
>Am 22.07.2010 02:00, schrieb Ted Dunning:
>> This is a ubiquitous problem with cooccurrence algorithms since they scale with
>> the square of the number of occurrences of the most popular item.
>>
>> The good news is that you learn everything there is to learn about that item
>> if you look at just a sampling of the occurrences so sampling is your
>> friend.  If there is temporal structure, I tend to bias the sample toward
>> recent items.
>>
>> Regarding the size, I have generally had an arbitrary cutoff attached to a
>> configuration knob in my production systems.  It is probably reasonable to
>> set this limit to something like max(100, 20*log(max(N_users, N_items))).
>>  This isn't really any less arbitrary, but it will probably never need
>> tweaking in normal use.
>>   
>

Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi all,

I did a little refactoring today to be able to inject customized ways of
fetching the candidate items. I wrote another implementation that just
returns all items not yet rated by the user. This won't be suitable for
large datasets but it did quite well for the grouplens dataset (some
testing results attached). I'm gonna create a patch so you can have a
look at the refactoring and if you decide to commit it, it could be a
suitable starting point for implementing Ted's proposed way of candidate
item fetching.

Another advantage of that patch is that users could supply use-case
specific implementations of candidate item fetching without having to
subclass the recommender of their choice.
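
A minimal sketch of what such a pluggable candidate-item fetcher could look like. It uses plain Java collections instead of Mahout's DataModel so that it is self-contained; the names CandidateItems, Strategy and ALL_UNKNOWN_ITEMS are hypothetical and are not taken from the patch.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names are illustrative, not the ones used in the patch.
final class CandidateItems {

  /** Pluggable way of choosing which items get scored for a user. */
  interface Strategy {
    Set<Long> candidates(long userID, Map<Long, Set<Long>> itemsByUser, Set<Long> allItems);
  }

  /** Returns every item the user has not rated yet. */
  static final Strategy ALL_UNKNOWN_ITEMS = (userID, itemsByUser, allItems) -> {
    Set<Long> result = new HashSet<>(allItems);
    result.removeAll(itemsByUser.getOrDefault(userID, Collections.emptySet()));
    return result;
  };
}
```

A recommender could then take a Strategy as a constructor argument, so a use-case specific implementation can be supplied without subclassing.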

--sebastian

Tests for random users with different candidate item fetching strategies
(grouplens dataset)

User 1063
found 3605 items in 2376ms (current approach)
found 3606 items in 1ms (all unknown items)

User 3596
found 3575 items in 1889ms (current approach)
found 3578 items in 2ms (all unknown items)

User 3300
found 3343 items in 6603ms (current approach)
found 3344 items in 0ms (all unknown items)

User 924
found 3507 items in 4173ms (current approach)
found 3507 items in 4ms (all unknown items)

User 4505
found 3427 items in 4774ms (current approach)
found 3427 items in 1ms (all unknown items)

User 3378
found 3471 items in 4225ms (current approach)
found 3471 items in 0ms (all unknown items)

User 246
found 3673 items in 730ms (current approach)
found 3677 items in 0ms (all unknown items)


Am 22.07.2010 02:00, schrieb Ted Dunning:
> This is a ubiquitous problem with cooccurrence algorithms since they scale with
> the square of the number of occurrences of the most popular item.
>
> The good news is that you learn everything there is to learn about that item
> if you look at just a sampling of the occurrences so sampling is your
> friend.  If there is temporal structure, I tend to bias the sample toward
> recent items.
>
> Regarding the size, I have generally had an arbitrary cutoff attached to a
> configuration knob in my production systems.  It is probably reasonable to
> set this limit to something like max(100, 20*log(max(N_users, N_items))).
>  This isn't really any less arbitrary, but it will probably never need
> tweaking in normal use.
>   


Re: How to combine boolean datamodel with datamodel

Posted by Ted Dunning <te...@gmail.com>.
This is a ubiquitous problem with cooccurrence algorithms since they scale with
the square of the number of occurrences of the most popular item.

The good news is that you learn everything there is to learn about that item
if you look at just a sampling of the occurrences so sampling is your
friend.  If there is temporal structure, I tend to bias the sample toward
recent items.
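
As a rough illustration of the sampling idea, the sketch below caps the number of occurrences inspected for one item by drawing a uniform random sample. The class and method names are hypothetical, and the recency bias mentioned above is deliberately left out; this is plain uniform sampling only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: caps the occurrences inspected for one popular item.
final class OccurrenceSampler {

  /** Returns at most limit user IDs chosen uniformly at random, or a copy of all of them if there are fewer. */
  static List<Long> sample(List<Long> userIDs, int limit, Random random) {
    if (userIDs.size() <= limit) {
      return new ArrayList<>(userIDs);
    }
    List<Long> copy = new ArrayList<>(userIDs);
    Collections.shuffle(copy, random);  // uniform shuffle, then keep the first limit entries
    return new ArrayList<>(copy.subList(0, limit));
  }
}
```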

Regarding the size, I have generally had an arbitrary cutoff attached to a
configuration knob in my production systems.  It is probably reasonable to
set this limit to something like max(100, 20*log(max(N_users, N_items))).
 This isn't really any less arbitrary, but it will probably never need
tweaking in normal use.
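
The cutoff formula can be written down directly. In this sketch the natural logarithm is an assumption (the formula above does not name a base), and the class and method names are made up for illustration.

```java
// Sketch of the suggested cutoff; natural log is an assumption, no base is given above.
final class CandidateCutoff {

  /** max(100, 20 * ln(max(numUsers, numItems))), rounded to the nearest integer. */
  static int cutoff(long numUsers, long numItems) {
    double scaled = 20.0 * Math.log(Math.max(numUsers, numItems));
    return (int) Math.max(100, Math.round(scaled));
  }
}
```

For the grouplens figures quoted elsewhere in this thread (6040 users, 3706 items) this gives 174, so the floor of 100 only matters for very small datasets.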

On Wed, Jul 21, 2010 at 12:32 PM, Sean Owen <sr...@gmail.com> wrote:

> Ah so it really is a function of those particular items.
> Well we can probably modify this function to be smarter and cap,
> somehow, the number of items considered.
> I'm just struggling to figure out how to do so without drawing
> arbitrary boundaries, like taking the top 100, etc.
>
> On Wed, Jul 21, 2010 at 8:46 PM, Sebastian Schelter
> <ss...@googlemail.com> wrote:
> > I did some inspection on the grouplens dataset and it turned out that
> > Ted was absolutely right.
>

Re: How to combine boolean datamodel with datamodel

Posted by Sean Owen <sr...@gmail.com>.
Ah so it really is a function of those particular items.
Well we can probably modify this function to be smarter and cap,
somehow, the number of items considered.
I'm just struggling to figure out how to do so without drawing
arbitrary boundaries, like taking the top 100, etc.

On Wed, Jul 21, 2010 at 8:46 PM, Sebastian Schelter
<ss...@googlemail.com> wrote:
> I did some inspection on the grouplens dataset and it turned out that
> Ted was absolutely right.
>
> I picked some random users and checked how many items
> getAllOtherItemIDs(...) returns for them. In most cases the result is
> nearly the whole dataset.
>
> So IMHO it's correct that the current implementation of
> getAllOtherItemIDs(...) is not suited for this specific dataset.
>
> Young, you should run some tests with your own data. If I remember correctly,
> you said it consists of purchases from an online shop; these should result in a
> much sparser user-item matrix and therefore much faster computations.
>
> --sebastian
>
>
> Anatomy of the data:
>
> number of preferences:    1000209
> number of items:        3706
> number of users:        6040
> number of users with more than 500 prefs:    396
> number of items with more than 100 prefs:  2006
>
>
> Random userID, number of candidate items:
>
> 1950,    3567
> 2010,    2973
> 4193,    3444
> 734,    3658
> 4655,    3364
> 1569,    3611
> 3717,    3407
> 4313,    3608
> 195,    2884
> 3827,    3516
> 3803,    3671
> 3476,    3001
> 1912,    2759
> 1354,    3022
> 3961,    3475
> 2963,    3661
> 3381,    3661
> 5137,    3583
> 3870,    3675
> 2269,    3671
> 1843,    3586
> 5905,    3553
> 2067,    3506
> 456, 3548
> 477, 3495
>
>
>

Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
I did some inspection on the grouplens dataset and it turned out that
Ted was absolutely right.

I picked some random users and checked how many items
getAllOtherItemIDs(...) returns for them. In most cases the result is
nearly the whole dataset.

So IMHO it's correct that the current implementation of
getAllOtherItemIDs(...) is not suited for this specific dataset.
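
To make the effect concrete, here is a simplified stand-in for the co-occurrence candidate search, written against plain Java maps rather than the Mahout DataModel. The class name and the two map parameters are assumptions for illustration; it is not the actual Mahout implementation.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the co-occurrence candidate search discussed in this thread.
final class CooccurrenceCandidates {

  /** Collects every item rated by any user who shares at least one item with userID. */
  static Set<Long> candidates(long userID,
                              Map<Long, Set<Long>> itemsByUser,
                              Map<Long, Set<Long>> usersByItem) {
    Set<Long> own = itemsByUser.getOrDefault(userID, Collections.emptySet());
    Set<Long> result = new HashSet<>();
    for (long itemID : own) {
      for (long otherUser : usersByItem.getOrDefault(itemID, Collections.emptySet())) {
        result.addAll(itemsByUser.getOrDefault(otherUser, Collections.emptySet()));
      }
    }
    result.removeAll(own);  // items the user already has are not candidates
    return result;
  }
}
```

As soon as one of the user's items has been rated by almost everybody, the inner loop visits almost every user and the result approaches the full item set, at the cost of one addAll per visited user.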

Young, you should run some tests with your own data. If I remember correctly,
you said it consists of purchases from an online shop; these should result in a
much sparser user-item matrix and therefore much faster computations.

--sebastian


Anatomy of the data:

number of preferences:    1000209
number of items:        3706
number of users:        6040
number of users with more than 500 prefs:    396
number of items with more than 100 prefs:  2006
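
The density of the user-item matrix follows directly from these figures. The helper below is only an illustrative sketch with a made-up class name.

```java
// Density of the user-item matrix: filled cells divided by all cells.
final class MatrixDensity {

  static double density(long numPrefs, long numUsers, long numItems) {
    return (double) numPrefs / ((double) numUsers * numItems);
  }
}
```

With 1000209 preferences, 6040 users and 3706 items this comes out at roughly 0.0447, i.e. about 4.5% of all user-item cells are filled.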


Random userID, number of candidate items:

1950,    3567
2010,    2973
4193,    3444
734,    3658
4655,    3364
1569,    3611
3717,    3407
4313,    3608
195,    2884
3827,    3516
3803,    3671
3476,    3001
1912,    2759
1354,    3022
3961,    3475
2963,    3661
3381,    3661
5137,    3583
3870,    3675
2269,    3671
1843,    3586
5905,    3553
2067,    3506
456,    3548
477,    3495



Am 21.07.2010 16:17, schrieb Ted Dunning:
> Is it possible that there are some items that all users see/rate/interact
> with?
>
> That can cause problems like this because all users are then somewhat
> similar and you wind up inspecting the entire rating matrix.
>
> Any such items should be added to a kill list.
>


Re: How to combine boolean datamodel with datamodel

Posted by Ted Dunning <te...@gmail.com>.
Is it possible that there are some items that all users see/rate/interact
with?

That can cause problems like this because all users are then somewhat
similar and you wind up inspecting the entire rating matrix.

Any such items should be added to a kill list.
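
One possible way to build such a kill list is sketched below. The fraction threshold is an illustrative knob and not a value from this thread, and the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a kill list; the threshold is an illustrative knob.
final class KillList {

  /** Returns the items that more than maxFraction of all users have interacted with. */
  static Set<Long> build(Map<Long, Set<Long>> usersByItem, int numUsers, double maxFraction) {
    Set<Long> kill = new HashSet<>();
    for (Map.Entry<Long, Set<Long>> entry : usersByItem.entrySet()) {
      if (entry.getValue().size() > maxFraction * numUsers) {
        kill.add(entry.getKey());
      }
    }
    return kill;
  }
}
```

Items on the list would then be skipped when collecting candidates, so that a single near-universal item no longer connects a user to the entire rating matrix.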

2010/7/21 Sebastian Schelter <ss...@googlemail.com>

> Well,
>
> there must be something wrong here. I've seen systems in production with
> similarly sized data that consistently responded in well under 1 second. I
> really can't imagine that FastIDSet would be the cause of this.
>
> Are you sure the JVM has the whole machine to itself? No email
> application checking mail in the background, no swapping by the OS?
>
> If you want you can make your data and test code available and I can
> check it on my notebook.
>
> --sebastian
>
>
>
>

Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
Well,

there must be something wrong here. I've seen systems in production with
similarly sized data that consistently responded in well under 1 second. I
really can't imagine that FastIDSet would be the cause of this.

Are you sure the JVM has the whole machine to itself? No email
application checking mail in the background, no swapping by the OS?

If you want you can make your data and test code available and I can
check it on my notebook.

--sebastian




Am 21.07.2010 11:23, schrieb Young:
> Hi again, 
> I use java profiler to find out the latency comes from the FastIDSet.addAll() and FastIDSet.add()..
>  
> Blows are source code.
>  
>  protected FastIDSet getAllOtherItems(long theUserID) throws TasteException {
>   ......
>       for (int j = 0; j < size2; j++) {
>         possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(prefs2.getUserID(j)));
>       }
>   ......
> }
>   public FastIDSet getItemIDsFromUser(long userID) throws TasteException {
>     PreferenceArray prefs = getPreferencesFromUser(userID);
>     int size = prefs.length();
>     FastIDSet result = new FastIDSet(size);
>     for (int i = 0; i < size; i++) {
>       result.add(prefs.getItemID(i));
>     }
>     return result;
>   } 
>  
>  
>
>  
>
>
>
>
>   
>> Yes, I use the genericdatamodel which is in-memory. 
>>
>>
>>
>>
>>
>>     
>>> Is only the similarity matrix in-memory? The crucial thing here is the
>>> data model not the similarity matrix, are you using an in-memory data model?
>>>
>>> Am 21.07.2010 08:33, schrieb Young:
>>>       
>>>> Yes, I am pretty sure. I have stored the similarity matrix in-memory and I print out the time spent in getAllOtherItems() and this is the only one time-consuming method in the recommendation. My laptop CPU is Intel P8600 2.4G, and the memory used for JVM is 1GB. 
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>   
>>>>         
>>>>> Hi Young,
>>>>>
>>>>> I would disagree that a response time of 6 seconds is OK for online
>>>>> recommendations, the time should be something like < 100ms.
>>>>> I'm really surprised that you would see such response times with an
>>>>> in-memory data model, I have experience with in-memory models of roughly
>>>>> the same size
>>>>> and usually the computations are blazingly fast.
>>>>>
>>>>> Are you absolutely sure that the time is spent in this method and not
>>>>> later in the similarity computation?
>>>>>
>>>>> --sebastian
>>>>>
>>>>> Am 21.07.2010 07:54, schrieb Young:
>>>>>     
>>>>>           
>>>>>> So based on the 1M dataset, the time spent in getAllOtherItems(userID) is among the 2 and 10 seconds. 
>>>>>> for example,
>>>>>> If one user rates 200 items and for each item, the time spent in calculating the neighbors is expected to 30ms. 
>>>>>> So that makes 6 seconds. It is generally okay. But if the dataset is expanded to 100M dataset, I think 30ms may grow up to 30 * 100 ms and that will be a long time. 
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>   
>>>>>>       
>>>>>>             
>>>>>>> It still seems strange to observe such a bottleneck, I'm not sure
>>>>>>> what's going on.
>>>>>>> You are using an in-memory model like GenericDataModel?
>>>>>>> We could look at ways to optimize that method, though it looks reasonably tight.
>>>>>>> Where within that method do you see time spent?
>>>>>>>
>>>>>>> 2010/7/20 Young <wo...@126.com>:
>>>>>>>     
>>>>>>>         
>>>>>>>               
>>>>>>>> Hi again,
>>>>>>>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>>>>>>>> Or is there other way to make the online-recommendation much faster?
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>       
>>>>>>>>           
>>>>>>>>                 
>>>>>>>>> Yes you probably want a new, separate table. You have an extra step of
>>>>>>>>> computing some notion of similarity anyway, and you probably want to
>>>>>>>>> separate this table from your main data table anyhow for reasons of
>>>>>>>>> performance and business logic separation.
>>>>>>>>>
>>>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>>>         
>>>>>>>>>             
>>>>>>>>>                   
>>>>>>>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>>>>>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>           
>>>>>>>>>>               
>>>>>>>>>>                     
>>>>>>>>>>> No, you need one table (or view if you like) containing all data. If
>>>>>>>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>>>>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>>>>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>>>>>>>
>>>>>>>>>>> If you mean, can you use a table with preferences with a model that
>>>>>>>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>>>>>>>
>>>>>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>>>>>             
>>>>>>>>>>>                 
>>>>>>>>>>>                       
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you
>>>>>>>>>>>>               
>>>>>>>>>>>>                   
>>>>>>>>>>>>                         
>>>>>>>>>>           
>>>>>>>>>>               
>>>>>>>>>>                     
>>>>>>>>       
>>>>>>>>           
>>>>>>>>                 
>>>>>     
>>>>>           
>>>       


Re:Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Hi again, 
I used a Java profiler and found that the latency comes from FastIDSet.addAll() and FastIDSet.add().
 
Below is the source code.
 
protected FastIDSet getAllOtherItems(long theUserID) throws TasteException {
  ......
      for (int j = 0; j < size2; j++) {
        possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(prefs2.getUserID(j)));
      }
  ......
}

public FastIDSet getItemIDsFromUser(long userID) throws TasteException {
  PreferenceArray prefs = getPreferencesFromUser(userID);
  int size = prefs.length();
  FastIDSet result = new FastIDSet(size);
  for (int i = 0; i < size; i++) {
    result.add(prefs.getItemID(i));
  }
  return result;
}
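
One possible way to reduce the add()/addAll() traffic shown above is sketched here. It is only an illustration under stated assumptions: plain java.util collections stand in for FastIDSet and PreferenceArray, and the class CandidateItems and its methods are hypothetical names, not Mahout API. The sketch adds item IDs straight into the result set instead of building a temporary set per user, and it skips users that were already visited through an earlier item.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an in-memory data model; none of this is Mahout API.
class CandidateItems {
  // userID -> item IDs the user has a preference for
  private final Map<Long, long[]> itemsByUser;
  // itemID -> user IDs that have a preference for the item
  private final Map<Long, List<Long>> usersByItem = new HashMap<>();

  CandidateItems(Map<Long, long[]> itemsByUser) {
    this.itemsByUser = itemsByUser;
    // Build the inverse index once, up front.
    for (Map.Entry<Long, long[]> entry : itemsByUser.entrySet()) {
      for (long itemID : entry.getValue()) {
        usersByItem.computeIfAbsent(itemID, k -> new ArrayList<>()).add(entry.getKey());
      }
    }
  }

  // Items co-preferred with the user's items, minus the user's own items.
  Set<Long> allOtherItems(long userID) {
    long[] ownItems = itemsByUser.getOrDefault(userID, new long[0]);
    Set<Long> candidates = new HashSet<>();
    Set<Long> seenUsers = new HashSet<>();
    for (long itemID : ownItems) {
      for (long otherUser : usersByItem.get(itemID)) {
        // A user who shares several items would otherwise be merged several times.
        if (!seenUsers.add(otherUser)) {
          continue;
        }
        // Add IDs directly from the array: no temporary per-user set, no addAll().
        for (long otherItem : itemsByUser.get(otherUser)) {
          candidates.add(otherItem);
        }
      }
    }
    for (long itemID : ownItems) {
      candidates.remove(itemID);
    }
    return candidates;
  }

  public static void main(String[] args) {
    Map<Long, long[]> data = new HashMap<>();
    data.put(1L, new long[] {10, 11});
    data.put(2L, new long[] {10, 12});
    data.put(3L, new long[] {11, 12, 13});
    System.out.println(new CandidateItems(data).allOtherItems(1L));
  }
}
```

Whether this helps in practice depends on how often the same users are reached through different items; it would need to be profiled against the real data.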
 
 

 




>Yes, I use the genericdatamodel which is in-memory. 
>
>
>
>
>
>>Is only the similarity matrix in-memory? The crucial thing here is the
>>data model not the similarity matrix, are you using an in-memory data model?
>>
>>On 21.07.2010 08:33, Young wrote:
>>> Yes, I am pretty sure. I have stored the similarity matrix in-memory and I print out the time spent in getAllOtherItems() and this is the only one time-consuming method in the recommendation. My laptop CPU is Intel P8600 2.4G, and the memory used for JVM is 1GB. 
>>>
>>>
>>>
>>>
>>>
>>>   
>>>> Hi Young,
>>>>
>>>> I would disagree that a response time of 6 seconds is OK for online
>>>> recommendations, the time should be something like < 100ms.
>>>> I'm really surprised that you would see such response times with an
>>>> in-memory data model, I have experience with in-memory models of roughly
>>>> the same size
>>>> and usually the computations are blazingly fast.
>>>>
>>>> Are you absolutely sure that the time is spent in this method and not
>>>> later in the similarity computation?
>>>>
>>>> --sebastian
>>>>
>>>> On 21.07.2010 07:54, Young wrote:
>>>>     
>>>>> So based on the 1M dataset, the time spent in getAllOtherItems(userID) is among the 2 and 10 seconds. 
>>>>> for example,
>>>>> If one user rates 200 items and for each item, the time spent in calculating the neighbors is expected to 30ms. 
>>>>> So that makes 6 seconds. It is generally okay. But if the dataset is expanded to 100M dataset, I think 30ms may grow up to 30 * 100 ms and that will be a long time. 
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   
>>>>>       
>>>>>> It still seems strange to observe such a bottleneck, I'm not sure
>>>>>> what's going on.
>>>>>> You are using an in-memory model like GenericDataModel?
>>>>>> We could look at ways to optimize that method, though it looks reasonably tight.
>>>>>> Where within that method do you see time spent?
>>>>>>
>>>>>> 2010/7/20 Young <wo...@126.com>:
>>>>>>     
>>>>>>         
>>>>>>> Hi again,
>>>>>>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>>>>>>> Or is there other way to make the online-recommendation much faster?
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>       
>>>>>>>           
>>>>>>>> Yes you probably want a new, separate table. You have an extra step of
>>>>>>>> computing some notion of similarity anyway, and you probably want to
>>>>>>>> separate this table from your main data table anyhow for reasons of
>>>>>>>> performance and business logic separation.
>>>>>>>>
>>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>>         
>>>>>>>>             
>>>>>>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>>>>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>           
>>>>>>>>>               
>>>>>>>>>> No, you need one table (or view if you like) containing all data. If
>>>>>>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>>>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>>>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>>>>>>
>>>>>>>>>> If you mean, can you use a table with preferences with a model that
>>>>>>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>>>>>>
>>>>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>>>>             
>>>>>>>>>>                 
>>>>>>>>>>> Hi,
>>>>>>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>>               
>>>>>>>>>>>                   
>>>>>>>>>           
>>>>>>>>>               
>>>>>>>       
>>>>>>>           
>>>>     
>>

Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Yes, I use GenericDataModel, which is in-memory.





>Is only the similarity matrix in-memory? The crucial thing here is the
>data model not the similarity matrix, are you using an in-memory data model?
>
>On 21.07.2010 08:33, Young wrote:
>> Yes, I am pretty sure. I have stored the similarity matrix in-memory and I print out the time spent in getAllOtherItems() and this is the only one time-consuming method in the recommendation. My laptop CPU is Intel P8600 2.4G, and the memory used for JVM is 1GB. 
>>
>>
>>
>>
>>
>>   
>>> Hi Young,
>>>
>>> I would disagree that a response time of 6 seconds is OK for online
>>> recommendations, the time should be something like < 100ms.
>>> I'm really surprised that you would see such response times with an
>>> in-memory data model, I have experience with in-memory models of roughly
>>> the same size
>>> and usually the computations are blazingly fast.
>>>
>>> Are you absolutely sure that the time is spent in this method and not
>>> later in the similarity computation?
>>>
>>> --sebastian
>>>
>>> On 21.07.2010 07:54, Young wrote:
>>>     
>>>> So based on the 1M dataset, the time spent in getAllOtherItems(userID) is among the 2 and 10 seconds. 
>>>> for example,
>>>> If one user rates 200 items and for each item, the time spent in calculating the neighbors is expected to 30ms. 
>>>> So that makes 6 seconds. It is generally okay. But if the dataset is expanded to 100M dataset, I think 30ms may grow up to 30 * 100 ms and that will be a long time. 
>>>>
>>>>
>>>>
>>>>
>>>>   
>>>>       
>>>>> It still seems strange to observe such a bottleneck, I'm not sure
>>>>> what's going on.
>>>>> You are using an in-memory model like GenericDataModel?
>>>>> We could look at ways to optimize that method, though it looks reasonably tight.
>>>>> Where within that method do you see time spent?
>>>>>
>>>>> 2010/7/20 Young <wo...@126.com>:
>>>>>     
>>>>>         
>>>>>> Hi again,
>>>>>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>>>>>> Or is there other way to make the online-recommendation much faster?
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>       
>>>>>>           
>>>>>>> Yes you probably want a new, separate table. You have an extra step of
>>>>>>> computing some notion of similarity anyway, and you probably want to
>>>>>>> separate this table from your main data table anyhow for reasons of
>>>>>>> performance and business logic separation.
>>>>>>>
>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>         
>>>>>>>             
>>>>>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>>>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>           
>>>>>>>>               
>>>>>>>>> No, you need one table (or view if you like) containing all data. If
>>>>>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>>>>>
>>>>>>>>> If you mean, can you use a table with preferences with a model that
>>>>>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>>>>>
>>>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>>>             
>>>>>>>>>                 
>>>>>>>>>> Hi,
>>>>>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>>>>>
>>>>>>>>>> Thank you
>>>>>>>>>>               
>>>>>>>>>>                   
>>>>>>>>           
>>>>>>>>               
>>>>>>       
>>>>>>           
>>>     
>

Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
Is only the similarity matrix in memory? The crucial thing here is the
data model, not the similarity matrix. Are you using an in-memory data model?

On 21.07.2010 08:33, Young wrote:
> Yes, I am pretty sure. I have stored the similarity matrix in-memory and I print out the time spent in getAllOtherItems() and this is the only one time-consuming method in the recommendation. My laptop CPU is Intel P8600 2.4G, and the memory used for JVM is 1GB. 
>
>
>
>
>
>   
>> Hi Young,
>>
>> I would disagree that a response time of 6 seconds is OK for online
>> recommendations, the time should be something like < 100ms.
>> I'm really surprised that you would see such response times with an
>> in-memory data model, I have experience with in-memory models of roughly
>> the same size
>> and usually the computations are blazingly fast.
>>
>> Are you absolutely sure that the time is spent in this method and not
>> later in the similarity computation?
>>
>> --sebastian
>>
>> On 21.07.2010 07:54, Young wrote:
>>     
>>> So based on the 1M dataset, the time spent in getAllOtherItems(userID) is among the 2 and 10 seconds. 
>>> for example,
>>> If one user rates 200 items and for each item, the time spent in calculating the neighbors is expected to 30ms. 
>>> So that makes 6 seconds. It is generally okay. But if the dataset is expanded to 100M dataset, I think 30ms may grow up to 30 * 100 ms and that will be a long time. 
>>>
>>>
>>>
>>>
>>>   
>>>       
>>>> It still seems strange to observe such a bottleneck, I'm not sure
>>>> what's going on.
>>>> You are using an in-memory model like GenericDataModel?
>>>> We could look at ways to optimize that method, though it looks reasonably tight.
>>>> Where within that method do you see time spent?
>>>>
>>>> 2010/7/20 Young <wo...@126.com>:
>>>>     
>>>>         
>>>>> Hi again,
>>>>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>>>>> Or is there other way to make the online-recommendation much faster?
>>>>> Thank you.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>       
>>>>>           
>>>>>> Yes you probably want a new, separate table. You have an extra step of
>>>>>> computing some notion of similarity anyway, and you probably want to
>>>>>> separate this table from your main data table anyhow for reasons of
>>>>>> performance and business logic separation.
>>>>>>
>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>         
>>>>>>             
>>>>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>           
>>>>>>>               
>>>>>>>> No, you need one table (or view if you like) containing all data. If
>>>>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>>>>
>>>>>>>> If you mean, can you use a table with preferences with a model that
>>>>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>>>>
>>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> Hi,
>>>>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>           
>>>>>>>               
>>>>>       
>>>>>           
>>     


Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Yes, I am pretty sure. I keep the similarity matrix in memory, and when I print out the time spent in getAllOtherItems(), it is the only time-consuming method in the recommendation. My laptop CPU is an Intel P8600 at 2.4 GHz, and the JVM has 1 GB of memory.
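
A minimal sketch of how such a timing could be taken, not the code actually used in this thread: it runs the call a few times untimed so the JIT compiler can warm up, then reports the median of several timed runs. The class CallTimer and the method medianNanos are hypothetical names, and the Runnable stands in for whatever call is being measured.

```java
import java.util.Arrays;

// Hypothetical helper for timing a single call; names are illustrative only.
class CallTimer {

  // Runs the task `warmups` times untimed, then `runs` times timed,
  // and returns the median elapsed time in nanoseconds.
  static long medianNanos(Runnable task, int warmups, int runs) {
    if (runs <= 0) {
      throw new IllegalArgumentException("runs must be positive");
    }
    for (int i = 0; i < warmups; i++) {
      task.run();
    }
    long[] samples = new long[runs];
    for (int i = 0; i < runs; i++) {
      long start = System.nanoTime();
      task.run();
      samples[i] = System.nanoTime() - start;
    }
    Arrays.sort(samples);
    return samples[runs / 2];
  }

  public static void main(String[] args) {
    // Placeholder workload; in practice this would be the call under test.
    Runnable task = () -> {
      long sum = 0;
      for (int i = 0; i < 1_000_000; i++) {
        sum += i;
      }
      if (sum < 0) {
        throw new IllegalStateException("unexpected overflow");
      }
    };
    long ns = medianNanos(task, 20, 101);
    System.out.println("median: " + ns / 1_000_000.0 + " ms");
  }
}
```

A single printed timestamp difference can be dominated by class loading and JIT compilation on the first call, which is why the sketch discards the warm-up runs.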





>Hi Young,
>
>I would disagree that a response time of 6 seconds is OK for online
>recommendations, the time should be something like < 100ms.
>I'm really surprised that you would see such response times with an
>in-memory data model, I have experience with in-memory models of roughly
>the same size
>and usually the computations are blazingly fast.
>
>Are you absolutely sure that the time is spent in this method and not
>later in the similarity computation?
>
>--sebastian
>
>On 21.07.2010 07:54, Young wrote:
>> So based on the 1M dataset, the time spent in getAllOtherItems(userID) is among the 2 and 10 seconds. 
>> for example,
>> If one user rates 200 items and for each item, the time spent in calculating the neighbors is expected to 30ms. 
>> So that makes 6 seconds. It is generally okay. But if the dataset is expanded to 100M dataset, I think 30ms may grow up to 30 * 100 ms and that will be a long time. 
>>
>>
>>
>>
>>   
>>> It still seems strange to observe such a bottleneck, I'm not sure
>>> what's going on.
>>> You are using an in-memory model like GenericDataModel?
>>> We could look at ways to optimize that method, though it looks reasonably tight.
>>> Where within that method do you see time spent?
>>>
>>> 2010/7/20 Young <wo...@126.com>:
>>>     
>>>> Hi again,
>>>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>>>> Or is there other way to make the online-recommendation much faster?
>>>> Thank you.
>>>>
>>>>
>>>>
>>>>
>>>>       
>>>>> Yes you probably want a new, separate table. You have an extra step of
>>>>> computing some notion of similarity anyway, and you probably want to
>>>>> separate this table from your main data table anyhow for reasons of
>>>>> performance and business logic separation.
>>>>>
>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>         
>>>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>           
>>>>>>> No, you need one table (or view if you like) containing all data. If
>>>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>>>
>>>>>>> If you mean, can you use a table with preferences with a model that
>>>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>>>
>>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>>             
>>>>>>>> Hi,
>>>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>>>
>>>>>>>> Thank you
>>>>>>>>               
>>>>>>           
>>>>       
>

Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Young,

I would disagree that a response time of 6 seconds is OK for online
recommendations; the time should be something like < 100 ms.
I'm really surprised that you would see such response times with an
in-memory data model. I have experience with in-memory models of roughly
the same size, and usually the computations are blazingly fast.

Are you absolutely sure that the time is spent in this method and not
later in the similarity computation?

--sebastian

On 21.07.2010 07:54, Young wrote:
> So based on the 1M dataset, the time spent in getAllOtherItems(userID) is among the 2 and 10 seconds. 
> for example,
> If one user rates 200 items and for each item, the time spent in calculating the neighbors is expected to 30ms. 
> So that makes 6 seconds. It is generally okay. But if the dataset is expanded to 100M dataset, I think 30ms may grow up to 30 * 100 ms and that will be a long time. 
>
>
>
>
>   
>> It still seems strange to observe such a bottleneck, I'm not sure
>> what's going on.
>> You are using an in-memory model like GenericDataModel?
>> We could look at ways to optimize that method, though it looks reasonably tight.
>> Where within that method do you see time spent?
>>
>> 2010/7/20 Young <wo...@126.com>:
>>     
>>> Hi again,
>>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>>> Or is there other way to make the online-recommendation much faster?
>>> Thank you.
>>>
>>>
>>>
>>>
>>>       
>>>> Yes you probably want a new, separate table. You have an extra step of
>>>> computing some notion of similarity anyway, and you probably want to
>>>> separate this table from your main data table anyhow for reasons of
>>>> performance and business logic separation.
>>>>
>>>> 2010/7/19 Young <wo...@126.com>:
>>>>         
>>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>>> No, you need one table (or view if you like) containing all data. If
>>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>>
>>>>>> If you mean, can you use a table with preferences with a model that
>>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>>
>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>             
>>>>>>> Hi,
>>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>>
>>>>>>> Thank you
>>>>>>>               
>>>>>           
>>>       


Re:Re: Re: Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
On the 1M dataset, the time spent in getAllOtherItems(userID) is between 2 and 10 seconds.
For example, if a user has rated 200 items and calculating the neighbors of each item takes about 30 ms, that makes 6 seconds.
That is generally okay. But if the dataset grows to 100M, I think the 30 ms may grow to 30 * 100 ms, and that will be a long time.
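
The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in the message: the linear growth with dataset size is the poster's guess rather than a measurement, and the class and method names are hypothetical.

```java
// Back-of-the-envelope latency estimate; all names are illustrative only.
class LatencyEstimate {

  // ratedItems * perItemMillis, scaled by an assumed slowdown factor.
  static long estimateMillis(int ratedItems, long perItemMillis, long slowdownFactor) {
    return (long) ratedItems * perItemMillis * slowdownFactor;
  }

  public static void main(String[] args) {
    // 200 items at 30 ms each on the 1M dataset.
    System.out.println(estimateMillis(200, 30, 1));   // prints 6000
    // Same user if the per-item cost really grew 100-fold on a 100M dataset.
    System.out.println(estimateMillis(200, 30, 100)); // prints 600000
  }
}
```

Under that assumption the per-request time would go from 6 seconds to 600 seconds, that is, 10 minutes.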




>It still seems strange to observe such a bottleneck, I'm not sure
>what's going on.
>You are using an in-memory model like GenericDataModel?
>We could look at ways to optimize that method, though it looks reasonably tight.
>Where within that method do you see time spent?
>
>2010/7/20 Young <wo...@126.com>:
>> Hi again,
>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>> Or is there other way to make the online-recommendation much faster?
>> Thank you.
>>
>>
>>
>>
>>>Yes you probably want a new, separate table. You have an extra step of
>>>computing some notion of similarity anyway, and you probably want to
>>>separate this table from your main data table anyhow for reasons of
>>>performance and business logic separation.
>>>
>>>2010/7/19 Young <wo...@126.com>:
>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>
>>>>
>>>>
>>>>
>>>>>No, you need one table (or view if you like) containing all data. If
>>>>>you can't do this, you could write your own copy of a JDBCDataModel
>>>>>that can query multiple tables, or, that changes its SQL queries to
>>>>>use UNION statements. I imagine it will slow down a lot.
>>>>>
>>>>>If you mean, can you use a table with preferences with a model that
>>>>>ignores preferences, sure you can. The extra column is ignored.
>>>>>
>>>>>2010/7/19 Young <wo...@126.com>:
>>>>>> Hi,
>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>
>>>>>> Thank you
>>>>
>>

Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Hi Sebastian,
Thank you. My observation is based on an in-memory data model.
 
----Young





>Hi Young,
>
>I guess you are using a jdbc-based DataModel.
>
>If I understand getAllOtherItemIDs(...) correctly, it's doing the following:
>
>for each item i preferred by the current user
>  for each user u preferring i
>    fetch other items preferred by u
>
>This method could issue a lot of database queries if you encounter items
>that are preferred by a lot of users. You could check whether you have
>set the correct indexes on the database tables and generally check your
>database connectivity (do the calls have to go over the network?), but I
>think the best thing would be to use an in-memory data model, which
>should not be a problem with 10M preferences.
>
>--sebastian
>
>
>On 20.07.2010 20:04, Sean Owen wrote:
>> It still seems strange to observe such a bottleneck, I'm not sure
>> what's going on.
>> You are using an in-memory model like GenericDataModel?
>> We could look at ways to optimize that method, though it looks reasonably tight.
>> Where within that method do you see time spent?
>>
>> 2010/7/20 Young <wo...@126.com>:
>>   
>>> Hi again,
>>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>>> Or is there other way to make the online-recommendation much faster?
>>> Thank you.
>>>
>>>
>>>
>>>
>>>     
>>>> Yes you probably want a new, separate table. You have an extra step of
>>>> computing some notion of similarity anyway, and you probably want to
>>>> separate this table from your main data table anyhow for reasons of
>>>> performance and business logic separation.
>>>>
>>>> 2010/7/19 Young <wo...@126.com>:
>>>>       
>>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>         
>>>>>> No, you need one table (or view if you like) containing all data. If
>>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>>
>>>>>> If you mean, can you use a table with preferences with a model that
>>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>>
>>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>>           
>>>>>>> Hi,
>>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>>
>>>>>>> Thank you
>>>>>>>             
>>>>>         
>>>     
>

Re: How to combine boolean datamodel with datamodel

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Young,

I guess you are using a JDBC-based DataModel.

If I understand getAllOtherItems(...) correctly, it's doing the following:

for each item i preferred by the current user
  for each user u preferring i
    fetch other items preferred by u

This method could issue a lot of database queries if you encounter items
that are preferred by a lot of users. You could check whether you have
set the correct indexes on the database tables and generally check your
database connectivity (do the calls have to go over the network?), but I
think the best thing would be to use an in-memory data model, which
should not be a problem with 10M preferences.
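
The nested loop described above can be sketched in plain Java to show how quickly the number of data-model lookups grows. This is an illustration only: java.util maps stand in for the data model, the class NestedLoopCost and its methods are hypothetical names rather than Mahout API, and each counted lookup is assumed to correspond to one query in a database-backed model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a data model that counts its own lookups.
class NestedLoopCost {
  private final Map<Long, List<Long>> itemsByUser;
  private final Map<Long, List<Long>> usersByItem = new HashMap<>();
  int lookups; // each lookup would be one query against a database-backed model

  NestedLoopCost(Map<Long, List<Long>> itemsByUser) {
    this.itemsByUser = itemsByUser;
    for (Map.Entry<Long, List<Long>> entry : itemsByUser.entrySet()) {
      for (long itemID : entry.getValue()) {
        usersByItem.computeIfAbsent(itemID, k -> new ArrayList<>()).add(entry.getKey());
      }
    }
  }

  private List<Long> itemsOf(long userID) {
    lookups++;
    return itemsByUser.getOrDefault(userID, Collections.emptyList());
  }

  private List<Long> usersOf(long itemID) {
    lookups++;
    return usersByItem.getOrDefault(itemID, Collections.emptyList());
  }

  Set<Long> allOtherItems(long userID) {
    Set<Long> result = new HashSet<>();
    List<Long> ownItems = itemsOf(userID);
    for (long itemID : ownItems) {          // for each item i preferred by the current user
      for (long other : usersOf(itemID)) {  //   for each user u preferring i
        result.addAll(itemsOf(other));      //     fetch other items preferred by u
      }
    }
    result.removeAll(ownItems);
    return result;
  }

  public static void main(String[] args) {
    Map<Long, List<Long>> data = new HashMap<>();
    data.put(1L, List.of(10L, 11L));
    data.put(2L, List.of(10L, 12L));
    data.put(3L, List.of(11L, 12L, 13L));
    NestedLoopCost model = new NestedLoopCost(data);
    System.out.println(model.allOtherItems(1L) + " after " + model.lookups + " lookups");
  }
}
```

Even in this three-user toy data set, one request for user 1 performs 7 lookups; with popular items the inner loop runs once per user who preferred the item, which is why the per-lookup cost matters so much.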

--sebastian


On 20.07.2010 20:04, Sean Owen wrote:
> It still seems strange to observe such a bottleneck, I'm not sure
> what's going on.
> You are using an in-memory model like GenericDataModel?
> We could look at ways to optimize that method, though it looks reasonably tight.
> Where within that method do you see time spent?
>
> 2010/7/20 Young <wo...@126.com>:
>   
>> Hi again,
>> When I do the itembased recommendation, I find there are some latency in getAllOtherItems(long userID). Because it is calculating the items' neighbors and merge these neighbors together. So I am thinking if I precompute each item's neighbors and store in the database, then when I getAllOtherItems(), I could merge these neighbors directly. Is this useful for reducing the latency?
>> Or is there other way to make the online-recommendation much faster?
>> Thank you.
>>
>>
>>
>>
>>     
>>> Yes you probably want a new, separate table. You have an extra step of
>>> computing some notion of similarity anyway, and you probably want to
>>> separate this table from your main data table anyhow for reasons of
>>> performance and business logic separation.
>>>
>>> 2010/7/19 Young <wo...@126.com>:
>>>       
>>>> So my prpblem is that I want to build datamodel based on what user has bought or added to their favorite or rated.
>>>> You mean I need a table describe all these user behavior. For example, if user buys one item, I guess the user preference is 4 and add into this table?
>>>>
>>>>
>>>>
>>>>
>>>>         
>>>>> No, you need one table (or view if you like) containing all data. If
>>>>> you can't do this, you could write your own copy of a JDBCDataModel
>>>>> that can query multiple tables, or, that changes its SQL queries to
>>>>> use UNION statements. I imagine it will slow down a lot.
>>>>>
>>>>> If you mean, can you use a table with preferences with a model that
>>>>> ignores preferences, sure you can. The extra column is ignored.
>>>>>
>>>>> 2010/7/19 Young <wo...@126.com>:
>>>>>           
>>>>>> Hi,
>>>>>> I have three tables, one is with preference and another two are without preference. Does mahout have some algorithm to integret these tables into one datamodel?
>>>>>>
>>>>>> Thank you
>>>>>>             
>>>>         
>>     


Re: Re: Re: How to combine boolean datamodel with datamodel

Posted by Sean Owen <sr...@gmail.com>.
It still seems strange to observe such a bottleneck; I'm not sure
what's going on.
Are you using an in-memory model like GenericDataModel?
We could look at ways to optimize that method, though it looks reasonably tight.
Where within that method do you see time spent?

2010/7/20 Young <wo...@126.com>:
> Hi again,
> When I do item-based recommendation, I see some latency in getAllOtherItems(long userID), because it computes the neighbors of each of the user's items and merges them. So I am thinking of precomputing each item's neighbors and storing them in the database; then getAllOtherItems() could merge the stored neighbors directly. Would this help reduce the latency?
> Or is there another way to make online recommendation much faster?
> Thank you.
>
>
>
>
>>Yes you probably want a new, separate table. You have an extra step of
>>computing some notion of similarity anyway, and you probably want to
>>separate this table from your main data table anyhow for reasons of
>>performance and business logic separation.
>>
>>2010/7/19 Young <wo...@126.com>:
>>> So my problem is that I want to build a data model based on what users have bought, added to their favorites, or rated.
>>> Do you mean I need a table that describes all of this user behavior? For example, if a user buys an item, I assume a preference of 4 and add that row to this table?
>>>
>>>
>>>
>>>
>>>>No, you need one table (or view if you like) containing all data. If
>>>>you can't do this, you could write your own copy of a JDBCDataModel
>>>>that can query multiple tables, or, that changes its SQL queries to
>>>>use UNION statements. I imagine it will slow down a lot.
>>>>
>>>>If you mean, can you use a table with preferences with a model that
>>>>ignores preferences, sure you can. The extra column is ignored.
>>>>
>>>>2010/7/19 Young <wo...@126.com>:
>>>>> Hi,
>>>>> I have three tables: one with preferences and two without. Does Mahout have an algorithm to integrate these tables into one data model?
>>>>>
>>>>> Thank you
>>>
>

Re:Re: Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
Hi again,
When I do item-based recommendation, I see some latency in getAllOtherItems(long userID), because it computes the neighbors of each of the user's items and merges them. So I am thinking of precomputing each item's neighbors and storing them in the database; then getAllOtherItems() could merge the stored neighbors directly. Would this help reduce the latency?
Or is there another way to make online recommendation much faster?
Thank you.
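
The precomputation idea above could be sketched roughly as follows. This is only an illustration, not Mahout code: NeighborCache and candidatesFor are hypothetical names, and "neighbor" is assumed here to mean any item that shares at least one user with the given item. The offline step builds the neighbor sets once; the online step only unions the cached sets.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: caches co-occurrence neighbors per item.
class NeighborCache {

    // itemID -> IDs of items that share at least one user with it
    private final Map<Long, Set<Long>> neighbors = new HashMap<>();

    // Offline step: userItems maps userID -> items that user interacted with.
    NeighborCache(Map<Long, Set<Long>> userItems) {
        for (Set<Long> items : userItems.values()) {
            for (Long a : items) {
                Set<Long> n = neighbors.computeIfAbsent(a, k -> new HashSet<>());
                for (Long b : items) {
                    if (!a.equals(b)) {
                        n.add(b);
                    }
                }
            }
        }
    }

    // Online step: union the cached neighbor sets of the user's items,
    // then drop the items the user already has.
    Set<Long> candidatesFor(Set<Long> itemsOfUser) {
        Set<Long> candidates = new TreeSet<>();
        for (Long item : itemsOfUser) {
            candidates.addAll(neighbors.getOrDefault(item, Collections.emptySet()));
        }
        candidates.removeAll(itemsOfUser);
        return candidates;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> userItems = new HashMap<>();
        userItems.put(1L, new HashSet<>(Arrays.asList(10L, 11L)));
        userItems.put(2L, new HashSet<>(Arrays.asList(11L, 12L)));
        NeighborCache cache = new NeighborCache(userItems);
        System.out.println(cache.candidatesFor(userItems.get(1L)));
    }
}
```

Whether this actually reduces latency depends on where the time is spent; the cache also has to be rebuilt or updated when new preferences arrive.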




>Yes you probably want a new, separate table. You have an extra step of
>computing some notion of similarity anyway, and you probably want to
>separate this table from your main data table anyhow for reasons of
>performance and business logic separation.
>
>2010/7/19 Young <wo...@126.com>:
>> So my problem is that I want to build a data model based on what users have bought, added to their favorites, or rated.
>> Do you mean I need a table that describes all of this user behavior? For example, if a user buys an item, I assume a preference of 4 and add that row to this table?
>>
>>
>>
>>
>>>No, you need one table (or view if you like) containing all data. If
>>>you can't do this, you could write your own copy of a JDBCDataModel
>>>that can query multiple tables, or, that changes its SQL queries to
>>>use UNION statements. I imagine it will slow down a lot.
>>>
>>>If you mean, can you use a table with preferences with a model that
>>>ignores preferences, sure you can. The extra column is ignored.
>>>
>>>2010/7/19 Young <wo...@126.com>:
>>>> Hi,
>>>> I have three tables: one with preferences and two without. Does Mahout have an algorithm to integrate these tables into one data model?
>>>>
>>>> Thank you
>>

Re: How to combine boolean datamodel with datamodel

Posted by Sean Owen <sr...@gmail.com>.
John, can you define this a bit more? It's an interesting area, and I want
to be able to give more targeted responses if I can (same for others).
Are you trying to evaluate how "good" your mapping of user behavior to
implicit rating is? Or something else?

On Wed, Dec 22, 2010 at 9:25 AM, John Hurliman <jh...@cull.tv> wrote:
> I am also interested in the problem of testing a model where implicit
> feedback is converted to explicit ratings. Please let me know if you
> find any research/work in this area.

Re: How to combine boolean datamodel with datamodel

Posted by John Hurliman <jh...@cull.tv>.
On Thu, Dec 16, 2010 at 8:55 PM, gabeweb <ga...@htc.com> wrote:
>
> Hi, I'm interested in following up on this question about combining boolean
> and non-boolean data.  Young and Sean mentioned two ways of doing this:
>
> (1) Assign each boolean data point an arbitrary rating, such as 4 out of 5.
>
> (2) Assume that all of the data is boolean, i.e. ignore the explicit
> ratings.
>
> Are there any other ways of doing this that folks have found to work well?
> I could imagine, for example, that instead of assigning each boolean data
> point an arbitrary rating (such as 4), one could assign it the average of
> the non-boolean ratings.
>
> The problem with any method of combining boolean and non-boolean data is
> that it can't be tested objectively, because it is a means of constructing a
> dataset -- including the test data!  So it seems that only a subjective
> evaluation could differentiate among different options.  I'm hoping that
> someone else has already done something along these lines.
>
> Thanks much in advance.
> --

I am also interested in the problem of testing a model where implicit
feedback is converted to explicit ratings. Please let me know if you
find any research/work in this area.

Re: How to combine boolean datamodel with datamodel

Posted by gabeweb <ga...@htc.com>.
Hi, I'm interested in following up on this question about combining boolean
and non-boolean data.  Young and Sean mentioned two ways of doing this:

(1) Assign each boolean data point an arbitrary rating, such as 4 out of 5.

(2) Assume that all of the data is boolean, i.e. ignore the explicit
ratings.

Are there any other ways of doing this that folks have found to work well? 
I could imagine, for example, that instead of assigning each boolean data
point an arbitrary rating (such as 4), one could assign it the average of
the non-boolean ratings.

The problem with any method of combining boolean and non-boolean data is
that it can't be tested objectively, because it is a means of constructing a
dataset -- including the test data!  So it seems that only a subjective
evaluation could differentiate among different options.  I'm hoping that
someone else has already done something along these lines.

Thanks much in advance.
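
The mean-based variant described above might look like the sketch below. It is an assumption-laden illustration, not anything from Mahout: MeanImputer, meanRating, and combine are hypothetical names, and the mean is taken over all explicit ratings rather than per user.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not a Mahout API: assigns boolean events the mean explicit rating.
class MeanImputer {

    // Mean of the explicit ratings; returns the fallback when there are none.
    static double meanRating(List<Double> explicitRatings, double fallback) {
        if (explicitRatings.isEmpty()) {
            return fallback;
        }
        double sum = 0.0;
        for (double r : explicitRatings) {
            sum += r;
        }
        return sum / explicitRatings.size();
    }

    // Returns the explicit ratings followed by one imputed rating per boolean event.
    static List<Double> combine(List<Double> explicitRatings, int booleanEvents, double fallback) {
        double mean = meanRating(explicitRatings, fallback);
        List<Double> all = new ArrayList<>(explicitRatings);
        for (int i = 0; i < booleanEvents; i++) {
            all.add(mean);
        }
        return all;
    }

    public static void main(String[] args) {
        // Two explicit ratings plus two boolean events, with 4.0 as the fallback.
        System.out.println(combine(Arrays.asList(2.0, 4.0), 2, 4.0));
    }
}
```

The fallback only matters when no explicit ratings exist at all, in which case this reduces to option (1).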

Re: Re: How to combine boolean datamodel with datamodel

Posted by Sean Owen <sr...@gmail.com>.
Yes you probably want a new, separate table. You have an extra step of
computing some notion of similarity anyway, and you probably want to
separate this table from your main data table anyhow for reasons of
performance and business logic separation.
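
As a rough sketch of filling such a separate table, assuming the fixed value of 4 from the question and assuming that an explicit rating should win over an implied one (both are assumptions, and PreferenceMerger is a hypothetical class, not part of Mahout):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not a Mahout API: builds one unified preference table in memory.
class PreferenceMerger {

    // Assumed value for a purchase or favorite, following the "4" suggested in the question.
    static final float IMPLICIT_VALUE = 4.0f;

    // Key is "userID:itemID"; value is the preference to write to the unified table.
    private final Map<String, Float> table = new LinkedHashMap<>();

    // An explicit rating always overwrites an implied value.
    void addRating(long userID, long itemID, float rating) {
        table.put(userID + ":" + itemID, rating);
    }

    // A purchase or favorite only fills the slot if no value exists yet.
    void addBooleanEvent(long userID, long itemID) {
        table.putIfAbsent(userID + ":" + itemID, IMPLICIT_VALUE);
    }

    Map<String, Float> rows() {
        return table;
    }

    public static void main(String[] args) {
        PreferenceMerger merger = new PreferenceMerger();
        merger.addBooleanEvent(1L, 10L);   // bought item 10
        merger.addRating(1L, 10L, 2.0f);   // later rated it 2
        merger.addBooleanEvent(1L, 11L);   // added item 11 to favorites
        System.out.println(merger.rows());
    }
}
```

The resulting rows could then be written to the new table (or exported to a file) as userID, itemID, preference triples.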

2010/7/19 Young <wo...@126.com>:
> So my problem is that I want to build a data model based on what users have bought, added to their favorites, or rated.
> Do you mean I need a table that describes all of this user behavior? For example, if a user buys an item, I assume a preference of 4 and add that row to this table?
>
>
>
>
>>No, you need one table (or view if you like) containing all data. If
>>you can't do this, you could write your own copy of a JDBCDataModel
>>that can query multiple tables, or, that changes its SQL queries to
>>use UNION statements. I imagine it will slow down a lot.
>>
>>If you mean, can you use a table with preferences with a model that
>>ignores preferences, sure you can. The extra column is ignored.
>>
>>2010/7/19 Young <wo...@126.com>:
>>> Hi,
>>> I have three tables: one with preferences and two without. Does Mahout have an algorithm to integrate these tables into one data model?
>>>
>>> Thank you
>

Re:Re: How to combine boolean datamodel with datamodel

Posted by Young <wo...@126.com>.
So my problem is that I want to build a data model based on what users have bought, added to their favorites, or rated.
Do you mean I need a table that describes all of this user behavior? For example, if a user buys an item, I assume a preference of 4 and add that row to this table?




>No, you need one table (or view if you like) containing all data. If
>you can't do this, you could write your own copy of a JDBCDataModel
>that can query multiple tables, or, that changes its SQL queries to
>use UNION statements. I imagine it will slow down a lot.
>
>If you mean, can you use a table with preferences with a model that
>ignores preferences, sure you can. The extra column is ignored.
>
>2010/7/19 Young <wo...@126.com>:
>> Hi,
>> I have three tables: one with preferences and two without. Does Mahout have an algorithm to integrate these tables into one data model?
>>
>> Thank you

Re: How to combine boolean datamodel with datamodel

Posted by Sean Owen <sr...@gmail.com>.
No, you need one table (or view if you like) containing all data. If
you can't do this, you could write your own copy of a JDBCDataModel
that can query multiple tables, or, that changes its SQL queries to
use UNION statements. I imagine it will slow down a lot.

If you mean, can you use a table with preferences with a model that
ignores preferences, sure you can. The extra column is ignored.
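
To illustrate the second point, here is a minimal sketch of reading rows that carry a preference column while keeping only the user-item association. BooleanView and Row are hypothetical names used for illustration; this is not how any Mahout class is implemented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not a Mahout API: treats a table with preferences as boolean data.
class BooleanView {

    // One row of the unified table; preference may be null for boolean sources.
    static final class Row {
        final long userID;
        final long itemID;
        final Float preference;

        Row(long userID, long itemID, Float preference) {
            this.userID = userID;
            this.itemID = itemID;
            this.preference = preference;
        }
    }

    // userID -> set of itemIDs; the preference column is never read.
    static Map<Long, Set<Long>> toBoolean(List<Row> rows) {
        Map<Long, Set<Long>> byUser = new HashMap<>();
        for (Row row : rows) {
            byUser.computeIfAbsent(row.userID, k -> new TreeSet<>()).add(row.itemID);
        }
        return byUser;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1L, 10L, 5.0f),   // rated
                new Row(1L, 11L, null),   // bought, no rating
                new Row(2L, 10L, 3.0f));  // rated
        System.out.println(toBoolean(rows));
    }
}
```

Rows from all three tables can go through the same path, since only the user and item columns matter in this mode.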

2010/7/19 Young <wo...@126.com>:
> Hi,
> I have three tables: one with preferences and two without. Does Mahout have an algorithm to integrate these tables into one data model?
>
> Thank you