Posted to user@mahout.apache.org by Thomas Rewig <tr...@mufin.com> on 2009/07/10 11:03:52 UTC

Memory and Speed Questions for Item-Based-Recommender

Hello Taste-Community,

I have been testing mahout-taste (the Apache Mahout 0.1 release) for a
few weeks now - and I like it :-)!

I have created a working item-based recommender, and now I have some
questions about speed and memory
... maybe you can give me a hint about what I should improve.

   1. ItemCorrelation: I precompute all correlations for approximately
      100000 items and save them in a MySQL database if they correlate
      above a threshold, e.g. 0.95. Then I load the correlations into
      the recommender like this:

          // use the _precomputed_ ItemItemCorrelation

          String[] splittArray = null;
          String strLine = null;

          GenericItemSimilarity.ItemItemSimilarity aItemItemCorrelation = null;
          Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix =
              new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();

          // open the file of precomputed similarities:
          BufferedReader inStream = new BufferedReader(new FileReader(filePath));

          while ((strLine = inStream.readLine()) != null) {
              splittArray = strLine.split(",");

              Item aItem1 = new GenericItem<String>(splittArray[0]);
              Item aItem2 = new GenericItem<String>(splittArray[1]);

              aItemItemCorrelation = new GenericItemSimilarity.ItemItemSimilarity(
                  aItem1, aItem2, Double.parseDouble(splittArray[2]));
              correlationMatrix.add(aItemItemCorrelation);
          }
          inStream.close();
          …
          // set the ItemSimilarity:
          this.itemSimilarity = new GenericItemSimilarity(correlationMatrix);
          …
          // set the Recommender:
          recommender = new GenericItemBasedRecommender(super.getModel(), itemSimilarity);
          …
          // set the CachingRecommender:
          this.cachingRecommender = new CachingRecommender(recommender);

      Question 1:
      The similarity matrix uses 400MB of memory in the MySQL DB - by
      setting the ItemCorrelation, 8GB of RAM is used to load the
      similarity matrix as a GenericItemSimilarity. Is it
      possible/plausible that this matrix uses more than 20 times as much
      memory in RAM as in the database - or have I done something wrong?

      Question 2:
      How can I reduce the memory consumption of the
      GenericItemSimilarity? The constructor |GenericItemSimilarity
      <http://lucene.apache.org/mahout/javadoc/core/org/apache/mahout/cf/taste/impl/similarity/GenericItemSimilarity.html#GenericItemSimilarity%28java.lang.Iterable,%20int%29>(Iterable
      <http://java.sun.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true><GenericItemSimilarity.ItemItemSimilarity
      <http://lucene.apache.org/mahout/javadoc/core/org/apache/mahout/cf/taste/impl/similarity/GenericItemSimilarity.ItemItemSimilarity.html>> similarities,
      int maxToKeep)|
      doesn't work, because if maxToKeep is too small, the
      recommendations will be bad ...
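
      For reference, a minimal sketch of how that capped constructor is
      called (the cap of 1000000 is an arbitrary illustration, reusing
      the correlationMatrix built above):

          ItemSimilarity itemSimilarity =
              new GenericItemSimilarity(correlationMatrix, 1000000);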


   2. Speed of recommendation: I use a MySQLJDBCDataModel - MyISAM.
      Primary key and indexes are set:
      PRIMARY KEY (user_id, item_id), INDEX (user_id), INDEX (item_id).
      A recommendation for a user takes between 0.5 and 80 seconds - I
      would like it to take just 300ms.

    By the way, I use a quad-core 3.2 GHz machine with 32GB of RAM to
    compute the recommendations, so maybe the DB is the bottleneck. But
    if I use a FileDataModel it is faster, though not by much.

    Here's a log for a user with 2000 associated items:

    INFO  CollaborativeModel - Seconds to set ItemCorrelation: 76.187506 s
    INFO  CollaborativeModel - Seconds to set Recommender:
    0.025945000000000003 s
    INFO  CollaborativeModel - Seconds to set CachingRecommender: 0.06511 s
    INFO  CollaborativeController - SECONDS TO REFRESH THE SYSTEM:
    6.450000000000001E-4 s
    INFO  root - SECONDS TO GET A RECOMMENDATION FOR USER: 50.888347 s

    Question:
    Is there a way to increase the speed of a recommendation? (use
    InnoDB? compute fewer items ... somehow ;-) ...?)

So if you have any idea how I could reduce the memory consumption and
increase the recommendation speed, I would be very thankful.

best regards
Thomas

--
___________________________________________________________
Thomas Rewig
___________________________________________________________

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Thomas Rewig <tr...@mufin.com>.
Sean Owen wrote:
> On Tue, Jul 14, 2009 at 10:58 AM, Thomas Rewig<tr...@mufin.com> wrote:
>   
>> Because I need a UserSimilarity to precompute. Maybe I'm overlooking an
>> important detail, but this is how I would compute everything on the fly
>> in an item-based way:
>>
>>       *** "Pre"-Recommender ***
>>       // set the model for the ItemItemMatrix
>>      this.similarityModel = new MySQLJDBCDataModel(cPoolDS,
>> preferenceTableSim, userIDColumnSim, itemIDColumnSim, preferenceColumnSim);
>>       // set the "ItemSimilarity" for the ItemItemMatrix
>>      this.similarityItemSimilarity = new
>> EuclideanDistanceSimilarity(this.similarityModel);
>>       // set CachingSimilarity
>>      this.cachingItemSimilarity = new
>> CachingItemSimilarity(this.similarityItemSimilarity, this.similarityModel);
>>       *** Recommender ***
>>      // set the model for the Recommender
>>      this.model = new MySQLJDBCDataModel(cPoolDS, preferenceTable,
>> userIDColumn, itemIDColumn, preferenceColumn);
>>      // set the Recommender with the *cachingItemSimilarity*
>>      this.recommender = new GenericUserBasedRecommender(this.model,
>> this.cachingItemSimilarity);
>>     
>
> But why are you using a user-based recommender here? I thought you
> were using an item-based recommender, in the end, to produce actual
> recommendations. Yes, of course you do not plug an item-similarity
> metric into a user-based recommender.
>
> Use GenericItemBasedRecommender.
>   
Oh, I think I need a coffee ... in my code it is a
GenericItemBasedRecommender; I have to improve my copy & paste ...

this.recommender = new GenericItemBasedRecommender(this.model,this.cachingItemSimilarity);


Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
On Tue, Jul 14, 2009 at 10:58 AM, Thomas Rewig<tr...@mufin.com> wrote:
> Because I need a UserSimilarity to precompute. Maybe I'm overlooking an
> important detail, but this is how I would compute everything on the fly
> in an item-based way:
>
>       *** "Pre"-Recommender ***
>       // set the model for the ItemItemMatrix
>      this.similarityModel = new MySQLJDBCDataModel(cPoolDS,
> preferenceTableSim, userIDColumnSim, itemIDColumnSim, preferenceColumnSim);
>       // set the "ItemSimilarity" for the ItemItemMatrix
>      this.similarityItemSimilarity = new
> EuclideanDistanceSimilarity(this.similarityModel);
>       // set CachingSimilarity
>      this.cachingItemSimilarity = new
> CachingItemSimilarity(this.similarityItemSimilarity, this.similarityModel);
>       *** Recommender ***
>      // set the model for the Recommender
>      this.model = new MySQLJDBCDataModel(cPoolDS, preferenceTable,
> userIDColumn, itemIDColumn, preferenceColumn);
>      // set the Recommender with the *cachingItemSimilarity*
>      this.recommender = new GenericUserBasedRecommender(this.model,
> this.cachingItemSimilarity);

But why are you using a user-based recommender here? I thought you
were using an item-based recommender, in the end, to produce actual
recommendations. Yes, of course you do not plug an item-similarity
metric into a user-based recommender.

Use GenericItemBasedRecommender.

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Thomas Rewig <tr...@mufin.com>.
Sean Owen wrote:
> On Mon, Jul 13, 2009 at 2:48 PM, Thomas Rewig<tr...@mufin.com> wrote:
>   
>>        *** "Pre"-Recommender ***
>>       // set the model for the ItemItemMatrix
>>       this.similarityModel = new MySQLJDBCDataModel(cPoolDS,
>> preferenceTable, userIDColumn, itemIDColumn, preferenceColumn);
>>     
>
>   
>>       // set the model for the Recommender
>>       this.model = new MySQLJDBCDataModel(cPoolDS, preferenceTable,
>> userIDColumn, itemIDColumn, preferenceColumn);
>>     
>
> Hmm, this doesn't look right. You are using the exact same table and
> columns in both cases. I thought that above, your "users" were items
> and "items" were item attributes, and in the second case, your "users"
> are actual users and "items" are items.
>
>   
Oh sorry, I just copied the code snippets - actually it is all several
methods controlled by a controller ... the tables and columns are not
the same; I just wasn't attentive and didn't adjust this for the example.
My fault, sorry.

>>           // Cast to itemSimilarity because the "Users" in the
>> "Pre"-Recommender are Items
>>       this.cachingItemSimilarity = (CachingItemSimilarity)
>> this.cachingUserSimilarity;
>>     
>
> I don't understand why you are trying to do this. There are two
> classes available, CachingItemSimilarity and CachingUserSimilarity.
> There should be no need to force one to be the other.
Because I need a UserSimilarity to precompute. Maybe I'm overlooking an
important detail, but this is how I would compute everything on the fly
in an item-based way:

        *** "Pre"-Recommender ***
        // set the model for the ItemItemMatrix
       this.similarityModel = new MySQLJDBCDataModel(cPoolDS, 
preferenceTableSim, userIDColumnSim, itemIDColumnSim, preferenceColumnSim);
        // set the "ItemSimilarity" for the ItemItemMatrix
       this.similarityItemSimilarity = new 
EuclideanDistanceSimilarity(this.similarityModel);
        // set CachingSimilarity
       this.cachingItemSimilarity = new 
CachingItemSimilarity(this.similarityItemSimilarity, this.similarityModel);
        *** Recommender ***
       // set the model for the Recommender
       this.model = new MySQLJDBCDataModel(cPoolDS, preferenceTable, 
userIDColumn, itemIDColumn, preferenceColumn);
       // set the Recommender with the *cachingItemSimilarity*
       this.recommender = new GenericUserBasedRecommender(this.model, 
this.cachingItemSimilarity);

Here I get a recommender for the users based on the
*item-attribute* similarity from the pre-recommender model, and that
doesn't make sense, because I need the *item* similarity from the
pre-recommender model. I can only compute the item similarity of the
pre-recommender with a UserSimilarity, but I can't put a UserSimilarity
into a CachingItemSimilarity or into the ItemBasedRecommender. Where is
my mistake?

Anyway, as a first step I can precompute the item-item matrix with
Taste. And in a second step I can use that matrix and the user data to
recommend other items to a user. Doing all of that "on the fly" isn't
so important at the moment.

I think I will experiment a little with Lucene, as Ted suggested, to
increase the speed of step two. That sounds very interesting.

Thomas


Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
On Mon, Jul 13, 2009 at 2:48 PM, Thomas Rewig<tr...@mufin.com> wrote:
>        *** "Pre"-Recommender ***
>       // set the model for the ItemItemMatrix
>       this.similarityModel = new MySQLJDBCDataModel(cPoolDS,
> preferenceTable, userIDColumn, itemIDColumn, preferenceColumn);

>       // set the model for the Recommender
>       this.model = new MySQLJDBCDataModel(cPoolDS, preferenceTable,
> userIDColumn, itemIDColumn, preferenceColumn);

Hmm, this doesn't look right. You are using the exact same table and
columns in both cases. I thought that above, your "users" were items
and "items" were item attributes, and in the second case, your "users"
are actual users and "items" are items.


>           // Cast to itemSimilarity because the "Users" in the
> "Pre"-Recommender are Items
>       this.cachingItemSimilarity = (CachingItemSimilarity)
> this.cachingUserSimilarity;

I don't understand why you are trying to do this. There are two
classes available, CachingItemSimilarity and CachingUserSimilarity.
There should be no need to force one to be the other.


>   -->I know, that's not working, but that way I could compute the
> ItemSimilarity on the fly (as a UserSimilarity) and would not have to
> precompute all item-item correlations, export them and import them as
> GenericItemSimilarity.ItemItemSimilarity. (OK, with your new class I only
> have to compute and export them - the DB does the rest.)
> I hope you understand my idea - maybe there is a better way to do that
> and I just don't see it, because I'm not yet completely familiar
> with the Taste code.

If you do not want to precompute item-item similarities, then simply
don't! You do not have to. Use an ItemSimilarity implementation
directly. What is the problem with that?

Sean

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Thomas Rewig <tr...@mufin.com>.
OK, maybe I should talk in code to clarify my problem:

### That's what I do: ###

    First I set up the item-item matrix in a user-based way - here the
item-item correlations are computed "on the fly".
   
       *** "Pre"-Recommender ***
        // set the model for the ItemItemMatrix
        this.similarityModel = new MySQLJDBCDataModel(cPoolDS, 
preferenceTable, userIDColumn, itemIDColumn, preferenceColumn);
       
        // set the "UserSimilarity" for the ItemItemMatrix
        this.similarityUserSimilarity = new 
EuclideanDistanceSimilarity(this.similarityModel);
       
        // set CachingSimilarity
        this.cachingUserSimilarity = new 
CachingUserSimilarity(this.similarityUserSimilarity, this.similarityModel);
       
       *** Recommender ***
        // set the model for the Recommender
        this.model = new MySQLJDBCDataModel(cPoolDS, preferenceTable, 
userIDColumn, itemIDColumn, preferenceColumn);
      
        // set the userSimilarity for the Neighborhood
        this.userSimilarity = new PearsonCorrelationSimilarity( 
this.model );
       
        //set Neighborhood
        this.neighborhood = new NearestNUserNeighborhood( 
numNeighborhood, this.userSimilarity, this.model);
       
        // set the Recommender with the *cachingUserSimilarity*
        this.recommender = new GenericUserBasedRecommender(this.model, 
this.neighborhood, this.cachingUserSimilarity);

       --> I think I mixed up the UserSimilarity for the neighborhood
in the recommender (based on users) and the UserSimilarity in the
pre-recommender (based on items) - and put both into the recommender.
The result is chaos in the recommendations, because the
UserBasedRecommender doesn't work that way.

but
### What I want to do is: ###

        *** "Pre"-Recommender ***
        // set the model for the ItemItemMatrix
        this.similarityModel = new MySQLJDBCDataModel(cPoolDS, 
preferenceTable, userIDColumn, itemIDColumn, preferenceColumn);
       
        // set the "UserSimilarity" for the ItemItemMatrix
        this.similarityUserSimilarity = new 
EuclideanDistanceSimilarity(this.similarityModel);
       
        // set CachingSimilarity
        this.cachingUserSimilarity = new 
CachingUserSimilarity(this.similarityUserSimilarity, this.similarityModel);
       
       *** Recommender ***
        // set the model for the Recommender
        this.model = new MySQLJDBCDataModel(cPoolDS, preferenceTable, 
userIDColumn, itemIDColumn, preferenceColumn);
      
        // Cast to itemSimilarity because the "Users" in the 
"Pre"-Recommender are Items
        this.cachingItemSimilarity = (CachingItemSimilarity) 
this.cachingUserSimilarity;
       
        // set the Recommender with the *cachingItemSimilarity*
        this.recommender = new GenericItemBasedRecommender(this.model, 
this.cachingItemSimilarity);

    -->I know, that's not working, but that way I could compute the
ItemSimilarity on the fly (as a UserSimilarity) and would not have to
precompute all item-item correlations, export them and import them as
GenericItemSimilarity.ItemItemSimilarity. (OK, with your new class I
only have to compute and export them - the DB does the rest.)
I hope you understand my idea - maybe there is a better way to do
that and I just don't see it, because I'm not yet completely
familiar with the Taste code.


Sean Owen wrote:
> I'm getting a little confused so let me clarify. You compute item-item
> similarities using a *user-based* recommender actually. I understand
> that part. In this first phase you treat items as users and item
> features as items. Makes sense. Sounds like that is working
That's working fine, and the results / correlations are plausible.
> Then we arrive at the real recommendation engine. You have tried an
> item-based and user-based recommender. My first reaction is that you
> should probably be using an item-based recommender, or else the
> item-item similarities you computed are not used at all right?
>   
Yes that's what I want.
> Or did I misunderstand, and you are still talking about computing the
> item-item similarity with a user-based versus item-based recommender.
>
> I am not sure what you mean about using a user similarity metric in an
> item-based recommender. No, that algorithm does not use any notion of
> user similarity -- only item similarity. Which you precompute.
>
> Moving data to a JDBC- or Lucene-backed data store won't change the
> quality of recommendations, it will just change the memory and speed
> characteristics.
>   
Yes, I know, and that is important for practical use - at the moment
these are all only tests.
> On Mon, Jul 13, 2009 at 12:19 PM, Thomas Rewig<tr...@mufin.com> wrote:
>   
>> The UserBasedRecommender doesn't work for me! Compared with the
>> item-based recommender, the recommendations of the user-based recommender
>> are really bad. Only/mostly popular items are recommended.
>> I think this is because I have a lot of items and a lot of users - but
>> most users have only a few items, and the overlap between users is sparse.
>> Then there are a few other users who have a lot of items and happen to own
>> an item of the user the recommendation is made for - so items are
>> recommended that have nothing in common with the profile of the user.
>>
>> So I have to use a recommender that is based only on the item-item matrix
>> and the profile of the user the recommendation is made for: the
>> ItemBasedRecommender. But I compute my item-item matrix with a
>> UserSimilarity. Is there a way to get a UserSimilarity into an
>> item-based recommender? Cast it somehow - so that I could compute the
>> item-item similarity "on the fly" like I do now?
>>
>> Otherwise I would precompute the whole item-item matrix like I did before
>> and load the data with Sean's brand new MySQLJDBCItemSimilarity into the
>> item-based system. (I will test this now.) Thank you for that!
>>     
>
>   

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
I'm getting a little confused so let me clarify. You compute item-item
similarities using a *user-based* recommender actually. I understand
that part. In this first phase you treat items as users and item
features as items. Makes sense. Sounds like that is working.

Then we arrive at the real recommendation engine. You have tried an
item-based and user-based recommender. My first reaction is that you
should probably be using an item-based recommender, or else the
item-item similarities you computed are not used at all right?

Or did I misunderstand, and you are still talking about computing the
item-item similarity with a user-based versus item-based recommender.

I am not sure what you mean about using a user similarity metric in an
item-based recommender. No, that algorithm does not use any notion of
user similarity -- only item similarity. Which you precompute.

Moving data to a JDBC- or Lucene-backed data store won't change the
quality of recommendations, it will just change the memory and speed
characteristics.

On Mon, Jul 13, 2009 at 12:19 PM, Thomas Rewig<tr...@mufin.com> wrote:
> The UserBasedRecommender doesn't work for me! Compared with the
> item-based recommender, the recommendations of the user-based recommender
> are really bad. Only/mostly popular items are recommended.
> I think this is because I have a lot of items and a lot of users - but
> most users have only a few items, and the overlap between users is sparse.
> Then there are a few other users who have a lot of items and happen to own
> an item of the user the recommendation is made for - so items are
> recommended that have nothing in common with the profile of the user.
>
> So I have to use a recommender that is based only on the item-item matrix
> and the profile of the user the recommendation is made for: the
> ItemBasedRecommender. But I compute my item-item matrix with a
> UserSimilarity. Is there a way to get a UserSimilarity into an
> item-based recommender? Cast it somehow - so that I could compute the
> item-item similarity "on the fly" like I do now?
>
> Otherwise I would precompute the whole item-item matrix like I did before
> and load the data with Sean's brand new MySQLJDBCItemSimilarity into the
> item-based system. (I will test this now.) Thank you for that!

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Thomas Rewig <tr...@mufin.com>.
>> So I must use the UserSimilarity objects and the UserBasedRecommender
>> although I would prefer the ItemBasedRecommender.
>>     
>
> OK so use CachingUserSimilarity then. There are versions for both.
>   
Hello,

today I rewrote the code of my recommender system to use
CachingUserSimilarity and a UserBasedRecommender.

Of course the memory consumption is no longer significant, and there
are no more 8 GB in RAM. On the other hand, the speed of the
recommender system naturally drops a little. But the first goal is
achieved, and with that memory usage the system could run on an
application server. Later I have to think about the speed and why it
is so slow - but first there is another problem:

The UserBasedRecommender doesn't work for me! Compared with the
item-based recommender, the recommendations of the user-based
recommender are really bad. Only/mostly popular items are recommended.
I think this is because I have a lot of items and a lot of users - but
most users have only a few items, and the overlap between users is
sparse. Then there are a few other users who have a lot of items and
happen to own an item of the user the recommendation is made for - so
items are recommended that have nothing in common with the profile
of the user.

So I have to use a recommender that is based only on the item-item
matrix and the profile of the user the recommendation is made for: the
ItemBasedRecommender. But I compute my item-item matrix with a
UserSimilarity. Is there a way to get a UserSimilarity into an
item-based recommender? Cast it somehow - so that I could compute the
item-item similarity "on the fly" like I do now?

Otherwise I would precompute the whole item-item matrix like I did
before and load the data with Sean's brand new MySQLJDBCItemSimilarity
into the item-based system. (I will test this now.) Thank you for that!

Regards Thomas

-- 
___________________________________________________________
Thomas Rewig, MusicFinder Developer Team

mufin GmbH - Dresden office  email: trewig@mufin.com
August-Bebel-Str. 36         phone: +49 (0)351 / 4794 670
01219 Dresden                fax: +49 (0)351 / 4794 765
Germany                      www: http://business.mufin.com
___________________________________________________________ 


Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Jul 10, 2009 at 2:45 PM, Thomas Rewig<tr...@mufin.com> wrote:
> Because I precompute the item-similarity matrix with a user-based
> similarity and this DB table:
> || aItem || aItemCharacteristic || aItemValue ||  ... so

Oh I see, yes.

> If I use a CachingItemSimilarity I must use an ItemSimilarity:
>
>       aCorrelation = aItemSimilarity.itemSimilarity(item1, item2);
>
> In my example this is, in my opinion, the similarity between
> aItemCharacteristic1 and aItemCharacteristic2, and that isn't interesting
> for me.
> So I must use the UserSimilarity objects and the UserBasedRecommender,
> although I would prefer the ItemBasedRecommender.

OK so use CachingUserSimilarity then. There are versions for both.


> Yes, real-time is the dream :-) but I know it will be hard to reach. I
> will first follow your hints, and if the worst-case recommendation is no
> longer 80s I'm happy :-).

Yes 80 seconds is not reasonable.

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Thomas Rewig <tr...@mufin.com>.
> Yes all that is true. Precomputing is reasonable -- it's the storing
> it in memory that is difficult given the size. You could consider
> keeping the similarities in the database instead, and not loading into
> memory, if you are worried about memory. There is not an
> implementation that reads a database table but we could construct one.
>   
:-) Sounds good. Thanks, but I will first test your other suggestions 
and hints.
> I don't see how UserSimilarity objects come into this. You would not
> use one in an item-based recommender. There is a CachingItemSimilarity
> for ItemSimilarity classes.
>   
Because I precompute the item-similarity matrix with a user-based
similarity and this DB table:
|| aItem || aItemCharacteristic || aItemValue ||  ... so

    * User = aItem
    * Item = aItemCharacteristic
    * Preference = aItemValue

I use this method to get the correlation:

        aCorrelation = aUserSimilarity.userSimilarity(user1, user2);

This is, in my example, the similarity between aItem1 and aItem2.

If I use a CachingItemSimilarity I must use an ItemSimilarity:

        aCorrelation = aItemSimilarity.itemSimilarity(item1, item2);

In my example this is, in my opinion, the similarity between
aItemCharacteristic1 and aItemCharacteristic2, and that isn't
interesting for me.
So I must use the UserSimilarity objects and the UserBasedRecommender,
although I would prefer the ItemBasedRecommender.

... hopefully my line of reasoning is not too confused ;-).
> What you are doing now is effectively pre-computing all similarities
> and caching them in memory, all of them, ahead of time. Using
> CachingItemSimilarity would simply do that for you, and would probably
> use a lot less memory since only pairs that are needed, and accessed
> frequently, will be put into memory. It won't be quite as fast, since
> it will still be re-computing similarities from time to time. But
> overall you will probably use far less memory for a small decrease in
> performance.
>   
OK, I will try this first.
> Beyond that I could suggest more extreme modifications to the code.
> For example, if you are willing to dig into the code to experiment,
> you can try something like this: instead of considering every single
> item for recommendation every time, pre-compute some subset of items
> that are reasonably popular, and then in the code, only consider
> recommending these. It is not a great approach since you want to
> recommend obscure items sometimes, but could help.
>   
I will think about this, but for the moment I will try to recommend all
items. If it isn't fast enough and there is no other idea, I will try it.
> You should also try using the very latest code from subversion. Just
> this week I have made some pretty good improvements to the JDBC code.
>
> Also, it sounds like you are trying to do real-time recommendations,
> like synchronously with a user request. This can be hard since it
> imposes such a tight time limit. Consider doing recommendations
> asynchronously if you can. For example, start computing
> recommendations when the user logs in, and maybe on the 2nd page view
> 5 seconds later, you are ready to recommend something.
>   
Yes, real-time is the dream :-) but I know it will be hard to reach. I
will first follow your hints, and if the worst-case recommendation is
no longer 80s I'm happy :-).

best regards
Thomas

-- 
___________________________________________________________
Thomas Rewig
___________________________________________________________ 


Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
I would think an RDBMS would be more specialized for this task with a
properly indexed table, rather than attempting to cast this as a document
indexing problem - but I base that on nothing at all except my own
speculation. Oh well, at least there are several options now available.

On Jul 13, 2009 12:17 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

I think Ted's suggestion is you'll find Lucene will be _a lot faster_ for
this task as you don't need all the other trappings of a DB.

On Jul 13, 2009, at 4:36 AM, Sean Owen wrote: > How does Lucene go from
item-item links to recom...
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jul 13, 2009 at 9:21 AM, Sean Owen <sr...@gmail.com> wrote:

> It would be interesting to see how it scales
> indeed.
>

It scales very well.  At Veoh we were serving about 400 queries per second
at one point.  This included searches and recommendations, but I think I
remember that one time more than half were recs.


> This doesn't include a notion of item ratings (well, maybe the
> "documents" can include the item tokens several times to indicate a
> stronger association) but that is not a necessary condition for good
> recommendations.
>

Actually it does.  That is in the off-line part.

But, as you likely know by now, I am an anti-fan of using ratings for
recommendations.  I think that the data is suspect and is generally about
two orders of magnitude smaller than other viewing data.  Given that it is
lower quality and vastly smaller, I see no utility in actually spending
thought on using that kind of data.  Often you can use that data for free,
but that is the only price I would pay.

This is not the same as saying you should not allow users to rate things and
share ratings and so on.  Users enjoy doing that.  I just think that the
data is next to useless compared to the alternatives.


> I think the equivalent in CF is a combination of 1)
> an item-based recommender and 2) the log-likelihood similarity metric.
>

Indeed.  And the Lucene-based recommender effectively uses (2) twice.  First
in the off-line reduction of data, second in the implicit weighting
performed by Lucene.

It is also useful to note that it is a piece of cake to integrate various
search functions into this kind of architecture.  Thus, filtering
recommendations by some boolean constraint, or tainting them with a textual
query or recency preference is literally trivial.

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
Nice, well that is pretty much the definition of "item-based
collaborative filtering"! It would be interesting to see how it scales
indeed. This doesn't include a notion of item ratings (well, maybe the
"documents" can include the item tokens several times to indicate a
stronger association) but that is not a necessary condition for good
recommendations. I think the equivalent in CF is a combination of 1)
an item-based recommender and 2) the log-likelihood similarity metric.


On Mon, Jul 13, 2009 at 4:11 PM, Ted Dunning<te...@gmail.com> wrote:
> Also, Lucene automagically does weighting which is very, very similar to
> exactly what you want.
>
> To Sean's question, the trick is that Lucene can store a list of item-item
> links that were filtered by cooccurrence statistics to form a binary matrix
> of interesting links.  Then if you query with a user's recent history of
> items as a query, you get back a list of items formed by considering
> different items to be weighted according to rarity.
>
> The result is quite good, very fast.  The reasons are that Lucene *is*
> weighted matrix multiplication of just the right sort.  This is what I was
> going to talk about in detail at ApacheCon.

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Ted Dunning <te...@gmail.com>.
Also, Lucene automagically does weighting which is very, very similar to
exactly what you want.

To Sean's question, the trick is that Lucene can store a list of item-item
links that were filtered by cooccurrence statistics to form a binary matrix
of interesting links.  Then if you query with a user's recent history of
items as a query, you get back a list of items formed by considering
different items to be weighted according to rarity.

The result is quite good, very fast.  The reasons are that Lucene *is*
weighted matrix multiplication of just the right sort.  This is what I was
going to talk about in detail at ApacheCon.
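
(To make the idea concrete, a minimal sketch against a Lucene 2.4-era
API - the field names, analyzer choice and surrounding plumbing here
are illustrative assumptions, not code from this thread. One document
is indexed per item, whose "linked" field lists the ids of the items it
co-occurs with after the statistical filtering; a user's recent history
then becomes the query, and Lucene's IDF weighting automatically favors
the rarer, more informative items:)

    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.Directory;

    // Indexing: one row of the filtered item-item matrix per document.
    void indexRow(IndexWriter writer, String itemId, String linkedItemIds)
        throws Exception {
      Document doc = new Document();
      doc.add(new Field("item", itemId,
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("linked", linkedItemIds,  // space-separated ids
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }

    // Querying: the user's recent items, OR-ed together; the top hits
    // are the recommendation candidates, weighted by term rarity.
    TopDocs recommend(Directory dir, List<String> recentItemIds)
        throws Exception {
      IndexSearcher searcher = new IndexSearcher(dir);
      BooleanQuery query = new BooleanQuery();
      for (String id : recentItemIds) {
        query.add(new TermQuery(new Term("linked", id)),
            BooleanClause.Occur.SHOULD);
      }
      return searcher.search(query, 10);
    }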

On Mon, Jul 13, 2009 at 4:16 AM, Grant Ingersoll <gs...@apache.org>wrote:

> I think Ted's suggestion is you'll find Lucene will be _a lot faster_ for
> this task as you don't need all the other trappings of a DB.
>
>
>
> On Jul 13, 2009, at 4:36 AM, Sean Owen wrote:
>
>  How does Lucene go from item-item links to recommendations? I'm
>> missing where the notion of user ratings, or even users, come into
>> play, or the strength of the association.
>>
>> If the issue is really just storing the item-item links efficiently in
>> a way that isn't in memory, how about I cook up a JDBC-based
>> implementation? Seems more direct.
>>
>>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Grant Ingersoll <gs...@apache.org>.
I think Ted's suggestion is you'll find Lucene will be _a lot faster_  
for this task as you don't need all the other trappings of a DB.


On Jul 13, 2009, at 4:36 AM, Sean Owen wrote:

> How does Lucene go from item-item links to recommendations? I'm
> missing where the notion of user ratings, or even users, come into
> play, or the strength of the association.
>
> If the issue is really just storing the item-item links efficiently in
> a way that isn't in memory, how about I cook up a JDBC-based
> implementation? Seems more direct.
>
> On Fri, Jul 10, 2009 at 11:56 PM, Ted Dunning<te...@gmail.com>  
> wrote:
>> Yes.
>>
>> One gotcha is that you generally have to limit document size a bit  
>> to get
>> good performance.  This is not a big deal because document  
>> normalization
>> makes it hard for these documents to be retrieved in any case.   
>> Also, these
>> are typically not good second order recommendations.  First order
>> recommendations are the top-40 kinds of things and make poor  
>> recommendations
>> for a bunch of reasons.  Second order recommendations are those  
>> that are
>> based on your history.  They make much better recommendations.
>>
>> On Fri, Jul 10, 2009 at 3:50 PM, Jason Rutherglen <
>> jason.rutherglen@gmail.com> wrote:
>>
>>> Interesting. So we're creating the item-item matrix using one of  
>>> the Mahout
>>> algorithms (like Taste?), then dumping it into Lucene.
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
PS I checked in MySQLJDBCItemSimilarity which would let you store and
read these from a database. At least that solves the memory issue. You
can still put a caching wrapper on top of it too.
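
(A sketch of how that could be wired up - the table and column names
are made up, and the constructor arguments are assumed by analogy with
MySQLJDBCDataModel, so check the actual class for the exact signature:)

    // assumed similarity table schema: item_id_a, item_id_b, similarity
    ItemSimilarity dbSimilarity = new MySQLJDBCItemSimilarity(dataSource,
        "item_similarity", "item_id_a", "item_id_b", "similarity");
    // and the caching wrapper on top, as mentioned:
    ItemSimilarity similarity = new CachingItemSimilarity(dbSimilarity, model);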

On Mon, Jul 13, 2009 at 9:36 AM, Sean Owen<sr...@gmail.com> wrote:
> How does Lucene go from item-item links to recommendations? I'm
> missing where the notion of user ratings, or even users, come into
> play, or the strength of the association.
>
> If the issue is really just storing the item-item links efficiently in
> a way that isn't in memory, how about I cook up a JDBC-based
> implementation? Seems more direct.

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
How does Lucene go from item-item links to recommendations? I'm
missing where the notion of user ratings, or even users, come into
play, or the strength of the association.

If the issue is really just storing the item-item links efficiently in
a way that isn't in memory, how about I cook up a JDBC-based
implementation? Seems more direct.

On Fri, Jul 10, 2009 at 11:56 PM, Ted Dunning<te...@gmail.com> wrote:
> Yes.
>
> One gotcha is that you generally have to limit document size a bit to get
> good performance.  This is not a big deal because document normalization
> makes it hard for these documents to be retrieved in any case.  Also, these
> are typically not good second order recommendations.  First order
> recommendations are the top-40 kinds of things and make poor recommendations
> for a bunch of reasons.  Second order recommendations are those that are
> based on your history.  They make much better recommendations.
>
> On Fri, Jul 10, 2009 at 3:50 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> Interesting. So we're creating the item-item matrix using one of the Mahout
>> algorithms (like Taste?), then dumping it into Lucene.
>>
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Ted Dunning <te...@gmail.com>.
Yes.

One gotcha is that you generally have to limit document size a bit to get
good performance.  This is not a big deal because document normalization
makes it hard for these documents to be retrieved in any case.  Also, these
are typically not good second order recommendations.  First order
recommendations are the top-40 kinds of things and make poor recommendations
for a bunch of reasons.  Second order recommendations are those that are
based on your history.  They make much better recommendations.

On Fri, Jul 10, 2009 at 3:50 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Interesting. So we're creating the item-item matrix using one of the Mahout
> algorithms (like Taste?), then dumping it into Lucene.
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Pallavi Palleti <pa...@corp.aol.com>.
Hi Ted,

I would be interested to work on it. Kindly let me know what are the next steps.

Thanks
Pallavi
----- Original Message -----
From: "Ted Dunning" <te...@gmail.com>
To: mahout-user@lucene.apache.org
Sent: Saturday, July 11, 2009 4:22:54 AM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: Re: Memory and Speed Questions for Item-Based-Recommender

Hmmm.... don't hold your breath.  I will see if I have some time.

Is there somebody who would like to meet for a hackathon on this?  I could
put in 4-6 hours on a weekend.   If somebody else would contribute the
polish time after that, we could probably have a very nice example out of
this.

On Fri, Jul 10, 2009 at 3:50 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

>  I don't have any
> experience with the item-item matrix part so working on an example will
> help
> me understand it better. Showing the Lucene part may help others who work
> along these lines.
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Jason Rutherglen <ja...@gmail.com>.
> I will see if I have some time.

Ah, well thanks for thinking of working on it - I actually meant it would
be a good project for me to work on in order to learn about an end-to-end
recommendation system combined with Lucene, with the (hopefully)
successful end result being an example others can reference. I probably
need some upfront guidance in terms of what to use in Mahout to produce
the item-item matrix.

On Fri, Jul 10, 2009 at 3:52 PM, Ted Dunning <te...@gmail.com> wrote:

> Hmmm.... don't hold your breath.  I will see if I have some time.
>
> Is there somebody who would like to meet for a hackathon on this?  I could
> put in 4-6 hours on a weekend.   If somebody else would contribute the
> polish time after that, we could probably have a very nice example out of
> this.
>
> On Fri, Jul 10, 2009 at 3:50 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
> >  I don't have any
> > experience with the item-item matrix part so working on an example will
> > help
> > me understand it better. Showing the Lucene part may help others who work
> > along these lines.
> >
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Ted Dunning <te...@gmail.com>.
Hmmm.... don't hold your breath.  I will see if I have some time.

Is there somebody who would like to meet for a hackathon on this?  I could
put in 4-6 hours on a weekend.   If somebody else would contribute the
polish time after that, we could probably have a very nice example out of
this.

On Fri, Jul 10, 2009 at 3:50 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

>  I don't have any
> experience with the item-item matrix part so working on an example will
> help
> me understand it better. Showing the Lucene part may help others who work
> along these lines.
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Jason Rutherglen <ja...@gmail.com>.
Interesting. So we're creating the item-item matrix using one of the Mahout
algorithms (like Taste?), then dumping it into Lucene. I don't have any
experience with the item-item matrix part so working on an example will help
me understand it better. Showing the Lucene part may help others who work
along these lines.

On Fri, Jul 10, 2009 at 12:57 PM, Ted Dunning <te...@gmail.com> wrote:

> Don't think so.  Sean should comment definitively.
>
> It is actually very easy to do.  The output of the recommendation off-line
> process (in my case, statistical filtering of the cooccurrence matrix, in
> other cases something different) is generally a sparse matrix of item-item
> links.  Each line of this sparse matrix can be considered a document in
> creating a Lucene index.  You will have to use a correct analyzer and a
> line
> by line document segmenter, but that is trivial.
>
> Then recommendation is a simple query step.
>
> You guys at LinkedIn have a nice ability to present Lucene results in
> real-time, so the part after getting the item-item matrix should be dead
> simple for you.
>
> On Fri, Jul 10, 2009 at 12:48 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
> > Is there an example of this (using Lucene to store item-item links in
> > Lucene) in Mahout?  Sounds interesting.
> >
> > On Fri, Jul 10, 2009 at 11:35 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Storing the item-item links in Lucene and forming a query with recent
> > > history is a pretty easy way to get real-time recommendations.  This
> can
> > > also get rid of the cache because standard measures applied to make
> > Lucene
> > > fast will work on this.
> > >
> >
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Ted Dunning <te...@gmail.com>.
Don't think so.  Sean should comment definitively.

It is actually very easy to do.  The output of the recommendation off-line
process (in my case, statistical filtering of the cooccurrence matrix, in
other cases something different) is generally a sparse matrix of item-item
links.  Each line of this sparse matrix can be considered a document in
creating a Lucene index.  You will have to use a correct analyzer and a line
by line document segmenter, but that is trivial.

Then recommendation is a simple query step.

You guys at LinkedIn have a nice ability to present Lucene results in
real-time, so the part after getting the item-item matrix should be dead
simple for you.

On Fri, Jul 10, 2009 at 12:48 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Is there an example of this (using Lucene to store item-item links in
> Lucene) in Mahout?  Sounds interesting.
>
> On Fri, Jul 10, 2009 at 11:35 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Storing the item-item links in Lucene and forming a query with recent
> > history is a pretty easy way to get real-time recommendations.  This can
> > also get rid of the cache because standard measures applied to make
> Lucene
> > fast will work on this.
> >
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Jason Rutherglen <ja...@gmail.com>.
Is there an example of this (using Lucene to store item-item links in
Lucene) in Mahout?  Sounds interesting.

On Fri, Jul 10, 2009 at 11:35 AM, Ted Dunning <te...@gmail.com> wrote:

> Storing the item-item links in Lucene and forming a query with recent
> history is a pretty easy way to get real-time recommendations.  This can
> also get rid of the cache because standard measures applied to make Lucene
> fast will work on this.
>
> On Fri, Jul 10, 2009 at 5:34 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > Also, it sounds like you are trying to do real-time recommendations,
> > like synchronously with a user request. This can be hard since it
> > imposes such a tight time limit. Consider doing recommendations
> > asynchronously if you can. For example, start computing
> > recommendations when the user logs in, and maybe on the 2nd page view
> > 5 seconds later, you are ready to recommend something.
> >
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Ted Dunning <te...@gmail.com>.
Storing the item-item links in Lucene and forming a query with recent
history is a pretty easy way to get real-time recommendations.  This can
also get rid of the cache because standard measures applied to make Lucene
fast will work on this.

On Fri, Jul 10, 2009 at 5:34 AM, Sean Owen <sr...@gmail.com> wrote:

> Also, it sounds like you are trying to do real-time recommendations,
> like synchronously with a user request. This can be hard since it
> imposes such a tight time limit. Consider doing recommendations
> asynchronously if you can. For example, start computing
> recommendations when the user logs in, and maybe on the 2nd page view
> 5 seconds later, you are ready to recommend something.
>

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Jul 10, 2009 at 1:18 PM, Thomas Rewig<tr...@mufin.com> wrote:
> OK, I will test with the Caching(?)Similarity. If I understand you right,
> this will mean I:
>
>   * create a DataModel_1 (MySQLDB) in this way: aItem,
>     aItemCharacteristic, aItemValue (each aItem has 40
>     aItemCharacteristics)
>   * create a UserSimilarity so that I have the similarity of the
>     aItems (if I used an ItemSimilarity I would get the similarity of
>     the aItemCharacteristics ... right?)
>   * create a CachingUserSimilarity and put DataModel_1 and the
>     UserSimilarity in there
>   * create a DataModel_2 (MySQLDB) in this way:
>     aUser, aItem, aItemPreference
>   * create the Neighborhood
>   * create a UserBasedRecommender and put the Neighborhood, the
>     DataModel_2 and the CachingUserSimilarity in there
>   * create a CachingRecommender
>   * et voilà :-) I have a working, memory-sparing recommender
>
> But I can't do that with an item-based recommender because I have no
> ItemCorrelation (because the similarity of aItemCharacteristic doesn't
> matter), is that right? So the sentence in the docs - "So, item-based
> recommenders can use pre-computed similarity values in the computations,
> which make them much faster. For large data sets, item-based
> recommenders are more appropriate" - doesn't work for me. Or?

Yes all that is true. Precomputing is reasonable -- it's the storing
it in memory that is difficult given the size. You could consider
keeping the similarities in the database instead, and not loading into
memory, if you are worried about memory. There is not an
implementation that reads a database table but we could construct one.

I don't see how UserSimilarity objects come into this. You would not
use one in an item-based recommender. There is a CachingItemSimilarity
for ItemSimilarity classes.

What you are doing now is effectively pre-computing all similarities
and caching them in memory, all of them, ahead of time. Using
CachingItemSimilarity would simply do that for you, and would probably
use a lot less memory since only pairs that are needed, and accessed
frequently, will be put into memory. It won't be quite as fast, since
it will still be re-computing similarities from time to time. But
overall you will probably use far less memory for a small decrease in
performance.

Beyond that I could suggest more extreme modifications to the code.
For example, if you are willing to dig into the code to experiment,
you can try something like this: instead of considering every single
item for recommendation every time, pre-compute some subset of items
that are reasonably popular, and then in the code, only consider
recommending these. It is not a great approach since you want to
recommend obscure items sometimes, but could help.
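
(Without digging into the internals, a crude approximation of this idea
is a Rescorer that filters out everything not in a precomputed popular
set. This is a sketch only: loadPopularItems() is a hypothetical helper,
the import paths are assumed for this Taste version, and whether
returning NaN really excludes an item here should be verified:)

    import java.util.List;
    import java.util.Set;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.model.Item;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.recommender.Rescorer;

    List<RecommendedItem> recommendPopularOnly(Recommender recommender,
        Object userId) throws TasteException {
      final Set<Object> popularItemIds = loadPopularItems(); // hypothetical
      Rescorer<Item> onlyPopular = new Rescorer<Item>() {
        public double rescore(Item item, double originalScore) {
          // assumed convention: NaN drops the item from consideration
          return popularItemIds.contains(item.getID())
              ? originalScore : Double.NaN;
        }
      };
      return recommender.recommend(userId, 10, onlyPopular);
    }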

You should also try using the very latest code from subversion. Just
this week I have made some pretty good improvements to the JDBC code.

Also, it sounds like you are trying to do real-time recommendations,
like synchronously with a user request. This can be hard since it
imposes such a tight time limit. Consider doing recommendations
asynchronously if you can. For example, start computing
recommendations when the user logs in, and maybe on the 2nd page view
5 seconds later, you are ready to recommend something.



> Yes I do, but every .recommend call internally runs in a single thread
> in Taste. Is that right?

Yes, internally there is no multi-threading. You would do it externally.
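
(For illustration, "externally" could look roughly like this - the pool
size and howMany value are arbitrary, and the error handling is only a
sketch:)

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    void recommendInParallel(final Recommender recommender,
        Iterable<Object> userIds) {
      ExecutorService pool = Executors.newFixedThreadPool(4);
      for (final Object userId : userIds) {
        pool.submit(new Runnable() {
          public void run() {
            try {
              // each worker thread issues its own recommend() call
              List<RecommendedItem> recs = recommender.recommend(userId, 10);
              // ... hand recs off to the application
            } catch (TasteException te) {
              // log and continue with the next user
            }
          }
        });
      }
      pool.shutdown();
    }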

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Thomas Rewig <tr...@mufin.com>.
Thank you for your fast reply!

Sean Owen:
> On Fri, Jul 10, 2009 at 10:03 AM, Thomas Rewig<tr...@mufin.com> wrote:
>>     Question 1:
>>     The similarity matrix uses 400MB of memory in the MySQL DB - by
>>     setting the ItemCorrelation, 8GB of RAM is used to load the
>>     similarity matrix as a GenericItemSimilarity. Is it
>>     possible/plausible that this matrix uses more than 20 times as much
>>     memory in RAM as in the database - or have I done something wrong?
>
> I could believe this. 100,000 items means about 5,000,000,000
> item-item pairs are possible. Many are not kept, but seeing as each
> once requires 30 or so bytes of memory, I am not surprised that it
> could take 8GB.
>
> That's really a lot to keep in memory. I might suggest, instead, that
> you not pre-compute the similarities, but instead compute them as
> needed and cache (use CachingItemSimilarity). That way you are not
> spending so much memory on pairs that may never get used, but still
> get much of the speed improvement.
At the moment, to get the similarity matrix I do this:

    * create a DataModel (MySQLDB) in this way: aItem,
      aItemCharacteristic, aItemValue (each aItem has 40
      aItemCharacteristics; later there will be more)
    * set a UserSimilarity - Pearson or Euclidean
    * compute all similarities in a multithreaded way (see the sketch
      below): aCorrelation = aUserSimilarity.userSimilarity(user1,
      user2); - this is stressful for the CPU, but in 4 hours it is
      done - not bad for n!/(n-2)! combinations ;-)
    * save them if they correlate above 0.95
    * load them into a GenericItemSimilarity to use in an
      ItemBasedRecommender
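
(The multithreaded step could look roughly like this - the pool size,
the 0.95 threshold and the saveCorrelation persistence helper are
placeholders, and the import paths are assumed for this Taste version:)

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.model.User;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    void computeAllPairs(final UserSimilarity sim, List<User> items)
        throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(4);
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          final User a = items.get(i); // the "users" here are really items
          final User b = items.get(j);
          pool.submit(new Runnable() {
            public void run() {
              try {
                double corr = sim.userSimilarity(a, b);
                if (corr > 0.95) {
                  saveCorrelation(a, b, corr); // hypothetical persistence helper
                }
              } catch (TasteException te) {
                // log and continue
              }
            }
          });
        }
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.DAYS);
    }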


OK, I will test with the Caching(?)Similarity. If I understand you
right, this will mean I:

    * create a DataModel_1 (MySQLDB) in this way: aItem,
      aItemCharacteristic, aItemValue (each aItem has 40
      aItemCharacteristics)
    * create a UserSimilarity so that I have the similarity of the
      aItems (if I used an ItemSimilarity I would get the similarity of
      the aItemCharacteristics ... right?)
    * create a CachingUserSimilarity and put DataModel_1 and the
      UserSimilarity in there
    * create a DataModel_2 (MySQLDB) in this way:
      aUser, aItem, aItemPreference
    * create the Neighborhood
    * create a UserBasedRecommender and put the Neighborhood, the
      DataModel_2 and the CachingUserSimilarity in there
    * create a CachingRecommender
    * et voilà :-) I have a working, memory-sparing recommender

But I can't do that with an item-based recommender, because I have no
ItemCorrelation (because the similarity of aItemCharacteristic doesn't
matter), is that right? So the sentence in the docs - "So, item-based
recommenders can use pre-computed similarity values in the computations,
which make them much faster. For large data sets, item-based
recommenders are more appropriate" - doesn't work for me. Or?

At the moment I have a test set of 500,000 users and 100,000 items. The
item similarity is computed with Taste, but from external data.

Sean Owen:
>>
>>   Question:
>>   Is there a way to increase the speed of a recommendation? (use
>>   InnoDB? compute fewer items ... somehow ;-) ...?)
>
> Your indexes are right. Are you using a connection pool? that is
> really important.
Yes I do use a connection pool:

        this.cPoolDS = new ConnectionPoolDataSource(dataSource);
        this.aConnection = cPoolDS.getConnection();

Sean Owen:
> How many users do you have? If you have relatively few users, you
> might use a user-based recommender instead. Or, consider a slope-one
> recommender.
At the moment there are 5 times more users than items - later this could
change to 1.5 million items and 150,000 users, but first my tests must
work. I tested the slope-one recommender back when Taste wasn't in
Mahout, and I found that the recommendations didn't work for me. Has
something changed? ... maybe I should give it another try.

Sean Owen:
> It sounds like you have a lot of items, so given the way item-based
> recommenders work, it will be slow.
>
> Using CachingItemSimilarity could help. I am surprised that a
> FileDataModel isn't much faster, since it loads data in memory. That
> suggests to me that the database isn't the bottleneck.
>
> Are you using multiple threads to compute recommendations
> simultaneously? You certainly can, to take advantage of the 4 cores.
Yes I do, but every .recommend call internally runs in a single
thread in Taste. Is that right?

best regards
Thomas
-- 
___________________________________________________________
Thomas Rewig
___________________________________________________________

Re: Memory and Speed Questions for Item-Based-Recommender

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Jul 10, 2009 at 10:03 AM, Thomas Rewig<tr...@mufin.com> wrote:
>     Question 1:
>     The similarity matrix uses 400MB of memory in the MySQL DB - by
>     setting the ItemCorrelation, 8GB of RAM is used to load the
>     similarity matrix as a GenericItemSimilarity. Is it
>     possible/plausible that this matrix uses more than 20 times as much
>     memory in RAM as in the database - or have I done something wrong?

I could believe this. 100,000 items means about 5,000,000,000
item-item pairs are possible. Many are not kept, but seeing as each
one requires 30 or so bytes of memory, I am not surprised that it
could take 8GB.

That's really a lot to keep in memory. I might suggest, instead, that
you not pre-compute the similarities, but instead compute them as
needed and cache (use CachingItemSimilarity). That way you are not
spending so much memory on pairs that may never get used, but still
get much of the speed improvement.
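
(For illustration, the wiring could look like this - the choice of
EuclideanDistanceSimilarity is just an example, not a prescription:)

    // compute item-item similarities lazily and cache the results
    ItemSimilarity base = new EuclideanDistanceSimilarity(model);
    ItemSimilarity cached = new CachingItemSimilarity(base, model);
    Recommender recommender = new GenericItemBasedRecommender(model, cached);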


>     Question 2:
>     How can I reduce the memory consumption of the
>     GenericItemSimilarity? The constructor |GenericItemSimilarity
>
> <http://lucene.apache.org/mahout/javadoc/core/org/apache/mahout/cf/taste/impl/similarity/GenericItemSimilarity.html#GenericItemSimilarity%28java.lang.Iterable,%20int%29>(Iterable
>
> <http://java.sun.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true><GenericItemSimilarity.ItemItemSimilarity
>
> <http://lucene.apache.org/mahout/javadoc/core/org/apache/mahout/cf/taste/impl/similarity/GenericItemSimilarity.ItemItemSimilarity.html>>
> similarities,
>     int maxToKeep)|
>     doesn't work, because if maxToKeep is too small, the
>     recommendations will be bad ...

Yeah you are already filtering out many of the less important
correlations anyway. You could filter yet more, to reduce memory
requirements, but I think it's just best to not try to store all of
this in memory. It doesn't scale well.


>  2. Speed of recommendation: I use a MySQLJDBCDataModel - MyISAM.
>     Primary key and indexes are set:
>     PRIMARY KEY (user_id, item_id), INDEX (user_id), INDEX (item_id).
>     A recommendation for a user takes between 0.5 and 80 seconds - I
>     would like it to take just 300ms.
>
>   By the way, I use a quad-core 3.2 GHz machine with 32GB of RAM to
>   compute the recommendations, so maybe the DB is the bottleneck. But
>   if I use a FileDataModel it is faster, though not by much.
>
>   Here's a log for a user with 2000 associated items:
>
>   INFO  CollaborativeModel - Seconds to set ItemCorrelation: 76.187506 s
>   INFO  CollaborativeModel - Seconds to set Recommender:
>   0.025945000000000003 s
>   INFO  CollaborativeModel - Seconds to set CachingRecommender: 0.06511 s
>   INFO  CollaborativeController - SECONDS TO REFRESH THE SYSTEM:
>   6.450000000000001E-4 s
>   INFO  root - SECONDS TO GET A RECOMMENDATION FOR USER: 50.888347 s
>
>   Question:
>   Is there a way to increase the speed of a recommendation? (use
>   InnoDB? compute fewer items ... somehow ;-) ...?)

Your indexes are right. Are you using a connection pool? That is
really important.

How many users do you have? If you have relatively few users, you
might use a user-based recommender instead. Or consider a slope-one
recommender.
It sounds like you have a lot of items, so given the way item-based
recommenders work, it will be slow.

Using CachingItemSimilarity could help. I am surprised that a
FileDataModel isn't much faster, since it loads data into memory. That
suggests to me that the database isn't the bottleneck.

Are you using multiple threads to compute recommendations
simultaneously? You certainly can, to take advantage of the 4 cores.