Posted to user@mahout.apache.org by Najum Ali <na...@googlemail.com> on 2014/05/03 17:07:14 UTC

Re: Performance Issue using item-based approach!

Hi there, 

I mentioned a problem with using the ItemBasedRecommender: it is much slower than the UserBasedRecommender.

@Sebastian: You said limiting the precomputation file should work, for example keeping only 50 similarities per item. You also said this feature is not included in the precomputation yet.
However, when using MultithreadedBatchItemSimilarities (Mahout 0.9), I saw that the constructor accepts the following arguments:

/**
 * @param recommender recommender to use
 * @param similarItemsPerItem number of similar items to compute per item
 */
public MultithreadedBatchItemSimilarities(ItemBasedRecommender recommender, int similarItemsPerItem) {
  this(recommender, similarItemsPerItem, DEFAULT_BATCH_SIZE);
}
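
For reference, here is a minimal sketch of how the precomputation and the reload can be wired up (the file paths, the LogLikelihoodSimilarity, and the thread/duration values are just placeholder assumptions, not necessarily what I use):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;

public class PrecomputeTopKSimilarities {

  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path

    // recommender that drives the batch computation
    ItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));

    // keep only the top 15 similar items per item
    BatchItemSimilarities batch = new MultithreadedBatchItemSimilarities(recommender, 15);

    // 4 threads, at most 1 hour, output written as itemA,itemB,similarity lines
    int count = batch.computeItemSimilarities(4, 1,
        new FileSimilarItemsWriter(new File("similarities.csv")));
    System.out.println(count + " similarities written");

    // afterwards the precomputed file backs the item-based recommender
    ItemBasedRecommender itemBased = new GenericItemBasedRecommender(
        model, new FileItemSimilarity(new File("similarities.csv")));
    System.out.println(itemBased.recommend(1L, 10));
  }
}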

And in fact, if I set similarItemsPerItem to 15, the CSV file contains only 15 similar items per item. Why did you say that this feature is not implemented yet? Maybe you meant something else and I misunderstood, so I am a bit confused. The problem is that even with a limited number of similar-item pairs, the user-based approach is much faster:

Using a file with 6040 users and 3706 items:
A UserBasedRecommender with k = 50 takes 331 ms.

The item-based recommender takes 1510 ms, and with precomputed similarities it still takes 836 ms, more than twice as slow as the user-based one. Is there no way to restrict something like the „neighborhood size“ that the user-based approach has?
I have also tried SamplingCandidateItemsStrategy with, e.g., 10 for each of the first three arguments, and also tried the CachingSimilarity decorator, but nothing seems to help.
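
In essence, the timing comparison can be set up like this (again only a sketch; the PearsonCorrelationSimilarity, the user id, and the file names are placeholder assumptions rather than exactly what my attached file does):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class CompareRecommenders {

  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path

    // user-based: neighborhood limited to the 50 nearest users
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, userSimilarity, model);
    Recommender userBased = new GenericUserBasedRecommender(model, neighborhood, userSimilarity);

    // item-based: similarities read from the precomputed top-15 file
    Recommender itemBased = new GenericItemBasedRecommender(
        model, new FileItemSimilarity(new File("similarities.csv")));

    time("user-based", userBased);
    time("item-based (precomputed)", itemBased);
  }

  private static void time(String label, Recommender recommender) throws Exception {
    long start = System.currentTimeMillis();
    recommender.recommend(1L, 10); // placeholder user id
    System.out.println(label + ": " + (System.currentTimeMillis() - start) + " ms");
  }
}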

Please find attached the Java file for this test.

And yes, I am using the GroupLens MovieLens 1M dataset.

Could the dataset be the cause, as Sebastian mentioned before:

>>>>>> In the movielens dataset this is true for almost all pairs of items,
>>>>>> unfortunately. From 3076 items, more than 11 million similarities are
>>>>>> created. A common approach for that (which is not yet implemented in
>>>>>> our
>>>>>> precomputation unfortunately) is to only retain the top-k similar items

I hope to get some help from you guys .. this is getting very depressing :(

Regards
Najum Ali


Re: Performance Issue using item-based approach!

Posted by Ted Dunning <te...@gmail.com>.
Truer words than this were never said. 

Sent from my iPhone

> On May 9, 2014, at 8:36, Pat Ferrel <pa...@gmail.com> wrote:
> 
> let your data determine this, not example data.

Re: Performance Issue using item-based approach!

Posted by Pat Ferrel <pa...@gmail.com>.
Can we step back a bit: is query speed the only issue? Why do you care how long it takes? This is example data, not yours.

Some of the techniques you mention below are Hadoop mapreduce based approaches, which by their nature are batch oriented. The mapreduce item-based recommender may take hours to complete, but it calculates all recs for all users. The output is then expected to be put into some fast serving component like a database, so the lookup is just a column of item ids associated with a user: very simple and super fast. This will scale to any size your DB can support, but it requires Hadoop to be installed. If you want fast queries you can’t get faster than precomputing them.

If you want to use the in-memory recommender because it allows you to ask for a specific user’s recommendations (bypassing the need for a DB), it will not scale as far. See if giving it more memory helps. Why do you need to use the item-based approach? It is not necessarily any better; both still recommend items, so use the fastest.

Also remember that your data may have completely different characteristics. MovieLens is fine for experimentation, but what is your data like? Will it even fit inside an in-memory recommender? How many users, items, and interactions are there (the matrix is usually very sparse)? The larger this is, the more likely the in-memory version won’t work.

Ted’s suggestion of using ItemSimilarityJob to create an indicator matrix, indexing it with Solr, and querying a user’s preferences against the indicators will produce recs in a few milliseconds. This also scales with Solr, which is known to scale quite well. It requires Hadoop and Solr but not necessarily a DB (though one would be nice).
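
Very roughly, the query side of that approach is one Solr search using the user’s recent item ids as query terms. A sketch with SolrJ, where the core name and the "indicators" field are made-up placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class IndicatorQuery {

  public static void main(String[] args) throws Exception {
    // each Solr document is an item; its "indicators" field holds the ids of
    // co-occurring items produced by ItemSimilarityJob (field and core names are placeholders)
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

    // the user's recent history, used as the query against the indicator field
    SolrQuery query = new SolrQuery("indicators:(1193 661 914)");
    query.setRows(10);

    QueryResponse response = solr.query(query);
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}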

For experimentation, when you are not using your own data, I’m not sure I see a problem here. There are many ways to make it faster, but they may require using a different approach; let your data determine this, not example data.

On May 3, 2014, at 8:27 AM, Najum Ali <na...@googlemail.com> wrote:

(Resending mail without sending my digital signature)

Hi there, 

I mentioned a problem with using the ItemBasedRecommender: it is much slower than the UserBasedRecommender.

@Sebastian: You said limiting the precomputation file should work, for example keeping only 50 similarities per item. You also said this feature is not included in the precomputation yet.
However, when using MultithreadedBatchItemSimilarities (Mahout 0.9), I saw that the constructor accepts the following arguments:

/**
 * @param recommender recommender to use
 * @param similarItemsPerItem number of similar items to compute per item
 */
public MultithreadedBatchItemSimilarities(ItemBasedRecommender recommender, int similarItemsPerItem) {
  this(recommender, similarItemsPerItem, DEFAULT_BATCH_SIZE);
}

And in fact, if I set similarItemsPerItem to 15, the CSV file contains only 15 similar items per item. Why did you say that this feature is not implemented yet? Maybe you meant something else and I misunderstood, so I am a bit confused. The problem is that even with a limited number of similar-item pairs, the user-based approach is much faster:

Using a file with 6040 users and 3706 items:
A UserBasedRecommender with k = 50 takes 331 ms.

The item-based recommender takes 1510 ms, and with precomputed similarities it still takes 836 ms, more than twice as slow as the user-based one. Is there no way to restrict something like the „neighborhood size“ that the user-based approach has?
I have also tried SamplingCandidateItemsStrategy with, e.g., 10 for each of the first three arguments, and also tried the CachingSimilarity decorator, but nothing seems to help.

Please find attached the Java file for this test.

And yes, I am using the GroupLens MovieLens 1M dataset.

Could the dataset be the cause, as Sebastian mentioned before:

>>>>>> In the movielens dataset this is true for almost all pairs of items,
>>>>>> unfortunately. From 3076 items, more than 11 million similarities are
>>>>>> created. A common approach for that (which is not yet implemented in
>>>>>> our
>>>>>> precomputation unfortunately) is to only retain the top-k similar items

I hope to get some help from you guys .. this is getting very depressing :(

Regards
Najum Ali



Fwd: Performance Issue using item-based approach!

Posted by Najum Ali <na...@googlemail.com>.
(Resending mail without sending my digital signature)

Hi there, 

I mentioned a problem with using the ItemBasedRecommender: it is much slower than the UserBasedRecommender.

@Sebastian: You said limiting the precomputation file should work, for example keeping only 50 similarities per item. You also said this feature is not included in the precomputation yet.
However, when using MultithreadedBatchItemSimilarities (Mahout 0.9), I saw that the constructor accepts the following arguments:

/**
 * @param recommender recommender to use
 * @param similarItemsPerItem number of similar items to compute per item
 */
public MultithreadedBatchItemSimilarities(ItemBasedRecommender recommender, int similarItemsPerItem) {
  this(recommender, similarItemsPerItem, DEFAULT_BATCH_SIZE);
}

And in fact, if I set similarItemsPerItem to 15, the CSV file contains only 15 similar items per item. Why did you say that this feature is not implemented yet? Maybe you meant something else and I misunderstood, so I am a bit confused. The problem is that even with a limited number of similar-item pairs, the user-based approach is much faster:

Using a file with 6040 users and 3706 items:
A UserBasedRecommender with k = 50 takes 331 ms.

The item-based recommender takes 1510 ms, and with precomputed similarities it still takes 836 ms, more than twice as slow as the user-based one. Is there no way to restrict something like the „neighborhood size“ that the user-based approach has?
I have also tried SamplingCandidateItemsStrategy with, e.g., 10 for each of the first three arguments, and also tried the CachingSimilarity decorator, but nothing seems to help.

Please find attached the Java file for this test.

And yes, I am using the GroupLens MovieLens 1M dataset.

Could the dataset be the cause, as Sebastian mentioned before:

>>>>>> In the movielens dataset this is true for almost all pairs of items,
>>>>>> unfortunately. From 3076 items, more than 11 million similarities are
>>>>>> created. A common approach for that (which is not yet implemented in
>>>>>> our
>>>>>> precomputation unfortunately) is to only retain the top-k similar items

I hope to get some help from you guys .. this is getting very depressing :(

Regards
Najum Ali