Posted to user@mahout.apache.org by Uwe Reimann <uw...@junocon.de> on 2011/05/23 11:40:27 UTC

Lucene for UserSimilarity

Hi,

I'm currently integrating Mahout's recommendation engine into a site.

I'm not quite clear on which DataModel to use. PostgresJdbcDataModel looks
handy, but seems to produce way too many queries. ReloadFromJDBCDataModel
seems to address that problem but still needs to calculate the
similarity of a given user to every other user in the system.

Would it be possible and performant to use Lucene to perform the search
for the top n most similar users, provided an index exists where the
user id is the document id and the preferences of the users are the term
vectors?

If that's possible, would it additionally be possible to use negative 
values in the term vector for recording dislikes of the user?

Best regards,

Uwe
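
For illustration only, the idea in the question (treat each user's
preferences as a sparse vector, allow negative weights for dislikes, then
pick the n closest users) can be sketched in plain Java without Lucene or
Mahout. The class and method names are hypothetical, and cosine similarity
is an assumed stand-in for whatever scoring an index would apply. This
brute-force version still compares the user against every other user, which
is exactly the cost an index would be meant to avoid.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: brute-force top-n neighbours over sparse preference vectors. */
final class SparseNeighbourSketch {

    /** Cosine similarity of two sparse vectors (item id -> preference, negative = dislike). */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // An empty or all-zero vector has no direction; report 0 instead of NaN.
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<Long, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Ids of the (at most) n users most similar to userId, best first. */
    static List<Long> mostSimilar(Map<Long, Map<Long, Double>> prefs, long userId, int n) {
        Map<Long, Double> target = prefs.get(userId);
        if (target == null) {
            return Collections.emptyList();
        }
        List<Long> candidates = new ArrayList<>();
        for (Long other : prefs.keySet()) {
            if (other != userId) {
                candidates.add(other);
            }
        }
        // Similarities are recomputed during sorting; acceptable for a sketch only.
        candidates.sort(Comparator.comparingDouble(
                (Long other) -> cosine(target, prefs.get(other))).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

With signed values, two users who like and dislike the same items score
close to 1, while users with opposite tastes score close to -1.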

Re: Lucene for UserSimilarity

Posted by Sean Owen <sr...@gmail.com>.

OK, that works, though LogLikelihoodSimilarity would probably give better
results.
On May 24, 2011 12:17 PM, "Uwe Reimann" <uw...@junocon.de> wrote:
> On 24.05.2011 12:36, Sean Owen wrote:
>> On Tue, May 24, 2011 at 11:31 AM, Uwe Reimann <uw...@junocon.de> wrote:
>>
>>> I did some testing of the different recommenders on a real data set from a
>>> bookmarking site. GenericBooleanPrefItemBasedRecommender did not work very
>>> well for me. It seemed to recommend the top links. Using
>>> GenericUserBasedRecommender worked way better (after some tweaking), which
>>> recommended links that actually fit my interests. Might need to do some more
>>> testing here.
>> Were you using "compatible" similarity implementations? Pearson is
>> meaningless on boolean data and you would get poor results.
> I was using TanimotoCoefficientSimilarity (for both,
> GenericBooleanPrefUserBasedRecommender and
> GenericBooleanPrefItemBasedRecommender).

Re: Lucene for UserSimilarity

Posted by Uwe Reimann <uw...@junocon.de>.

On 24.05.2011 12:36, Sean Owen wrote:
> On Tue, May 24, 2011 at 11:31 AM, Uwe Reimann <uw...@junocon.de> wrote:
>
>> I did some testing of the different recommenders on a real data set from a
>> bookmarking site. GenericBooleanPrefItemBasedRecommender did not work very
>> well for me. It seemed to recommend the top links. Using
>> GenericUserBasedRecommender worked way better (after some tweaking), which
>> recommended links that actually fit my interests. Might need to do some more
>> testing here.
> Were you using "compatible" similarity implementations? Pearson is
> meaningless on boolean data and you would get poor results.
I was using TanimotoCoefficientSimilarity (for both, 
GenericBooleanPrefUserBasedRecommender and 
GenericBooleanPrefItemBasedRecommender).
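
As a reminder of what a Tanimoto coefficient measures on boolean data, here
is a minimal plain-Java sketch: the size of the intersection of two users'
item sets divided by the size of their union. The class name is hypothetical
and this is not Mahout's TanimotoCoefficientSimilarity code, only the
textbook formula.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the Tanimoto (Jaccard) coefficient on item-id sets. */
final class TanimotoSketch {

    /** |A intersect B| / |A union B|, or NaN when both sets are empty. */
    static double similarity(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN; // nothing to compare
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

Only the presence of an item matters here, so the measure fits data without
rating values.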

Re: Lucene for UserSimilarity

Posted by Sean Owen <sr...@gmail.com>.

On Tue, May 24, 2011 at 11:31 AM, Uwe Reimann <uw...@junocon.de> wrote:
> That probably depends on how many data points were available before. I
> suspect that, e.g., the 5th data point has a greater impact than the 105th.
> Is there a lower limit (above 1) on the number of data points a user must
> have before recommendations make sense?

Right. There's not one answer to that. A few data points can be
meaningful enough to make recommendations from, though more is
generally better.

> I did some testing of the different recommenders on a real data set from a
> bookmarking site. GenericBooleanPrefItemBasedRecommender did not work very
> well for me. It seemed to recommend the top links. Using
> GenericUserBasedRecommender worked way better (after some tweaking), which
> recommended links that actually fit my interests. Might need to do some more
> testing here.

Were you using "compatible" similarity implementations? Pearson is
meaningless on boolean data and you would get poor results.

Or -- there is GenericItemBasedRecommender, which does use ratings,
and Pearson is fine with this implementation.
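
To see why Pearson breaks down on boolean data but copes with signed
ratings, here is a minimal plain-Java sketch of the textbook Pearson
correlation over the items two users have in common. The class name is
hypothetical and this is not Mahout's PearsonCorrelationSimilarity. When
every value is the same (as with boolean preferences), both variances are
zero and the result is undefined.

```java
/** Hypothetical sketch of Pearson correlation on co-rated items. */
final class PearsonSketch {

    /** Pearson correlation of x and y; NaN when either vector has zero variance. */
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        int n = x.length;
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;

        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // 0.0 / 0.0 yields NaN in Java, which signals "undefined" here.
        return covariance / Math.sqrt(varianceX * varianceY);
    }
}
```

Likes and dislikes encoded as +1 and -1 give the vectors some variance, so
the correlation becomes meaningful again.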

> (1) would include categories, which should not be recommended; that's why (2)
> is used to pick the recommendations from. (2) would contain the liked
> items of every user, which includes items that are disliked by other users.
> (3) is for filtering out items that the user has not rated but has been
> presented with before.

I see. Yes it's entirely possible to compute user-user or item-item
similarity on one model, and then apply those similarities to a
recommender based on another model.

(3) doesn't need a DataModel per se, but it does need access to some list
of previously seen items in some form. That's up to you.
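
A minimal sketch of point (3), assuming the previously presented item ids
are simply held in a set: candidates the user has already seen are dropped
before recommendations are returned. The Rescorer interface below is a local
stand-in that mirrors the shape of Mahout's IDRescorer (a rescore method and
an isFiltered method); it is not the real interface, and all names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: filter out items the user has already been shown. */
final class SeenItemFilterSketch {

    /** Local stand-in shaped like Mahout's IDRescorer; not the real interface. */
    interface Rescorer {
        double rescore(long id, double originalScore);
        boolean isFiltered(long id);
    }

    /** Builds a rescorer that filters the given ids and leaves scores unchanged. */
    static Rescorer excluding(long... seenItemIds) {
        Set<Long> seen = new HashSet<>();
        for (long id : seenItemIds) {
            seen.add(id);
        }
        return new Rescorer() {
            @Override
            public double rescore(long id, double originalScore) {
                return originalScore;
            }

            @Override
            public boolean isFiltered(long id) {
                return seen.contains(id);
            }
        };
    }

    /** Returns the candidates that the rescorer does not filter, in order. */
    static List<Long> applyFilter(long[] candidates, Rescorer rescorer) {
        List<Long> kept = new ArrayList<>();
        for (long id : candidates) {
            if (!rescorer.isFiltered(id)) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

The list of seen items can live anywhere (a table, a cache, a session), as
long as the filter can look ids up quickly.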

Re: Lucene for UserSimilarity

Posted by Uwe Reimann <uw...@junocon.de>.

On 24.05.2011 12:31, Uwe Reimann wrote:
> On 24.05.2011 11:39, Sean Owen wrote:
>
>> So you can probably get away with periodically recomputing these,
>> perhaps frequently, but not necessarily at every update.
> I could trigger the recalculation if the knowledge about the current
> user has changed by, say, 25%. That way the recomputing rate would
> decrease.
The same would apply to clusters: those could be recalculated if the total
knowledge has increased by, say, 25%.
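
The 25% rule could be sketched as a small counter, assuming "knowledge" is
measured as the number of preferences recorded since the last recomputation.
The class name and that interpretation are assumptions made for
illustration only.

```java
/** Hypothetical sketch: signal a recompute once data has grown by a given fraction. */
final class RecomputeTriggerSketch {

    private final double growthThreshold; // e.g. 0.25 for 25%
    private int countAtLastRecompute;
    private int currentCount;

    RecomputeTriggerSketch(double growthThreshold) {
        this.growthThreshold = growthThreshold;
    }

    /** Records one new preference and returns true when a recompute is due. */
    boolean recordPreference() {
        currentCount++;
        int added = currentCount - countAtLastRecompute;
        // Always recompute on the very first preference; afterwards only when
        // the number of new preferences reaches the threshold fraction.
        boolean due = countAtLastRecompute == 0
                || added >= growthThreshold * countAtLastRecompute;
        if (due) {
            countAtLastRecompute = currentCount;
        }
        return due;
    }
}
```

With a threshold of 0.25 every one of the first five preferences triggers a
recompute, while later ones trigger it less and less often, which matches
the intuition that early data points matter more.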

Re: Lucene for UserSimilarity

Posted by Uwe Reimann <uw...@junocon.de>.

On 24.05.2011 11:39, Sean Owen wrote:
> On Tue, May 24, 2011 at 10:17 AM, Uwe Reimann <uw...@junocon.de> wrote:
>> Since the user provides new preferences at a high rate, I expect to change
>> the neighborhood of an individual user rapidly. Using CachingUserSimilarity
>> or CachingUserNeighborhood probably won't work here. Using a
>> ClusteringRecommender seems to be an option here in order to search against
>> some clusters instead of against many users. The cluster should be recalculated
>> periodically in the background.
> (You can have the cache clear just entries for the current user.)
>
> Neighborhoods ought to be stable-ish. I would not expect that one new
> data point would significantly change who your most similar users are.
That probably depends on how many data points were available before. I
suspect that, e.g., the 5th data point has a greater impact than the
105th. Is there a lower limit (above 1) on the number of data points a
user must have before recommendations make sense?

> So you can probably get away with periodically recomputing these,
> perhaps frequently, but not necessarily at every update.
I could trigger the recalculation if the knowledge about the current
user has changed by, say, 25%. That way the recomputing rate would decrease.
> You do need to use the latest preferences in recommendation, of
> course, but that's separate from calculating a neighborhood.
>
>
>> Dislikes should be considered during similarity search. I'd like to express
>> those as negative preference values. PearsonCorrelationSimilarity should be
>> ok with that, right?
> Yes.
>
>
>> Since I expect to have very low overlap in items between (especially new)
>> users, I'd like to take the item's category into account during similarity
>> search. User u1, who likes items i1 of category c1 should get item i2 of
>> category c1 recommended if user u2 likes that. Both users would have a
>> preference value for category c1 in common. This should clearly be possible
>> by just providing the calculated preference values for the category items.
> You are describing more of an item-based recommender and indeed I
> think that could be better here since it avoids cold-start problems
> better. (I prefer it as well.) You might look at
> GenericItemBasedRecommender and ItemSimilarity instead.
I did some testing of the different recommenders on a real data set from 
a bookmarking site. GenericBooleanPrefItemBasedRecommender did not work 
very well for me. It seemed to recommend the top links. Using 
GenericUserBasedRecommender worked way better (after some tweaking), 
which recommended links that actually fit my interests. Might need to do 
some more testing here.

> Your thinking about using Lucene almost surely also applies to
> item-item similarity.
>
>
>> I think I need to provide different DataModels to the different stages of
>> recommendation calculation: 1) one which includes likes and dislikes for
>> items and categories for similarity search, 2) one which includes just the
>> liked items to pick the recommendations from and 3) one which includes all
>> items of a user (liked, disliked and skipped ones) for filtering out the
>> user's items using an IDRescorer.
> I think one DataModel is fine. You want to include all data in
> similarity calculations (1). It is also good to have all items
> available in recommendation (2); you don't want to exclude an item
> just because someone didn't like it. And in (3) you do not need to
> filter out items the user has rated; that's done already.
(1) would include categories, which should not be recommended; that's why
(2) is used to pick the recommendations from. (2) would contain the
liked items of every user, which includes items that are disliked by
other users. (3) is for filtering out items that the user has not rated
but has been presented with before.

Re: Lucene for UserSimilarity

Posted by Sean Owen <sr...@gmail.com>.

On Tue, May 24, 2011 at 10:17 AM, Uwe Reimann <uw...@junocon.de> wrote:
> Since the user provides new preferences at a high rate, I expect to change
> the neighborhood of an individual user rapidly. Using CachingUserSimilarity
> or CachingUserNeighborhood probably won't work here. Using a
> ClusteringRecommender seems to be an option here in order to search against
> some clusters instead of against many users. The cluster should be recalculated
> periodically in the background.

(You can have the cache clear just entries for the current user.)

Neighborhoods ought to be stable-ish. I would not expect that one new
data point would significantly change who your most similar users are.
So you can probably get away with periodically recomputing these,
perhaps frequently, but not necessarily at every update.

You do need to use the latest preferences in recommendation, of
course, but that's separate from calculating a neighborhood.

> Dislikes should be considered during similarity search. I'd like to express
> those as negative preference values. PearsonCorrelationSimilarity should be
> ok with that, right?

Yes.

> Since I expect to have very low overlap in items between (especially new)
> users, I'd like to take the item's category into account during similarity
> search. User u1, who likes items i1 of category c1 should get item i2 of
> category c1 recommended if user u2 likes that. Both users would have a
> preference value for category c1 in common. This should clearly be possible
> by just providing the calculated preference values for the category items.

You are describing more of an item-based recommender and indeed I
think that could be better here since it avoids cold-start problems
better. (I prefer it as well.) You might look at
GenericItemBasedRecommender and ItemSimilarity instead.

Your thinking about using Lucene almost surely also applies to
item-item similarity.

> I think I need to provide different DataModels to the different stages of
> recommendation calculation: 1) one which includes likes and dislikes for
> items and categories for similarity search, 2) one which includes just the
> liked items to pick the recommendations from and 3) one which includes all
> items of a user (liked, disliked and skipped ones) for filtering out the
> user's items using an IDRescorer.

I think one DataModel is fine. You want to include all data in
similarity calculations (1). It is also good to have all items
available in recommendation (2); you don't want to exclude an item
just because someone didn't like it. And in (3) you do not need to
filter out items the user has rated; that's done already.

Re: Lucene for UserSimilarity

Posted by Ted Dunning <te...@gmail.com>.

You are correct.  It can be used that way.

I have also used it in the past to store recommendations data that was
computed off-line.  As volume increased we eventually had to move
to alternative systems, but Lucene worked well for a long time.

On Tue, May 24, 2011 at 2:17 AM, Uwe Reimann <uw...@junocon.de> wrote:

> I've used Lucene before and I know it performs quite well. For my use
> case it was fast enough to do instant search. I think the way Lucene works
> is pretty close to finding a user neighborhood in Mahout, so leveraging
> Lucene's power for that came to my mind. I might be wrong here.

Re: Lucene for UserSimilarity

Posted by Uwe Reimann <uw...@junocon.de>.

On 23.05.2011 15:59, Grant Ingersoll wrote:
> On May 23, 2011, at 2:40 AM, Uwe Reimann wrote:
>
>> Hi,
>>
>> I'm currently integrating Mahout's recommendation engine into a site.
>>
>> I'm not quite clear on which DataModel to use. PostgresJdbcDataModel looks handy, but seems to produce way too many queries. ReloadFromJDBCDataModel seems to address that problem but still needs to calculate the similarity of a given user to every other user in the system.
>>
>> Would it be possible and performant to use Lucene to perform the search for the top n most similar users, provided an index exists where the user id is the document id and the preferences of the users are the term vectors?
> It is certainly possible, but I don't know that Term Vectors will give you the performance you are looking for.
>
> You might find http://www.lucidimagination.com/search/document/c82c577e1e28259f/problems_with_itembasedrecommender_with_lucene#c82c577e1e28259f helpful as I think it describes a better way of leveraging Lucene for the problem.   That being said, doesn't Mahout's recommender have the necessary pieces as well to do what you want?
Maybe, just trying to figure that out.

The app suggests items to the user, one at a time. Those suggestions
might come from a recommender. The user likes, dislikes or just skips 
the item and sees the next suggestion.

Since the user provides new preferences at a high rate, I expect to 
change the neighborhood of an individual user rapidly. Using 
CachingUserSimilarity or CachingUserNeighborhood probably won't work 
here. Using a ClusteringRecommender seems to be an option here in order
to search against some clusters instead of against many users. The cluster
should be recalculated periodically in the background.

Dislikes should be considered during similarity search. I'd like to 
express those as negative preference values. 
PearsonCorrelationSimilarity should be ok with that, right?

Since I expect to have very low overlap in items between (especially 
new) users, I'd like to take the item's category into account during 
similarity search. User u1, who likes items i1 of category c1 should get 
item i2 of category c1 recommended if user u2 likes that. Both users 
would have a preference value for category c1 in common. This should 
clearly be possible by just providing the calculated preference values 
for the category items.
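
As a rough sketch of how such category preferences might be calculated,
the plain-Java snippet below averages a user's item preferences per
category. Averaging is only one assumed choice, the names are hypothetical,
and category ids would need to be kept distinct from item ids if both are
stored in the same model.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: derive per-category preferences from item preferences. */
final class CategoryPreferenceSketch {

    /**
     * Averages one user's item preferences by category.
     * Items without a known category are skipped.
     */
    static Map<Long, Double> categoryPreferences(Map<Long, Double> itemPrefs,
                                                 Map<Long, Long> itemToCategory) {
        Map<Long, Double> sums = new HashMap<>();
        Map<Long, Integer> counts = new HashMap<>();
        for (Map.Entry<Long, Double> e : itemPrefs.entrySet()) {
            Long category = itemToCategory.get(e.getKey());
            if (category == null) {
                continue;
            }
            sums.merge(category, e.getValue(), Double::sum);
            counts.merge(category, 1, Integer::sum);
        }
        Map<Long, Double> averages = new HashMap<>();
        for (Map.Entry<Long, Double> e : sums.entrySet()) {
            averages.put(e.getKey(), e.getValue() / counts.get(e.getKey()));
        }
        return averages;
    }
}
```

Two users with no items in common can then still overlap on a category
entry, which gives the similarity measure something to work with.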

I think I need to provide different DataModels to the different stages 
of recommendation calculation: 1) one which includes likes and dislikes
for items and categories for similarity search, 2) one which includes 
just the liked items to pick the recommendations from and 3) one which 
includes all items of a user (liked, disliked and skipped ones) for 
filtering out the user's items using an IDRescorer.

I've used Lucene before and I know it performs quite well. For my
use case it was fast enough to do instant search. I think the way Lucene
works is pretty close to finding a user neighborhood in Mahout, so
leveraging Lucene's power for that came to my mind. I might be wrong here.

Regards, Uwe

> -Grant

Re: Lucene for UserSimilarity

Posted by Grant Ingersoll <gs...@apache.org>.

On May 23, 2011, at 2:40 AM, Uwe Reimann wrote:

> Hi,
> 
> I'm currently integrating Mahout's recommendation engine into a site.
>
> I'm not quite clear on which DataModel to use. PostgresJdbcDataModel looks handy, but seems to produce way too many queries. ReloadFromJDBCDataModel seems to address that problem but still needs to calculate the similarity of a given user to every other user in the system.
>
> Would it be possible and performant to use Lucene to perform the search for the top n most similar users, provided an index exists where the user id is the document id and the preferences of the users are the term vectors?

It is certainly possible, but I don't know that Term Vectors will give you the performance you are looking for.

You might find http://www.lucidimagination.com/search/document/c82c577e1e28259f/problems_with_itembasedrecommender_with_lucene#c82c577e1e28259f helpful as I think it describes a better way of leveraging Lucene for the problem. That being said, doesn't Mahout's recommender have the necessary pieces as well to do what you want?

-Grant

Re: Lucene for UserSimilarity

Posted by Sean Owen <sr...@gmail.com>.

Yes, I imagine that's possible by building an implementation of
UserSimilarity. I think you would want CachingUserSimilarity on top to
cache values.
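
The caching idea can be sketched in plain Java as a wrapper that memoizes
pairwise similarities and can drop the entries of a single user when that
user's preferences change. This is not Mahout's CachingUserSimilarity; the
PairSimilarity interface and all names here are hypothetical, and the
delegate could be backed by anything, including a Lucene lookup.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: memoize a symmetric user-user similarity function. */
final class CachingSimilaritySketch {

    /** Local stand-in for an expensive similarity computation. */
    interface PairSimilarity {
        double similarity(long userA, long userB);
    }

    private final PairSimilarity delegate;
    private final Map<String, Double> cache = new HashMap<>();

    CachingSimilaritySketch(PairSimilarity delegate) {
        this.delegate = delegate;
    }

    /** Returns the cached value if present, otherwise computes and stores it. */
    double similarity(long userA, long userB) {
        // Order the ids so (a, b) and (b, a) share one cache entry.
        String key = Math.min(userA, userB) + ":" + Math.max(userA, userB);
        return cache.computeIfAbsent(key, k -> delegate.similarity(userA, userB));
    }

    /** Drops every cached pair that involves the given user. */
    void clearUser(long userId) {
        String prefix = userId + ":";
        String suffix = ":" + userId;
        cache.keySet().removeIf(k -> k.startsWith(prefix) || k.endsWith(suffix));
    }

    int size() {
        return cache.size();
    }
}
```

Clearing only one user's entries keeps the rest of the cache warm while
that user's neighborhood is recomputed.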

On Mon, May 23, 2011 at 10:40 AM, Uwe Reimann <uw...@junocon.de> wrote:
> Hi,
>
> I'm currently integrating Mahout's recommendation engine into a site.
>
> I'm not quite clear on which DataModel to use. PostgresJdbcDataModel looks
> handy, but seems to produce way too many queries. ReloadFromJDBCDataModel
> seems to address that problem but still needs to calculate the similarity of
> a given user to every other user in the system.
>
> Would it be possible and performant to use Lucene to perform the search for
> the top n most similar users, provided an index exists where the user id is
> the document id and the preferences of the users are the term vectors?
>
> If that's possible, would it additionally be possible to use negative values
> in the term vector for recording dislikes of the user?
>
> Best regards,
>
> Uwe