Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/10/29 12:19:43 UTC

Re: Offline computation of item-item similarities

On Wed, Oct 28, 2009 at 3:06 PM, Gökhan Çapan <gk...@gmail.com> wrote:
> similarities at once. For example, loglikelihood similarity is computed
> from 4 quantities for two items:
> item1item2   : # of users who rated both item1 and item2
> !item1item2  : # of users who rated item2 but not item1
> item1!item2  : # of users who rated item1 but not item2
> !item1!item2 : # of users who didn't rate item1 or item2
> It is easy to find these quantities with mapreduce. When I finish
> implementing it, I will share it as a patch.

Yes, you can easily compute these offline, and then use them in an
alternate implementation of LogLikelihoodSimilarity to produce fast
results. The only possible issue I see is loading all these values
into memory, since the number of pairs grows as the square of the
number of items.
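
As a sketch of the mapreduce side -- class names here are hypothetical,
the input is assumed to be one "userID<TAB>itemID" line per rating, and
a second, standard sum-reducer pass over the emitted pairs is omitted:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class CoRatedPairs {

    // Map: "userID<TAB>itemID" line  ->  (userID, itemID)
    public static class RatingMapper
        extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        ctx.write(new Text(fields[0]), new Text(fields[1]));
      }
    }

    // Reduce: for each user, emit every unordered co-rated item pair once.
    public static class PairReducer
        extends Reducer<Text, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      @Override
      protected void reduce(Text user, Iterable<Text> items, Context ctx)
          throws IOException, InterruptedException {
        List<String> rated = new ArrayList<String>();
        for (Text item : items) {
          rated.add(item.toString());
        }
        for (int i = 0; i < rated.size(); i++) {
          for (int j = i + 1; j < rated.size(); j++) {
            String a = rated.get(i);
            String b = rated.get(j);
            // Canonical order so (a,b) and (b,a) sum into one count.
            ctx.write(new Text(a.compareTo(b) < 0 ? a + ',' + b : b + ',' + a),
                      ONE);
          }
        }
      }
    }
  }

Given item1item2 from that, plus the per-item rating counts and N, the
other three quantities fall out by subtraction: item1!item2 =
count(item1) - item1item2, and so on.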

You could store only the most popular pairs in memory and fall back
to a normal computation when a pair hasn't been precomputed. This
amounts to caching, and might be done just as easily by writing a
caching wrapper around LogLikelihoodSimilarity based on the "Cache"
class, which does a nice job of limiting memory usage and evicting
unpopular entries.
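
Roughly like this -- a sketch only, assuming the long-ID ItemSimilarity
interface and the Cache/Retriever signatures as I remember them;
LongPair is just a hypothetical helper key:

  import java.util.Collection;
  import org.apache.mahout.cf.taste.common.Refreshable;
  import org.apache.mahout.cf.taste.common.TasteException;
  import org.apache.mahout.cf.taste.impl.common.Cache;
  import org.apache.mahout.cf.taste.impl.common.Retriever;
  import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

  public final class CachingSimilarity implements ItemSimilarity {

    private final Cache<LongPair,Double> cache;

    public CachingSimilarity(final ItemSimilarity delegate, int maxEntries) {
      // On a cache miss, fall back to the normal (delegate) computation.
      this.cache = new Cache<LongPair,Double>(new Retriever<LongPair,Double>() {
        public Double get(LongPair key) throws TasteException {
          return delegate.itemSimilarity(key.a, key.b);
        }
      }, maxEntries);
    }

    public double itemSimilarity(long item1, long item2) throws TasteException {
      // Canonical order so (a,b) and (b,a) share one cache entry.
      return cache.get(item1 <= item2
          ? new LongPair(item1, item2) : new LongPair(item2, item1));
    }

    public void refresh(Collection<Refreshable> alreadyRefreshed) {
      cache.clear();
    }

    // Hypothetical minimal pair key.
    private static final class LongPair {
      final long a;
      final long b;
      LongPair(long a, long b) { this.a = a; this.b = b; }
      @Override public boolean equals(Object o) {
        if (!(o instanceof LongPair)) return false;
        LongPair p = (LongPair) o;
        return a == p.a && b == p.b;
      }
      @Override public int hashCode() { return (int) (31 * a + b); }
    }
  }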


> There is also another measure that we talked about in our discussion. Ted
> said dot product can yield very good results and it can be weighted with
> inverse user frequency.
> I've implemented 3 versions of that approach to see how good the results
> are:

Agreed, this is basically what PearsonCorrelationSimilarity does --
really it's also an implementation of the cosine-measure similarity,
since the data is 'centered' to a mean of zero. I believe this is
what Ted is referring to by dot product. The framework also already
has a transformation for inverse user frequency.
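
To spell that out: with item vectors x and y over users,

  \[
  \mathrm{corr}(x,y)
    = \frac{\sum_u (x_u-\bar{x})(y_u-\bar{y})}
           {\sqrt{\sum_u (x_u-\bar{x})^2}\,\sqrt{\sum_u (y_u-\bar{y})^2}}
    \;\overset{\bar{x}=\bar{y}=0}{=}\;
    \frac{x\cdot y}{\lVert x\rVert\,\lVert y\rVert}
    = \cos(x,y)
  \]

so once each vector is centered, Pearson, cosine, and the normalized
dot product are all the same number.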

But yeah, I think your point is about doing this at scale, all at
once, with a big matrix operation, via mapreduce, for all pairs of
items. I agree this is a useful thing. When the results are
precomputed this way, the on-line similarity computation becomes very
fast, and you could feed them into any item-based recommender.
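
For instance (a sketch; parsing the precomputed scores is left as a
stub, and the file names are made up):

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;

  public class PrecomputedExample {
    public static void main(String[] args) throws Exception {
      DataModel model = new FileDataModel(new File("ratings.csv"));

      // Parse the mapreduce output, one (item1, item2, score) per line.
      List<GenericItemSimilarity.ItemItemSimilarity> precomputed =
          new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
      // for each parsed line:
      //   precomputed.add(
      //       new GenericItemSimilarity.ItemItemSimilarity(item1, item2, score));

      GenericItemSimilarity similarity = new GenericItemSimilarity(precomputed);
      Recommender recommender = new GenericItemBasedRecommender(model, similarity);
      List<RecommendedItem> top = recommender.recommend(42L, 10);
      System.out.println(top);
    }
  }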

Or you could rewrite the rest of the recommender process in mapreduce
and continue that way.

> First, I have worked on a 1-month user log of an e-commerce site. The
> data set is very sparse, but there are some very popular items that
> don't fit the general characteristics of the data.
>
> I've tried 3 versions of that computation:
> 1- iuf = log(N / (1 + uf))  ; N is the total number of users here

Yep

> score(item1, item2) = cosine similarity of the item1 and item2 vectors,
> whose elements are weighted with the iuf described above; the dot
> product of the two items is normalized by the Euclidean lengths of the
> items (i.e. score = dotproduct / (length(item1) * length(item2)))

PS yeah this is the cosine measure similarity, and also reduces to
Pearson when the data has mean zero.
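
Concretely -- a sketch of variant 1, reading uf as the number of
ratings a given user has made (that is how I understand the
per-element weighting); variants 2 and 3 would just return the raw dot
product:

  public final class IufCosine {
    /**
     * item1/item2 are one item's rating column each, aligned over the
     * same users; 0.0 stands for "no rating" in this sketch.
     */
    public static double similarity(double[] item1, double[] item2,
                                    int[] userRatingCounts, int numUsers) {
      double dot = 0.0;
      double norm1 = 0.0;
      double norm2 = 0.0;
      for (int u = 0; u < item1.length; u++) {
        // iuf weight: down-weights very active users.
        double w = Math.log((double) numUsers / (1 + userRatingCounts[u]));
        double a = w * item1[u];
        double b = w * item2[u];
        dot += a * b;
        norm1 += a * a;
        norm2 += b * b;
      }
      // Variant 1: normalize by the Euclidean lengths (cosine).
      return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
  }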


> 2- iuf is the same as in 1, and the score is just the dot product

I think it's problematic to not normalize the dot product. I wouldn't
expect this to work well, but let's see your results below --


> 3- iuf = N / (1 + uf), and the score is just the dot product

Valid, but I do think the log() is useful to keep here.


> As a result, I think it would be good if we found a way to smooth the
> item vectors to make the 1st way yield better results, for example by
> filling in some missing data. And it would be good if we had a way to
> merge the 2nd and 3rd ways -- I mean, another similarity measure that
> finds hidden, hard-to-find similar items like the 2nd way, and gives
> results like the 3rd way as well.

"Filling missing data" -- yes this is what the "PreferenceInferrer"
does in the framework. I am not convinced it is very useful, but lets
you synthesize missing data.
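
For reference, wiring it in looks roughly like this
(AveragingPreferenceInferrer is the stock implementation, if memory
serves; note the inferrer hangs off user-user similarity):

  import java.io.File;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class InferrerExample {
    public static void main(String[] args) throws Exception {
      DataModel model = new FileDataModel(new File("ratings.csv"));
      UserSimilarity pearson = new PearsonCorrelationSimilarity(model);
      // Synthesizes a missing preference as the user's mean rating.
      pearson.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
    }
  }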

I also suspect 1 is the way to go. It is definitely the conventional
approach and the one already implemented by the framework.

Re: Offline computation of item-item similarities

Posted by Gökhan Çapan <gk...@gmail.com>.
> Yes, you can easily compute these offline, and then use them in an
> alternate implementation of LogLikelihoodSimilarity to produce fast
> results. The only possible issue I see is loading all these values
> into memory, since the number of pairs grows as the square of the
> number of items.
>
> You could store only the most popular pairs in memory and fall back
> to a normal computation when a pair hasn't been precomputed. This
> amounts to caching, and might be done just as easily by writing a
> caching wrapper around LogLikelihoodSimilarity based on the "Cache"
> class, which does a nice job of limiting memory usage and evicting
> unpopular entries.


That sounds interesting.

As for your comments, I should say it is very application-dependent. I
used to think the same as you, but after some experiments I saw that
they all give different, interesting results depending on the data
set. Thanks for sharing your ideas.

-- 
Gökhan Çapan