Posted to user@mahout.apache.org by Matthew Runo <ma...@gmail.com> on 2011/02/07 19:33:54 UTC

Item Similarity Calculations

Hello folks -

It's time for another "question that shouldn't be a question" from Matthew!

I saw in the javadoc that when implementing item-based recommenders
using GenericItemBasedRecommender, we're supposed to precompute all
the item-item similarities rather than computing something like
UncenteredCosineSimilarity on the fly.

While I can appreciate the time savings of precomputed similarities,
and the control you get from computing them yourself... wouldn't this
lead to an upkeep nightmare in the long run? Every time I add
something, I have to compare the new item against everything else.
Removing is, I suppose, somewhat easier, since we just delete the
entries and refresh the DataModel.

Every time a new item is added, or an old item is removed, everything
needs to be adjusted - right? Say we have a system with 150,000 items
- that's 150,000^2 = 22,500,000,000 rows going into the similarity
table (assuming a dumb loop-within-a-loop calculator). It would take a
beefy database server just to serve that one table quickly for the
system.

Does it make sense to leave out entries for items that are totally
dissimilar? I assume that might hurt the "long tail" of
recommendations... Are there other optimizations that I'm just not
seeing?

When computing item similarity, there seem to be two schools of
thought: one computes similarity from users' preferences for the items
(collaborative), the other from the items' own attributes
(content-based). I assume that if I wanted to go the item-attributes
route, I'd be getting into the clustering algorithms, and that in that
case I would build clusters from the item attributes and go from there?

I'm sorry for the basic questions. Hopefully they have basic answers,
and hopefully this will all help other people who, like me, are not
experts in statistics and linear algebra...

I really appreciate any and all help,

--Matthew Runo

Re: Item Similarity Calculations

Posted by Sean Owen <sr...@gmail.com>.
That recommendation in the javadoc may be a little too strong. As you
say, it's really a trade-off between memory consumption, run-time speed,
and the cost of updates. At a certain scale, the precomputed similarities
simply won't fit in memory.

You can completely precompute with GenericItemSimilarity.
You can wrap an implementation in CachingItemSimilarity so that it at
least remembers what it has already computed, rather than precomputing
everything up front.
You can just use an ItemSimilarity directly -- slowest in theory, but
with no memory overhead.
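
To make the trade-off concrete, here's a minimal sketch of all three
wirings against the Taste API (class names as of roughly Mahout 0.4/0.5;
the prefs.csv path and the user ID are placeholders):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarityWiring {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv")); // placeholder path

    // Option 1: precompute every item-item pair once, hold it all in memory.
    ItemSimilarity precomputed =
        new GenericItemSimilarity(new UncenteredCosineSimilarity(model), model);

    // Option 2: compute lazily, caching each item-item value after first use.
    ItemSimilarity cached =
        new CachingItemSimilarity(new UncenteredCosineSimilarity(model), model);

    // Option 3: compute on the fly every time -- no memory overhead at all.
    ItemSimilarity onTheFly = new UncenteredCosineSimilarity(model);

    Recommender recommender = new GenericItemBasedRecommender(model, cached);
    System.out.println(recommender.recommend(123L, 5)); // top 5 for user 123
  }
}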


Pruning can be a great thing. You want to prune the item-item similarities
that carry the least information first. That is not necessarily a function
of the similarity value, though for certain algorithms it can be. For
example, when using a similarity based on the Pearson correlation (in
essence a mean-centered cosine similarity), a value near 0.0 does suggest
the item-item pair carries no information. But a 0.0 does not always mean
this -- it depends on the computation. More generally, item-item
similarities based on fewer data points are probably less reliable and
therefore better to prune first.
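
One way to act on that last point, sketched below against the Taste API
(SupportPruning and pruneBySupport are made-up names, and the threshold
of 5 co-rating users is an arbitrary assumption): keep only the pairs
backed by enough overlapping users, and hand the survivors to
GenericItemSimilarity. Its maxToKeep constructor argument offers a
cruder built-in cap, by similarity value rather than by support.

import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SupportPruning {

  private static final int MIN_CO_RATING_USERS = 5; // arbitrary support threshold

  public static ItemSimilarity pruneBySupport(DataModel model, ItemSimilarity raw)
      throws Exception {
    long[] itemIDs = new long[model.getNumItems()];
    LongPrimitiveIterator it = model.getItemIDs();
    for (int i = 0; it.hasNext(); i++) {
      itemIDs[i] = it.nextLong();
    }

    List<GenericItemSimilarity.ItemItemSimilarity> kept =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    for (int i = 0; i < itemIDs.length; i++) {
      for (int j = i + 1; j < itemIDs.length; j++) {
        // Pairs co-rated by only a handful of users carry little information; skip them.
        if (model.getNumUsersWithPreferenceFor(itemIDs[i], itemIDs[j]) >= MIN_CO_RATING_USERS) {
          double value = raw.itemSimilarity(itemIDs[i], itemIDs[j]);
          if (!Double.isNaN(value)) {
            kept.add(new GenericItemSimilarity.ItemItemSimilarity(itemIDs[i], itemIDs[j], value));
          }
        }
      }
    }
    // GenericItemSimilarity holds only the surviving pairs in memory.
    return new GenericItemSimilarity(kept);
  }
}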


You don't need clustering to produce an attribute- or content-based notion
of item similarity. Broadly, this notion of similarity is an input to these
processes, not an output.

As far as Mahout goes, the world of recommendations thinks of content-based
item similarity as out of scope. It's something quite domain-specific. You
can plug in your own notion of ItemSimilarity but it's necessarily your own,
for your domain. Clustering goes a little further: if you can somehow map
those attributes to numerical values in a vector, it can compute similarity.
But the selection of features is often, again, domain-specific and up to
you.
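
For illustration, if you already had per-item attribute vectors, a
homemade content-based ItemSimilarity might look like this sketch (the
attribute map and its contents are entirely made up, and the exact set
of methods on the ItemSimilarity interface has varied a little across
Mahout versions):

import java.util.Collection;
import java.util.Map;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/** Cosine similarity over hand-built item attribute vectors (content-based). */
public class AttributeItemSimilarity implements ItemSimilarity {

  private final Map<Long, double[]> attributeVectors; // itemID -> feature vector

  public AttributeItemSimilarity(Map<Long, double[]> attributeVectors) {
    this.attributeVectors = attributeVectors;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) {
    double[] a = attributeVectors.get(itemID1);
    double[] b = attributeVectors.get(itemID2);
    if (a == null || b == null) {
      return Double.NaN; // unknown item: no opinion
    }
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    double denom = Math.sqrt(normA) * Math.sqrt(normB);
    return denom == 0.0 ? Double.NaN : dot / denom;
  }

  @Override
  public double[] itemSimilarities(long itemID1, long[] itemID2s) {
    double[] result = new double[itemID2s.length];
    for (int i = 0; i < itemID2s.length; i++) {
      result[i] = itemSimilarity(itemID1, itemID2s[i]);
    }
    return result;
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // Attribute vectors are static in this sketch; nothing to refresh.
  }
}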



Re: Item Similarity Calculations

Posted by Sebastian Schelter <ss...@apache.org>.
If you can live with recomputing the similarities only every once in a
while, you can use
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob to run the
computation on Hadoop.

The result of this job can be put into a .txt file and loaded into an
ItemBasedRecommender via
o.a.m.cf.taste.impl.similarity.file.FileItemSimilarity. The Hadoop job
also offers an option to keep only the top n similar items per item,
which gives you control over the result size.
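
For reference, the loading side might look like this minimal sketch
(file names are placeholders; similarities.txt would hold one
itemID1,itemID2,similarity line per pair, copied out of the job's
output directory):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class PrecomputedRecommender {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv")); // placeholder path
    // Similarities were computed offline by the Hadoop job; nothing is
    // recomputed at serving time.
    ItemSimilarity similarity = new FileItemSimilarity(new File("similarities.txt"));
    Recommender recommender = new GenericItemBasedRecommender(model, similarity);
    System.out.println(recommender.recommend(123L, 10)); // top 10 for user 123
  }
}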

--sebastian

