Posted to user@mahout.apache.org by Gruszowska Natalia <Na...@grupaonet.pl> on 2014/12/10 16:40:43 UTC

Collaborative filtering item-based in mahout - without isolating users

Hi All,

Mahout implements a method for item-based collaborative filtering called itemsimilarity, which returns the "similarity" between each pair of items.
In theory, the similarity between two items should be calculated only over the users who rated both items. During testing I realized that Mahout works differently.
Two examples follow.

Example 1. Items 11 and 12.
In the example below, the similarity between items 11 and 12 should equal 1, but Mahout's output is 0.36. It looks like Mahout treats null as 0.
Similarity between items:
101     102     0.36602540378443865

Matrix with preferences:
            11       12
1                     1
2                     1
3           1         1
4                     1
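
For reference, the reported value can be reproduced by hand. The sketch below is plain Python, not Mahout code. It assumes a Euclidean-distance similarity of the form 1 / (1 + d); that form is an assumption that happens to match the number above, since the post does not say which similarity measure was used. The function and variable names are made up for this illustration. It computes the value once with missing ratings counted as 0 and once over co-rated users only.

```python
import math

# Ratings from Example 1: user -> rating; a missing key means "no rating".
item_11 = {3: 1.0}
item_12 = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}

def euclidean_similarity(a, b, users):
    """Return 1 / (1 + d) over the given users; missing ratings count as 0."""
    d = math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in users))
    return 1.0 / (1.0 + d)

all_users = set(item_11) | set(item_12)   # users 1, 2, 3, 4
co_rated = set(item_11) & set(item_12)    # user 3 only

sim_missing_as_zero = euclidean_similarity(item_11, item_12, all_users)
sim_co_rated_only = euclidean_similarity(item_11, item_12, co_rated)

print(sim_missing_as_zero)  # about 0.366, the value reported above
print(sim_co_rated_only)    # 1.0, the value expected from the theory
```

Under this assumption, the gap between 0.36 and 1 comes entirely from the three users who rated item 12 but not item 11.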

Example 2. Items 101-103.
The similarity between items 101 and 102 should be calculated using only the ratings from users 4 and 5, and the same holds for items 101 and 103 (according to the theory). Here (101,103) comes out more similar than (101,102), and it shouldn't.
Similarity between items:
101     102     0.2612038749637414
101     103     0.4340578302732228
102     103     0.2600070276638468

Matrix with preferences:
            101      102        103
1                     1         0.1
2                     1         0.1
3                     1         0.1
4           1         1         0.1
5           1         1         0.1
6                     1         0.1
7                     1         0.1
8                     1         0.1
9                     1         0.1
10                    1         0.1
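
The three values above can be checked the same way. This is again a plain-Python sketch under the assumption of a 1 / (1 + d) Euclidean similarity (not confirmed by the post, and not Mahout code; all names are illustrative). It compares "missing counts as 0" against "co-rated users only" for each pair.

```python
import math

users = range(1, 11)
# Ratings from Example 2: user -> rating; a missing key means "no rating".
item_101 = {4: 1.0, 5: 1.0}
item_102 = {u: 1.0 for u in users}
item_103 = {u: 0.1 for u in users}

def euclidean_similarity(a, b, over):
    """Return 1 / (1 + d) over the users in `over`; missing ratings count as 0."""
    d = math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in over))
    return 1.0 / (1.0 + d)

pairs = {
    ("101", "102"): (item_101, item_102),
    ("101", "103"): (item_101, item_103),
    ("102", "103"): (item_102, item_103),
}

# Variant 1: every user is a dimension, missing ratings are 0.
full = {k: euclidean_similarity(a, b, users) for k, (a, b) in pairs.items()}
# Variant 2: only users who rated both items are considered.
co_rated = {
    k: euclidean_similarity(a, b, set(a) & set(b)) for k, (a, b) in pairs.items()
}

for key in pairs:
    print(key, round(full[key], 4), round(co_rated[key], 4))
```

With missing ratings counted as 0, the pair (101,103) scores higher than (101,102), as in the output above. Restricted to co-rated users, (101,102) scores 1.0 and (101,103) about 0.44, which is the ordering expected from the theory.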


Both examples were run without any additional parameters.
Has this problem been solved somewhere, somehow? Any ideas? Why is null treated as 0?
Source: http://files.grouplens.org/papers/www10_sarwar.pdf



Kind regards,
Natalia Gruszowska



Re: itemsimilarity - maxPrefs parameter

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Increase the number to Integer.MAX_VALUE or to the larger of your number of users or items. The "or" means that the rows and columns are both downsampled to that number or fewer.

To use all data you will also have to increase --maxSimilaritiesPerItem.

There are two matrices in the Hadoop itemsimilarity job. The input is A, with one row per user listing each item the user has interacted with. From this, AtA is calculated as the output, using LLR instead of actual matrix multiplication. This yields an AtA whose values are weighted by LLR strength. --maxSimilaritiesPerItem will further limit the values here to no more than that number. There is also a quality threshold, which is pretty difficult to use.
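
To make the A and AtA notation concrete, here is a small plain-Python sketch of the raw cooccurrence product, using the data of Example 1 from the original question as a binary interaction matrix. This shows only the plain matrix product; as described above, the Hadoop job weights the result with LLR rather than using these raw counts. The function name is hypothetical.

```python
# Rows are users 1-4, columns are items 11 and 12 (Example 1 in this thread).
# A 1 means the user interacted with the item.
A = [
    [0, 1],
    [0, 1],
    [1, 1],
    [0, 1],
]

def cooccurrence(a):
    """Return A^T A as a list of lists (item-by-item cooccurrence counts)."""
    n_items = len(a[0])
    return [
        [sum(row[i] * row[j] for row in a) for j in range(n_items)]
        for i in range(n_items)
    ]

ata = cooccurrence(A)
print(ata)  # [[1, 1], [1, 4]]
```

The diagonal holds each item's interaction count, and the off-diagonal entry is the number of users who interacted with both items.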

If you remove all of these downsampling parameters you will approach O(n^2) runtime; if you use them you will have O(n). You will also get rapidly diminishing returns by removing downsampling.
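
The idea of a per-user cap can be sketched in a few lines of Python. This is not Mahout's implementation: the function name, the uniform random sampling and the seed parameter are all assumptions made for illustration.

```python
import random

def cap_preferences(prefs_by_user, max_prefs, seed=0):
    """Keep at most max_prefs items per user, sampling uniformly at random.

    Hypothetical helper for illustration; Mahout's sampling may differ.
    """
    rng = random.Random(seed)
    capped = {}
    for user, items in prefs_by_user.items():
        if len(items) > max_prefs:
            capped[user] = rng.sample(items, max_prefs)
        else:
            capped[user] = list(items)
    return capped

prefs = {"U_1": list(range(1000)), "U_2": [1, 2, 3]}
capped = cap_preferences(prefs, max_prefs=500)
print(len(capped["U_1"]), len(capped["U_2"]))  # 500 3
```

A user with k preferences contributes k(k-1)/2 item pairs, so capping a user at 500 instead of 1000 preferences cuts that user's pairs from 499500 to 124750.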

The indicator matrix will have arbitrarily many similar items of diminishing strength; some could be nearly useless. This potentially large vector may be unwieldy in your other calculations, and it has not had low-value similar items filtered out.

The bottom line is that the downsampling can be tweaked, but removing it altogether is not likely to be a good thing.






itemsimilarity - maxPrefs parameter

Posted by Gruszowska Natalia <Na...@grupaonet.pl>.
Hi All, 

The itemsimilarity method has a parameter like this:

--maxPrefs (-mppu) maxPrefs                               max number of
                                                          preferences to
                                                          consider per user or
                                                          item, users or items
                                                          with more preferences
                                                          will be sampled down
                                                          (default: 500)

How does it work exactly?
If I have 5 million users and 5000 items and I run itemsimilarity with the default maxPrefs, does it consider only 500 ratings from those 5 million, or what? Is it sampling? What can I do to force the calculation to use all input data?

			M1   M2   M3 .... M5000
U_1
U_2
...
U_5000000

What does "or" mean in the definition:
"max number of preferences to consider per user or item"


Thanks in advance
Natalia
                                                          
                                                          

Re: Collaborative filtering item-based in mahout - without isolating users

Posted by Ted Dunning <te...@gmail.com>.
Natalia,

It sounds like you are starting from the assumption that ratings are being
done.

This can happen, but in production recommendation settings, ratings are
typically a very low-value input because the meaning of a rating is very
complex and because so few users actually give ratings unless forced into
unnatural acts.

Instead, you typically wind up using other kinds of actions.  If you do use
ratings, it is often better to ignore the value of the rating and use the
mere fact of the rating.  It is also common to assume that all users
*could* have interacted with any item even if they didn't.  This assumption
is suspect, but it is better than assuming that lack of interaction really
means lack of opportunity.

Adjusting your assumptions to fit these leads, I think, to the approach
used by Mahout.




Re: Collaborative filtering item-based in mahout - without isolating users

Posted by Pat Ferrel <pa...@occamsmachete.com>.
With LLR, ratings are ignored. It is only interested in whether there was an interaction between the user and the item. LLR calculates its own weights based on a probabilistic measure of cooccurrence importance. Cooccurrences are all it looks at, so 0 is ignored; it does not indicate a negative preference, it means any preference is undefined or non-existent. In fact, those implied 0s in a particular user’s history are exactly where recommendations will come from, since we don’t want to recommend something the user already knows about.

The root of your question is a bit hard to explain since it requires a knowledge of cooccurrence recommenders and the LLR calculation itself. So you can read these for more explanation:
A short ebook here that talks about LLR: https://www.mapr.com/practical-machine-learning
a blog post here: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
wikipedia: http://en.wikipedia.org/wiki/Likelihood_function
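
As a rough illustration of the LLR calculation discussed in those references, here is a minimal Python sketch of the log-likelihood ratio (G^2) for a 2x2 cooccurrence table. It is not taken from Mahout's source, and Mahout's own code may be organized differently. The names k11, k12, k21 and k22 are a conventional labeling assumed here: k11 counts users with both items, k12 and k21 users with only one of them, and k22 users with neither.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of cooccurrence counts."""
    n = k11 + k12 + k21 + k22
    row = (k11 + k12, k21 + k22)
    col = (k11 + k21, k12 + k22)
    cells = (
        (k11, row[0], col[0]),
        (k12, row[0], col[1]),
        (k21, row[1], col[0]),
        (k22, row[1], col[1]),
    )
    total = 0.0
    for k, r, c in cells:
        if k > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            total += k * math.log(k * n / (r * c))
    return 2.0 * total

# Example 1 from this thread: one user has both items 11 and 12,
# three users have only item 12, nobody has only item 11 or neither.
print(llr(1, 0, 3, 0))    # 0.0

# A strongly associated pair, for contrast.
print(llr(10, 0, 0, 10))  # about 27.73
```

For Example 1 the score is 0: because every user has item 12, seeing it together with item 11 carries no information, and only the presence or absence of an interaction enters the calculation.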

The Spark version “spark-itemsimilarity” can take in multiple actions/events, calculate a cross-cooccurrence with the primary action to determine the strength of correlation, and use the secondary data to improve recs. This is a better way to handle thumbs up/thumbs down or other user actions in a recommender, since it automatically determines correlation strength instead of relying on user- or developer-supplied weights.

Ratings are often problematic: people rate on different scales at different times on different subjects. Many algorithms have been proposed to deal with this, but most new research deals with optimizing the ranking order of recommendations, which is usually more important in the application.



RE: Collaborative filtering item-based in mahout - without isolating users

Posted by Gruszowska Natalia <Na...@grupaonet.pl>.
To be honest, I haven't seen the code of this similarity (have you?). But as I see it, it ignores the other side, this time popular items, and additionally it looks like it ignores the value of the rating: it has only 1 or 0.

N.


Re: Collaborative filtering item-based in mahout - without isolating users

Posted by ma...@gmail.com.
> otherwise we recommend only very popular items

This is why you have the log-likelihood ratio, right?
m


RE: Collaborative filtering item-based in mahout - without isolating users

Posted by Gruszowska Natalia <Na...@grupaonet.pl>.
Mario,
I am thinking in terms of correctness. With similarities like Euclidean, Pearson correlation or cosine similarity, results are better if we consider only common users (users who rated both compared items). This assumption lets us find similar items for unpopular ones; otherwise we recommend only very popular items. For my data that is unacceptable.

"But if you take, for example, the cosine similarity, you shouldn't throw away the data." - you should; it results in dimension reduction, and that is good. Everything is still in the same space, but for each pair the space is reduced.

My question is: why did whoever wrote this code ignore such an important assumption? Was it by accident, or for some important reason like effectiveness or computational complexity?
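
The two positions on cosine similarity can be compared numerically on Example 1. The following is a plain-Python sketch, not Mahout code, and the names are illustrative. It computes the cosine once over all four users (missing ratings as 0, the "items live in the full user space" view) and once over co-rated users only (the "reduced space per pair" view).

```python
import math

# Ratings from Example 1: user -> rating; a missing key means "no rating".
item_11 = {3: 1.0}
item_12 = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}

def cosine(a, b, users):
    """Cosine of the two rating vectors restricted to `users`.

    Missing ratings count as 0. Assumes both restricted vectors are non-zero.
    """
    dot = sum(a.get(u, 0.0) * b.get(u, 0.0) for u in users)
    norm_a = math.sqrt(sum(a.get(u, 0.0) ** 2 for u in users))
    norm_b = math.sqrt(sum(b.get(u, 0.0) ** 2 for u in users))
    return dot / (norm_a * norm_b)

all_users = set(item_11) | set(item_12)
co_rated = set(item_11) & set(item_12)

cos_full_space = cosine(item_11, item_12, all_users)
cos_co_rated = cosine(item_11, item_12, co_rated)

print(cos_full_space)  # 0.5
print(cos_co_rated)    # 1.0
```

Over the full user space the popular item's extra ratings pull the cosine down to 0.5; restricted to the single co-rated user it is 1.0, which rests on one data point.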


Natalia


-----Original Message-----
From: mario.alemi@gmail.com [mailto:mario.alemi@gmail.com] 
Sent: Wednesday, December 10, 2014 7:05 PM
To: user@mahout.apache.org
Subject: Re: Collaborative filtering item-based in mahout - without isolating users

Hi Natalia

Regarding example 1, if you think in terms of the likelihood that the two products have been bought together because they are similar (as opposed to by chance), the similarity is undefined. As everyone buys 12, of course the person who bought 11 also bought 12, right?

This is the case if you compute the similarity through a co-occurrence matrix (and log-likelihood ratio).

But you say "In the theory, similarity between two items should be calculated only for users who ranked both items".

I guess you mean: "Users [1,2,4] don't know about item 11, therefore they do not collaborate in building the similarity between the two items. User [3], on the contrary, does, and gives the same rating to the two products, therefore the similarity is 1".

But if you take, for example, the cosine similarity, you shouldn't throw away the data. Here, you build a space with four dimensions: the ratings of four users. You can't say product 11 is in another space when it relates to users 1, 2 and 4 because it hasn't been rated by those users. They are all there. They are dimensions, like in physics. Therefore you must use this information too. Items are in the user space... all of them.

Even intuitively, items 11 and 12 are not similar at all: one has been bought by every customer, the other by just one customer. How could you tell the next customer who buys 12 (everyone does...) that she would really like 11...?

Mario


