Posted to user@mahout.apache.org by aishsesh <ai...@bloomreach.com> on 2014/09/10 22:55:02 UTC

LogLikelihoodSimilarity calculation

Hi,

I have the following case where numItems = 1,000,000, prefs1Size = 900,000
and prefs2Size = 100.

It is the case when I have two users, one who has seen 90% of the movies in
the database and another only 100 of the million movies. Suppose they have
90 movies in common (user 2 has seen only 100 movies in total); I would
expect the similarity to be high compared to when they have only 10 movies
in common. But the similarities I am getting are
0.9971 for intersection size 10 and
0 for intersection size 90.

This seems counterintuitive.

Am I missing something? Is there an explanation for the values above?




Re: LogLikelihoodSimilarity calculation

Posted by Pat Ferrel <pa...@gmail.com>.
Actually this is a great example of why LLR works. Intuitively speaking, what do you know about the taste of someone who prefers 90% of items? Nothing; they watch them all. What value do the cooccurrences in their movie watching carry? None. In fact, I’d look to see whether prefs1 isn’t an anomaly caused by a crawler or something. Intuitively speaking, LLR found the non-differentiating user and properly ignored them.


On Sep 10, 2014, at 8:43 PM, Ted Dunning <te...@gmail.com> wrote:

[...]


Re: LogLikelihoodSimilarity calculation

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Sep 19, 2014 at 3:29 AM, <ma...@gmail.com> wrote:

> I see that the more I write, the more useless statements I make


I am glad to hear that there are more members of this club than just
myself.  Welcome!

In fact, your questions and statements have been helpful.  If you have a
question because our documentation is lacking, there are others in the same
situation.

Re: LogLikelihoodSimilarity calculation

Posted by Ted Dunning <te...@gmail.com>.
Aishwarya,

The two matrices in question are this one for overlap == 10,

     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910

and this one for overlap == 20

     [,1]   [,2]
[1,]   20 899980
[2,]   80  99920

Column 1 here represents prefs1, column 2 represents not(prefs1), row 1
represents prefs2, and row 2 represents not(prefs2).

As you can see in the first case, the two columns are different in trend.
The first column has a top element that is smaller and the second column
has a top element which is larger.  This is a highly unusual situation for
cooccurrence, of course, because most things are actually kind of rare.

So what is happening here is that as you move from intersection of 10 to
intersection of 20, the distributions in column 1 and column 2 are becoming
*more* alike than before.  Thus the score goes from the
ultra-super-hyper-mega-massive 350 to the merely
ultra-super-hyper-mega 270.
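
Incidentally, this also explains the Mahout similarity scores you mention:
userSimilarity rescales the raw LLR onto [0, 1). Assuming the mapping is
similarity = 1 - 1/(1 + LLR) - that is my reading of
LogLikelihoodSimilarity.java, so treat it as an assumption - both scores
fall out:

> 1 - 1 / (1 + 351.6271)   # ~0.99716, the 0.9971 reported for intersection 10
> 1 - 1 / (1 + 272.6)      # ~0.99635, the 0.9963 reported for intersection 20

The rescaling is monotone, so a larger LLR always maps to a similarity closer
to 1, even when, as here, the large LLR is signalling that the two users'
distributions are very *different*.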

A more interesting example is where you have two items which each occur only
100 times out of the million in total.  Now if these two co-occur 10 times,
we get this:

     [,1]   [,2]
[1,]   10     90
[2,]   90 999810

which gives an LLR score of 120.  This represents a huge score because two
events which occur 1/10,000 of the time occur in the presence of the other
at a rate of 10%.  This is 1000x lift.

For a cooccurrence of 20, we get this matrix:

     [,1]   [,2]
[1,]   20     90
[2,]   90 999800

And now the LLR goes to 264.  The lift in frequency is now 2000x so the
score is much higher.
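
If you want to verify these numbers, the llr and H definitions from my Sep 11
message reproduce them (a sketch; values rounded):

> H = function(k) {N = sum(k); sum(k/N * log(k/N + (k==0)))}
> llr = function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
> llr(matrix(c(10, 90, 90, 999810), nrow = 2))   # ~120
> llr(matrix(c(20, 90, 90, 999800), nrow = 2))   # ~264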

Does this help?




On Fri, Sep 26, 2014 at 11:55 AM, Aishwarya Srivastava <
aishwarya.srivastava@bloomreach.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by Aishwarya Srivastava <ai...@bloomreach.com>.
Hi Ted & Pat,

Thank you so much for your answers. I can see now why it makes sense that
in my first case (intersection size 90), LLR returns zero.

But with the second case I still don't understand one thing. To restate the
problem: I have the following case where numItems = 1,000,000, prefs1Size =
900,000, prefs2Size = 100 and intersection size is 10.
The matrix that Ted generated is
     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910

The LLR for this case is 351.6270674569532. I understand this.

But if the users had so little in common, is this not dissimilarity?

In Mahout today I am getting a score of 0.9971.

If however the intersection size is 20, then the LLR is 272.6 and
Mahout's similarity score is 0.9963.

I get how the LLR should decrease. But is an intersection of 20 not more
similar than an intersection of 10?

Sorry if I am missing something very obvious.

Thanks,
Aishwarya


On Tue, Sep 23, 2014 at 3:03 AM, <ma...@gmail.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by ma...@gmail.com.
> Can you correct me if so?

This makes me doubt the correctness of my point :) The best thing is
probably to write down a small example, with numbers and formulae. That is
the best way to see either whether I missed the point or whether there is
actually a subtle truth...

I'll soon get back to the group with something less abstract - thanks again!

On Sun, Sep 21, 2014 at 10:06 PM, Ted Dunning <te...@gmail.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Sep 19, 2014 at 3:29 AM, <ma...@gmail.com> wrote:

> So my question was - shouldn't we consider both the frequency distribution
> of item sales *and* of users' purchases in the same formula? Am I correct if
> I say that this does not happen when we compute the contingency table (if
> we build the contingency table for two users, we do not consider the
> frequency distribution of book sales, and vice versa)?
>
> That said, I am fully aware that mine is a mainly academic question, as the
> LLR does a fantastic job anyway...!
>

As I understand it, I believe that LLR does what you want, since it knows
the overall frequency of the user and the item in question.

What it does not do directly is include information about how *other* users
and *other* items are distributed, except in aggregate.

On the other hand, when you rank these LLR scores for a single user, you do
incorporate evidence from all other items (relative to that single user).

I think that your point is actually quite subtle and I may have missed the
point.  Can you correct me if so?

Re: LogLikelihoodSimilarity calculation

Posted by ma...@gmail.com.
Hi All

Pat and Ted, thank you very much for your answers! As usual, much
appreciated.

I see that the more I write, the more useless statements I make - such as
"[User] would most probably take 100 best sellers...". Ted, you are
obviously right; it's not the case.

Regarding the second point, it is not that I think that high overlap causes
problems. I would just like to compute the likelihood that a certain
overlap is obtained by chance, or because the two users are similar or
dissimilar.  This was my concern when computing similarities between
crowd-based book lists. Some lists had only very few books, and some of those
lists had books in common: they were the "intellectual" lists with rare,
highly informative books, so even a few books in common were significant.
Other lists, with hundreds of books (e.g. the "free ebooks" list), had the
same books in common, but in this case the similarity weight given by those
same "rare" books had to be lower, because the lists were different - they
were "pop" lists.

So my question was - shouldn't we consider both the frequency distribution
of item sales *and* of users' purchases in the same formula? Am I correct if
I say that this does not happen when we compute the contingency table (if
we build the contingency table for two users, we do not consider the
frequency distribution of book sales, and vice versa)?

That said, I am fully aware that mine is a mainly academic question, as the
LLR does a fantastic job anyway...!

Thanks again for your time (and for doing such a great job bringing Mahout
to Spark :) )
Mario



On Sun, Sep 14, 2014 at 8:01 PM, Ted Dunning <te...@gmail.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by Ted Dunning <te...@gmail.com>.
Mario,

Your questions are good.  And the answers, such as they are, bear repeating
and elaboration.

I see several basic points in what you write.  Selecting from this, I would
highlight the following:

1) random-walk users picking items according to overall frequency will tend
to pick the same very popular items

2) high overlap of this kind will cause problems with recommendation


To the first point, the probability that an undirected user picks only
common items is actually quite low.  The key property of a long-tailed
frequency distribution is exactly that the long-tail has very significant
probability mass.  The high-frequency head items have significant mass as
well, but they do not dominate.  Any user selecting 100 items will
necessarily pick some high-frequency items, some mid-frequency items and some
rare (i.e. long-tail) items.

With respect to the second point, I would point out that the premise is
flawed so the question is already addressed.  But even if we consider
further, there are two issues with the assertion.  The first and most
significant is the unstated assumption that overlap between user histories
is what we are looking for.  In fact, with LLR, we are not doing that at
all.  We are looking at overlap of users in the item history, corrected
according to underlying frequency.  This means that even if 80 or 90% of
the items selected by our random walk users are from the same small set of
high-frequency items (and they will not be), then we still don't really
have a problem.  It just means that we will be spending too many cycles on
the analysis of things we could find out quickly.

The second issue with this is that even for the high-frequency items, if
they have no tendency toward cooccurrence beyond what is expected from their
high underlying frequency, the LLR score will not be unusually high.  Now,
high frequency items often do have small correlations in their occurrences
and since they are abnormally sensitive by virtue of their high frequency,
this can lead to a few items being commonly marked as indicators for others.

This is also not much of a problem because the search engine will
down-weight such common indicators.

Does this help?  I know that I cherry-picked those of your questions that I
have the strongest answers for, but it seems to me that they are also the
most fundamental ones.




On Sat, Sep 13, 2014 at 12:28 AM, <ma...@gmail.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry if this is a repeat; I’m never sure whether I’m using the right email address.

On Sep 14, 2014, at 7:49 AM, Pat Ferrel <pa...@gmail.com> wrote:

The key phrase is frequency. If we use the search engine method for returning recs, we are using LLR to find significant cooccurrences—as a filter. Then, in the final step of indexing and querying for recs, the indicators are (or can be) TF-IDF weighted, and cosine similarity is used for the query. This will down-weight universally popular items. If, for some reason, you want to favor blockbusters, turn off norms and TF-IDF to leave the high weights for popular items.

Does this address your issue? 

On Sep 13, 2014, at 12:28 AM, mario.alemi@gmail.com wrote:

[...]




Re: LogLikelihoodSimilarity calculation

Posted by ma...@gmail.com.
Hi All

One consideration. If we assume that all books have the same probability of
being bought, K11=90 has no significance for the recommendation, as
rightly comes out from LLR. The probability of seeing a certain K11, from the
point of view of User1, is binomial(K11, 100, 0.9), which has its maximum at
K11=90. So the likelihood that the 90 books in common are there by chance is
at its maximum.

K11=10 is, on the contrary, significant. It says that users 1 and 2 are
*dissimilar*. The probability of user 2 picking only 10 books in common
with user 1 is actually quite low (binomially, 10 successes out of 100 trials
with p=90%, i.e. on the order of 1e-78).
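
That order of magnitude is easy to check in R under the binomial model just
described:

> dbinom(10, size = 100, prob = 0.9)   # P(exactly 10 successes in 100 trials at p=0.9); ~6e-78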

LLR makes this clear too. Still, I wonder if it is possible in some way to
take into consideration that *not all books have the same probability of
being bought*. The Pareto nature of book selling means that if User2
buys 100 books randomly, according to their sales frequency, he would most
probably pick 100 best sellers - not the long tail - and would end up with
a computable (but unknown to me) number of books in common with User1.
Therefore it is important not just to consider how many books User1 has in
common with User2, but also *which* books are in common. For two users with
100 books, having 10 blockbusters in common is not significant; having
even 10 extremely rare books in common is quite significant. But this
does not come out from the computation of the LLR for two users. It does
come out when we compute the similarity between two books - but then, in
this case, we do not consider the number of books bought by each of the
users who bought the two books.

(Ted, I am sorry to bring this topic up again, after the comment on your
blog, but every time I use the LLR - for item or user similarity - this
question always comes up in my mind, and I cannot see in the formulae how
it is addressed.)

Regards,
Mario



On Thu, Sep 11, 2014 at 5:43 AM, Ted Dunning <te...@gmail.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by Ted Dunning <te...@gmail.com>.
It might help to look at the matrices that result:

First I defined a function in R to generate the contingency tables:

> f = function(k11, n1 = 100, n2 = 900000, n = 1e6) {
+   matrix(c(k11, n1 - k11, n2 - k11, n - n1 - n2 + k11), nrow = 2)
+ }

One of your examples is this one

> f(90)
     [,1]   [,2]
[1,]   90 899910
[2,]   10  99990

Notice how the two columns are basically the same except for a scaling
factor.
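
You can check that directly; both rows give exactly the same column ratio:

> f(90)[, 1] / f(90)[, 2]   # both entries are exactly 1/9999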

Here is your other example

> f(10)
     [,1]   [,2]
[1,]   10 899990
[2,]   90  99910

Now what we have is that in the first column, row 2 is bigger while in the
second column, row 1 is bigger.  That is, the distributions are quite
different.

Here is the actual LLR score for the first example:

> llr(f(90))
[1] -1.275022e-10

(the negative sign is spurious, the result of round-off error.  The real
result is basically just 0.)

And for the second:

> llr(f(10))
[1] 351.6271

Here we see a huge value, which says (as we saw) that the distributions are
different.

For reference, here is the R code for llr:

> # H is the negated Shannon entropy (in nats); the (k==0) term guards against log(0)
> H = function(k) {N = sum(k); sum(k/N * log(k/N + (k==0)))}
> # LLR = 2 * N * (mutual information between the row and column variables)
> llr = function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}


On Wed, Sep 10, 2014 at 2:32 PM, Aishwarya Srivastava <
aishwarya.srivastava@bloomreach.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by Aishwarya Srivastava <ai...@bloomreach.com>.
Hi Dmitriy,

I am following the same calculation used in the userSimilarity method in
LogLikelihoodSimilarity.java

k11 = intersectionSize                                       (viewed by both users)

k12 = prefs2Size - intersectionSize                          (viewed only by user 2)

k21 = prefs1Size - intersectionSize                          (viewed only by user 1)

k22 = numItems - prefs1Size - prefs2Size + intersectionSize  (viewed by neither user)
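
Spelled out in R with the numbers from this thread (a sketch; llr is the
function from Ted's message above):

> numItems = 1e6; prefs1Size = 900000; prefs2Size = 100; intersectionSize = 10
> k11 = intersectionSize
> k12 = prefs2Size - intersectionSize
> k21 = prefs1Size - intersectionSize
> k22 = numItems - prefs1Size - prefs2Size + intersectionSize
> k = matrix(c(k11, k21, k12, k22), nrow = 2)   # Ted's f(10) table, transposed; LLR is unchanged
> llr(k)   # ~351.63, the score discussed above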


Thanks,

Aishwarya

On Wed, Sep 10, 2014 at 2:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> [...]

Re: LogLikelihoodSimilarity calculation

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
how do you compute k11, k12... values exactly?

On Wed, Sep 10, 2014 at 1:55 PM, aishsesh <
aishwarya.srivastava@bloomreach.com> wrote:

> [...]