Posted to user@mahout.apache.org by Phoenix Bai <ba...@gmail.com> on 2013/04/10 11:15:02 UTC

log-likelihood ratio value in item similarity calculation

Hi,

the counts for two events are:
                    Event A       Everything but A
  Event B           k11 = 7       k12 = 8
  Everything but B  k21 = 13      k22 = 300,000
according to the code, I will get:

rowEntropy    = entropy(7, 8)  + entropy(13, 300,000) = 222
colEntropy    = entropy(7, 13) + entropy(8, 300,000)  = 152
matrixEntropy = entropy(7, 8, 13, 300,000)            = 458

thus,

LLR=2.0*(458-222-152) = 168
similarityScore = 1 - 1/(1+168) = 0.994
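
For reference, here is a minimal, self-contained sketch of the calculation
above (assuming the unnormalized entropy
entropy(k1, ..., kn) = N*log2(N) - sum_i ki*log2(ki), with N = sum_i ki,
i.e. log base 2; this is a sketch, not Mahout's own code):

// Minimal sketch of the LLR and similarity calculation above.
public class LlrSketch {

  // Unnormalized entropy in bits: N*log2(N) - sum_i ki*log2(ki), N = sum_i ki.
  static double entropy(double... counts) {
    double n = 0.0;
    double xLogX = 0.0;
    for (double k : counts) {
      n += k;
      if (k > 0) {
        xLogX += k * (Math.log(k) / Math.log(2));
      }
    }
    return n * (Math.log(n) / Math.log(2)) - xLogX;
  }

  public static void main(String[] args) {
    double k11 = 7, k12 = 8, k21 = 13, k22 = 300000;

    double rowEntropy = entropy(k11, k12) + entropy(k21, k22);    // ~222
    double colEntropy = entropy(k11, k21) + entropy(k12, k22);    // ~152
    double matrixEntropy = entropy(k11, k12, k21, k22);           // ~458

    double llr = 2.0 * (matrixEntropy - rowEntropy - colEntropy); // ~168 (bits)
    double similarity = 1.0 - 1.0 / (1.0 + llr);                  // ~0.994

    System.out.printf("LLR = %.1f, similarity = %.4f%n", llr, similarity);
  }
}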

So, my problem is:
the similarity scores I get for all the items are this high, which makes
it hard to identify the truly similar ones.

As you can see, the counts for events A and B are quite small while the
count for k22 is quite high. This pattern is quite common in
my dataset.

So, my question is:
what kind of adjustment could I make to bring the similarity scores into
a more reasonable range?

Please shed some light, thanks in advance!

Re: log-likelihood ratio value in item similarity calculation

Posted by Sean Owen <sr...@gmail.com>.
Yes, using counts is more efficient. Certainly it makes the LLR value
different since the results are not normalized; all the input values are N
times larger (N = sum k), and so the LLR is N times larger. 2x more events
in the same ratio will make the LLR 2x larger too.

That's just fine if you're comparing values within the same system, since
the LLRs are all scaled by the same factor. The similarity values don't
change proportionally, however, and that does affect results. I think the
effect is generally small, but it is not zero and not trivial, yes.
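
To illustrate with a quick sketch (not Mahout's own code; it assumes the
same unnormalized-entropy form of LLR as the numbers earlier in the
thread, with natural log): scaling all four counts by the same factor
scales the LLR by that factor, while similarity = 1 - 1/(1 + LLR) does not
scale with it.

public class LlrScalingSketch {

  // Unnormalized entropy with natural log: N*ln(N) - sum_i ki*ln(ki), N = sum_i ki.
  static double entropy(double... counts) {
    double n = 0.0, xLogX = 0.0;
    for (double k : counts) {
      n += k;
      if (k > 0) {
        xLogX += k * Math.log(k);
      }
    }
    return n * Math.log(n) - xLogX;
  }

  static double llr(double k11, double k12, double k21, double k22) {
    return 2.0 * (entropy(k11, k12, k21, k22)
        - entropy(k11, k12) - entropy(k21, k22)    // row entropies
        - entropy(k11, k21) - entropy(k12, k22));  // column entropies
  }

  public static void main(String[] args) {
    for (double scale : new double[] {1, 2, 10}) {
      double ratio = llr(7 * scale, 8 * scale, 13 * scale, 300000 * scale);
      double similarity = 1.0 - 1.0 / (1.0 + ratio);
      // scale 1:  LLR ~ 117,  similarity ~ 0.9915
      // scale 2:  LLR ~ 233,  similarity ~ 0.9957
      // scale 10: LLR ~ 1167, similarity ~ 0.9991
      System.out.printf("scale %.0f: LLR = %.1f, similarity = %.4f%n",
          scale, ratio, similarity);
    }
  }
}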


On Wed, Apr 10, 2013 at 11:18 AM, Phoenix Bai <ba...@gmail.com> wrote:

> Good point.
>
> btw, why use counts instead of probabilities? For easier and more
> efficient implementation?
> Also, do you think the similarity score using counts might differ much
> from the one using probabilities?
>
> Thank you very much for your prompt reply.

Re: log-likelihood ratio value in item similarity calculation

Posted by Ted Dunning <te...@gmail.com>.
Counts are critical here.

Suppose that two rare events occur together the first time you ever see
them.  How exciting is this?  Not very in my mind, but not necessarily
trivial.

Now suppose that they occur together 20 times and never occur alone after
you have collected 20 times more data. This is a huge deal.

Without counts, you can't see the difference.
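
To put rough numbers on that (a sketch using the same unnormalized-entropy
form of LLR with natural log; the background totals of 10,000 and 200,000
observations are made up purely for illustration, not data from this
thread):

public class RareCooccurrenceSketch {

  // Unnormalized entropy with natural log: N*ln(N) - sum_i ki*ln(ki), N = sum_i ki.
  static double entropy(double... counts) {
    double n = 0.0, xLogX = 0.0;
    for (double k : counts) {
      n += k;
      if (k > 0) {
        xLogX += k * Math.log(k);
      }
    }
    return n * Math.log(n) - xLogX;
  }

  static double llr(double k11, double k12, double k21, double k22) {
    return 2.0 * (entropy(k11, k12, k21, k22)
        - entropy(k11, k12) - entropy(k21, k22)
        - entropy(k11, k21) - entropy(k12, k22));
  }

  public static void main(String[] args) {
    // Seen together once, never apart, against 10,000 other observations.
    double once = llr(1, 0, 0, 10000);        // ~20
    // Seen together 20 times, never apart, after 20x more data.
    double twenty = llr(20, 0, 0, 200000);    // ~408
    System.out.printf("once: %.1f   twenty: %.1f%n", once, twenty);
  }
}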




On Wed, Apr 10, 2013 at 3:18 AM, Phoenix Bai <ba...@gmail.com> wrote:

> Good point.
>
> btw, why use counts instead of probabilities? For easier and more
> efficient implementation?
> Also, do you think the similarity score using counts might differ much
> from the one using probabilities?
>
> Thank you very much for your prompt reply.

Re: log-likelihood ratio value in item similarity calculation

Posted by Phoenix Bai <ba...@gmail.com>.
Good point.

btw, why use counts instead of probabilities? For easier and more
efficient implementation?
Also, do you think the similarity score using counts might differ much
from the one using probabilities?

Thank you very much for your prompt reply.


On Wed, Apr 10, 2013 at 5:50 PM, Sean Owen <sr...@gmail.com> wrote:

> These events do sound 'similar'. They occur together about half the
> time either one of them occurs. You might have many pairs that end up
> being similar for the same reason, and this is not surprising. They're
> all "really" similar.
>
> The mapping here from LLR's range of [0,inf) to [0,1] is pretty
> arbitrary, but it is an increasing function of LLR. So the ordering
> you get is exactly the ordering LLR dictates. Yes you are going to get
> a number of values near 1 at the top, but does it matter?
>
> LLR = 0 and similarity = 0 when the events appear perfectly
> independent. For example, if A and B occur with probability 10%,
> independently, then you might have k11 = 1, k12 = 9, k21 = 9, k22 =
> 81. The matrix (joint probability) has no more info than the marginal
> probabilities, so the matrix entropy == row entropy + col entropy and
> LLR = 0.

Re: log-likelihood ratio value in item similarity calculation

Posted by Sean Owen <sr...@gmail.com>.
These events do sound 'similar'. They occur together about half the
time either one of them occurs. You might have many pairs that end up
being similar for the same reason, and this is not surprising. They're
all "really" similar.

The mapping here from LLR's range of [0,inf) to [0,1] is pretty
arbitrary, but it is an increasing function of LLR. So the ordering
you get is exactly the ordering LLR dictates. Yes you are going to get
a number of values near 1 at the top, but does it matter?

LLR = 0 and similarity = 0 when the events appear perfectly
independent. For example, if A and B occur with probability 10%,
independently, then you might have k11 = 1, k12 = 9, k21 = 9, k22 =
81. The matrix (joint probability) has no more info than the marginal
probabilities, so the matrix entropy == row entropy + col entropy and
LLR = 0.
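
A quick check of that example (again a sketch of the unnormalized-entropy
form of LLR, not Mahout's own code):

public class IndependenceSketch {

  // Unnormalized entropy with natural log: N*ln(N) - sum_i ki*ln(ki), N = sum_i ki.
  static double entropy(double... counts) {
    double n = 0.0, xLogX = 0.0;
    for (double k : counts) {
      n += k;
      if (k > 0) {
        xLogX += k * Math.log(k);
      }
    }
    return n * Math.log(n) - xLogX;
  }

  public static void main(String[] args) {
    // A and B each occur 10% of the time, independently, over 100 observations.
    double k11 = 1, k12 = 9, k21 = 9, k22 = 81;
    double llr = 2.0 * (entropy(k11, k12, k21, k22)
        - entropy(k11, k12) - entropy(k21, k22)
        - entropy(k11, k21) - entropy(k12, k22));
    double similarity = 1.0 - 1.0 / (1.0 + llr);
    // Both print as ~0 (up to floating-point noise).
    System.out.printf("LLR = %.3f, similarity = %.3f%n", llr, similarity);
  }
}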



Re: log-likelihood ratio value in item similarity calculation

Posted by Ted Dunning <te...@gmail.com>.
The only virtue of using the natural base is that you get a nice asymptotic
distribution for random data.




On Fri, Apr 12, 2013 at 1:10 AM, Sean Owen <sr...@gmail.com> wrote:

> Yes that's true, it is more usually bits. Here it's natural log / nats.
> Since it's unnormalized anyway another constant factor doesn't hurt and it
> means not having to change the base.

Re: log-likelihood ratio value in item similarity calculation

Posted by Sean Owen <sr...@gmail.com>.
Yes that's true, it is more usually bits. Here it's natural log / nats.
Since it's unnormalized anyway another constant factor doesn't hurt and it
means not having to change the base.


On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai <ba...@gmail.com> wrote:

> I got 168 because I used log base 2 instead of e.
> If memory serves, I read in the definition of entropy that people normally
> use base 2, so I just assumed it was base 2 in the code (my bad).
>
> And now I have a better understanding, so thank you both for the
> explanation.
>
>

Re: log-likelihood ratio value in item similarity calculation

Posted by Phoenix Bai <ba...@gmail.com>.
I got 168 because I used log base 2 instead of e.
If memory serves, I read in the definition of entropy that people normally
use base 2, so I just assumed it was base 2 in the code (my bad).

And now I have a better understanding, so thank you both for the
explanation.
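
(For the record, the two figures are consistent: 168 bits of LLR times
ln 2 ≈ 0.693 is about 116.5 nats, i.e. the 116.69 from the natural-log
calculation; the small gap is just rounding in the intermediate entropies.)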


On Fri, Apr 12, 2013 at 6:01 AM, Sean Owen <sr...@gmail.com> wrote:

> Yes I also get (er, Mahout gets) 117 (116.69), FWIW.
>
> I think the second question concerned counts vs relative frequencies
> -- normalized, or not. Like whether you divide all the counts by their
> sum or not. For a fixed set of observations that does change the LLR
> because it is unnormalized, not because the situation has changed.
>
> Obviously you're right that the changing situations you describe do
> entail a change in LLR!

Re: log-likelihood ratio value in item similarity calculation

Posted by Sean Owen <sr...@gmail.com>.
Yes I also get (er, Mahout gets) 117 (116.69), FWIW.

I think the second question concerned counts vs relative frequencies
-- normalized, or not. Like whether you divide all the counts by their
sum or not. For a fixed set of observations that does change the LLR
because it is unnormalized, not because the situation has changed.

Obviously you're right that the changing situations you describe do
entail a change in LLR!

On Thu, Apr 11, 2013 at 10:52 PM, Ted Dunning <te...@gmail.com> wrote:
> These numbers don't match what I get.
>
> I get LLR = 117.
>
> This is wildly anomalous so this pair should definitely be connected.  Both
> items are quite rare (15/300,000 or 20/300,000 rates) but they occur
> together most of the time that they appear.

Re: log-likelihood ratio value in item similarity calculation

Posted by Ted Dunning <te...@gmail.com>.
These numbers don't match what I get.

I get LLR = 117.

This is wildly anomalous so this pair should definitely be connected.  Both
items are quite rare (15/300,000 or 20/300,000 rates) but they occur
together most of the time that they appear.


