Posted to user@mahout.apache.org by Mario Levitin <ma...@gmail.com> on 2014/04/28 00:30:41 UTC

Understanding LogLikelihood Similarity

Hi,

I've used LogLikelihood Similarity in user-based nearest-neighbor
collaborative filtering and it has given good results (better than the
other similarity metrics I tried).
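
For concreteness, the setup is roughly the sketch below (Mahout 0.x Taste
API; the file name, user ID and neighborhood size are placeholders rather
than my actual configuration):

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class LlrUserBasedExample {
        public static void main(String[] args) throws Exception {
            // Boolean userID,itemID interactions; "interactions.csv" is a placeholder.
            DataModel model = new FileDataModel(new File("interactions.csv"));

            // Log-likelihood ratio similarity between users.
            UserSimilarity similarity = new LogLikelihoodSimilarity(model);

            // 50 nearest neighbors is an arbitrary choice for the sketch.
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);

            Recommender recommender =
                new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);

            // Top 10 recommendations for a (placeholder) user.
            for (RecommendedItem item : recommender.recommend(42L, 10)) {
                System.out.println(item.getItemID() + " " + item.getValue());
            }
        }
    }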

I have read the blog post by Ted Dunning (
http://tdunning.blogspot.com.tr/2008/03/surprise-and-coincidence.html) also
looked at the implementation in Mahout. However, I still do not understand
"why" this similarity metric works.

I'm trying to give it a probabilistic interpretation in order to understand
the logic behind it. Any probabilistic interpretation should define random
variables, events, etc. However, my attempts in this respect have been
unsuccessful.

Any help will be appreciated.
Thanks

Re: Understanding LogLikelihood Similarity

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
sorry, clicked on wrong thread. please disregard.



Re: Understanding LogLikelihood Similarity

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
sure. I assume this should include statements that something crushes
something without providing a link to a published analysis of what it is
something that crushes something another and due to what something.



Re: Understanding LogLikelihood Similarity

Posted by Mario Levitin <ma...@gmail.com>.
After some thought I'm now in a better position, but I think it will take
some more effort to fully digest it.
Thanks for your answers, Ted.




Re: Understanding LogLikelihood Similarity

Posted by Ted Dunning <te...@gmail.com>.
OK.

Whether a user has interacted with A is a sample from a binomial
distribution with an unknown parameter p_A.  Likewise with B and p_B.  The
two binomial distributions may or may not be independent.

The LLR measures the degree of evidence against independence.
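
To make that mapping concrete, here is a small self-contained sketch (the
class and helper names are invented for illustration; this is not the
Mahout implementation). The rows of the contingency table are the A and
not-A groups of users, "success" within each row means also interacting
with B, and the LLR compares separate rates p1 and p2 against a single
pooled rate p:

    public class TwoBinomialLlr {

        // Binomial log likelihood of k successes in n trials at rate p.
        // The binomial coefficient is omitted because it cancels in the ratio.
        static double logL(double p, long k, long n) {
            return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
        }

        // log(0) only ever shows up multiplied by a zero count, so treat it as 0.
        static double safeLog(double x) {
            return x <= 0.0 ? 0.0 : Math.log(x);
        }

        static double llr(long k11, long k12, long k21, long k22) {
            long k1 = k11, n1 = k11 + k12;   // B-count and size of the A group
            long k2 = k21, n2 = k21 + k22;   // B-count and size of the not-A group

            double p1 = (double) k1 / n1;                // rate of B given A
            double p2 = (double) k2 / n2;                // rate of B given not-A
            double p  = (double) (k1 + k2) / (n1 + n2);  // pooled rate under H0: p1 == p2

            // -2 log lambda: two free rates (p1, p2) versus one shared rate (p).
            return 2.0 * (logL(p1, k1, n1) + logL(p2, k2, n2)
                        - logL(p, k1, n1) - logL(p, k2, n2));
        }

        public static void main(String[] args) {
            // Toy counts: 10 users saw both A and B, 30 only A, 20 only B, 940 neither.
            System.out.println(llr(10, 30, 20, 940));   // about 30.07
        }
    }

With this reading, the p, p1, p2, k1, k2, n1, n2 of the 1993 paper are just
re-expressions of k11, k12, k21, k22, which is exactly the link the
contingency table provides.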





Re: Understanding LogLikelihood Similarity

Posted by Mario Levitin <ma...@gmail.com>.
Ted, I understand how the contingency table is constructed, and how to
compute the LLR value. What I cannot understand is how to link this with
binomial distributions.



Re: Understanding LogLikelihood Similarity

Posted by Ted Dunning <te...@gmail.com>.
The contingency table is constructed by looking at how many users have
expressed preference or interest in two items.  If the items are A and B,
the pertinent counts are

k11 - the number of users who interacted with both A and B
k12 - the number of users who interacted with A but not B
k21 - the number of users who interacted with B but not A
k22 - the number of users who interacted with neither A nor B.

These are the values that go into the contingency table, and they are all
that is needed to compute the LLR.

See http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html for a
detailed description.
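
In code, those four counts are all you need. Below is a self-contained
sketch of the computation (the class name is invented; if I remember
correctly, Mahout ships an equivalent helper as
LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22) in
org.apache.mahout.math.stats):

    import static java.lang.Math.log;

    // Minimal sketch of the "surprise and coincidence" form of the LLR.
    // entropyTerm(x...) returns sum(x * ln x) - N * ln N, which is -N times
    // the Shannon entropy of the normalized counts, so the N factors cancel
    // out in llr().
    public class CountLlr {

        static double xLogX(long x) {
            return x == 0 ? 0.0 : x * log(x);
        }

        static double entropyTerm(long... counts) {
            long total = 0;
            double sum = 0.0;
            for (long k : counts) {
                total += k;
                sum += xLogX(k);
            }
            return sum - xLogX(total);
        }

        // LLR = 2 * N * (H(rowSums) + H(colSums) - H(table)); with the
        // unnormalized terms above this becomes 2 * (table - rows - cols).
        static double llr(long k11, long k12, long k21, long k22) {
            double rows  = entropyTerm(k11 + k12, k21 + k22);
            double cols  = entropyTerm(k11 + k21, k12 + k22);
            double table = entropyTerm(k11, k12, k21, k22);
            return 2.0 * (table - rows - cols);
        }

        public static void main(String[] args) {
            // Toy counts (10, 30, 20, 940): the two-binomial form and this
            // entropy form give the same value (about 30.07).
            System.out.println(llr(10, 30, 20, 940));
        }
    }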





Re: Understanding LogLikelihood Similarity

Posted by Mario Levitin <ma...@gmail.com>.
Hi Ted,
I have read the paper. I understand the "Likelihood Ratio for Binomial
Distributions" part.
However, I cannot make the connection between this part and the contingency
table.

In order to calculate the Likelihood Ratio for two Binomial Distributions
you need the values p, p1, p2, k1, k2, n1, n2.
But the information contained in the contingency table is different from
these values. So, again, I do not understand how the information contained
in the contingency table is linked with the Likelihood Ratio for Binomial
Distributions.

In order to find the similarity between two users I tend to think of the
boolean preferences of user1 as a sample from a binomial distribution and
the boolean preferences of user2 as another sample from a binomial
distribution, and then use the LLR to assess how likely it is that the two
distributions are the same. But I don't think this is correct, since that
calculation does not use the contingency table.

I hope my question is clear.
Thanks.




Re: Understanding LogLikelihood Similarity

Posted by Ted Dunning <te...@gmail.com>.
Excellent.  Look forward to hearing your reactions.


Re: Understanding LogLikelihood Similarity

Posted by Mario Levitin <ma...@gmail.com>.
Not yet, but I will.
Thanks



Re: Understanding LogLikelihood Similarity

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Apr 28, 2014 at 12:30 AM, Mario Levitin <ma...@gmail.com> wrote:

> I'm trying to give it a probabilistic interpretation in order to understand
> the logic behind it. Any probabilistic interpretation should define random
> variables, events, etc. However, my attempts in this respect have been
> unsuccessful.
>

Mario,

Good to see you here.

Have you read my original paper on the topic of LLR?  It explains the
connection with chi^2 measures of association.

The paper can be found here: http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf

It is very short and fairly understandable.  Modern implementations are
simpler than the formulae in the paper, but the narrative should be pretty
clear.
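
For reference, the chi^2 connection in standard notation (this is textbook
material rather than a quote from the paper): with cell counts k_ij, row
sums R_i, column sums C_j and total N,

    E_{ij} = \frac{R_i C_j}{N}, \qquad
    G^2 = 2 \sum_{i,j} k_{ij} \ln \frac{k_{ij}}{E_{ij}}, \qquad
    X^2 = \sum_{i,j} \frac{(k_{ij} - E_{ij})^2}{E_{ij}}

Both statistics are asymptotically chi^2 with one degree of freedom for a
2x2 table under independence; G^2 is the LLR discussed in this thread, and
the point of the paper is that it stays well behaved when some of the
counts are small or skewed, where X^2 does not.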

If you have (or once you have) read that paper, could you say why that
description doesn't meet your needs for a probabilistic interpretation?  If
you can, I would imagine that I or others will be able to help further.