Posted to user@mahout.apache.org by Parimi Rohit <ro...@gmail.com> on 2014/09/30 01:24:35 UTC

Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Hi,

I am exploring a random-walk based algorithm for recommender systems which
works by propagating the item preferences for users on the user-user graph.
To do this, I have to compute user-user similarity and form a neighborhood.
I have tried the following three simple techniques to compute the score
between two users and find the neighborhood.

1. Score = (common items between users A and B) / (items preferred by A +
items preferred by B)
2. Scoring based on Mahout's Cosine Similarity
3. Scoring based on Mahout's LogLikelihood similarity.
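
For concreteness, the three scores above can be sketched in plain Python over
each user's set of preferred items. This is a sketch under my own naming, not
Mahout's actual code; the LLR form follows Dunning's entropy formulation of
the 2x2 co-occurrence table.

```python
import math

def xlogx(x):
    # Convention: 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy used in Dunning's log-likelihood ratio.
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def naive_score(a, b):
    # Technique 1: |A intersect B| / (|A| + |B|), i.e. half the Dice coefficient.
    return len(a & b) / (len(a) + len(b))

def boolean_cosine(a, b):
    # Technique 2: cosine similarity over boolean preference vectors.
    return len(a & b) / math.sqrt(len(a) * len(b))

def llr_score(a, b, total_items):
    # Technique 3: log-likelihood ratio over the 2x2 co-occurrence table.
    # It is ~0 when the two users' item choices look statistically independent.
    k11 = len(a & b)                 # items both users preferred
    k12 = len(a - b)                 # items only user A preferred
    k21 = len(b - a)                 # items only user B preferred
    k22 = total_items - len(a | b)   # items neither preferred
    return 2.0 * (entropy(k11 + k12, k21 + k22)
                  + entropy(k11 + k21, k12 + k22)
                  - entropy(k11, k12, k21, k22))
```

On two users with item sets {1, 2, 3, 4} and {3, 4, 5}, the naive score is
2/7 and the boolean cosine is 2/sqrt(12); the LLR additionally needs the total
item count so it can judge whether the overlap is surprising.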

My understanding is that similarity based on LogLikelihood is more robust;
however, I get better results using the naive approach (technique 1 from
the above list). The problems I am addressing are collaborator
recommendation, conference recommendation, and reference recommendation,
and the data has implicit feedback.

So, my question is: are there any cases where the cosine similarity and
log-likelihood metrics fail to capture similarity? For example, in the
problems stated above, users collaborate with only a few other users (based
on area of interest), publish in only a few conferences (again based on area
of interest), and cite publications in a specific domain, so the
preference counts are fairly small compared to other domains (music, video,
etc.).

Secondly, for CosineSimilarity, should I treat the preferences as boolean
or use the counts? (I think the log-likelihood metric does not take the
preference counts into account; correct me if I am wrong.)
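
Whether to use booleans or counts is not cosmetic for cosine: a toy example
(hypothetical vectors, not from the thread) shows the two choices can
disagree sharply.

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two preference vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical users: identical item sets, very different counts.
counts_a = [100, 1, 1]   # user A hit item 0 a hundred times
counts_b = [1, 1, 1]     # user B touched each item once
bool_a = [1 if c > 0 else 0 for c in counts_a]
bool_b = [1 if c > 0 else 0 for c in counts_b]
```

With booleans the two users are identical (cosine 1.0); with raw counts the
repeated item drags the cosine down to roughly 0.59, so the choice genuinely
changes which neighbors get picked.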

Any insight into this is much appreciated.

Thanks,
Rohit

p.s. Ted, Pat: I am following the discussion on the thread
"LogLikelihoodSimilarity Calculation" and your answers helped me a lot to
understand how it works and made me wonder why things are different in my
case.

Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Parimi Rohit <ro...@gmail.com>.
Thanks Ted! Will look into it.

Rohit


Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Ted Dunning <te...@gmail.com>.
Here is a paper that includes an analysis of voting patterns using LDA.

http://arxiv.org/pdf/math/0604410.pdf




Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Parimi Rohit <ro...@gmail.com>.
Ted,

I know LDA can be used to model text data but have never used it in this
setting. Can you please give me some pointers on how to apply it here?

Thanks,
Rohit


Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Ted Dunning <te...@gmail.com>.
This is an incredibly tiny dataset.  If you delete singletons, it is likely
to get significantly smaller.

I think that something like LDA might work much better for you. It was
designed to work on small data like this.
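
For readers with the same question as above: a common way to apply LDA to
implicit feedback is to treat each user's list of preferred item IDs as a
"document" and items as "words"; the learned per-user topic distributions then
give a dense representation over which user-user similarity can be computed.
Below is a minimal collapsed Gibbs sampler, purely illustrative (a real run
would use a maintained LDA implementation such as Mahout's).

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: one "document" per user, each a list of item IDs in [0, vocab_size).
    Returns per-user topic distributions (len(docs) x num_topics).
    """
    rng = random.Random(seed)
    K = num_topics
    ndk = [[0] * K for _ in docs]                # user-topic counts
    nkw = [[0] * vocab_size for _ in range(K)]   # topic-item counts
    nk = [0] * K                                 # topic totals
    z = []                                       # topic assignment per token
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            t = rng.randrange(K)
            assignments.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(assignments)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Resample proportional to P(topic | user) * P(item | topic).
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + vocab_size * beta) for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Smoothed, normalized user-topic distributions.
    return [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
            for d in range(len(docs))]
```

User-user similarity can then be, e.g., a cosine over the returned rows, which
sidesteps the sparse-overlap problem the raw metrics run into on tiny data.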



Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Parimi Rohit <ro...@gmail.com>.
Ted, Thanks for your response. Following is the information about the
approach and the datasets:

I am using the ItemSimilarityJob and passing it "itemID, userID,
prefCount" tuples as input, i.e., with the user and item columns swapped, so
that the job computes user-user similarity using LLR. I read about this
approach in a response to one of the Stack Overflow questions on calculating
user similarity with Mahout.
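
The trick relies on ItemSimilarityJob computing similarity between whatever
sits in its item column, so swapping the columns yields user-user similarity.
A tiny sketch of the swap (a hypothetical helper, not Mahout code):

```python
def swap_user_item(lines):
    """Swap the first two fields of 'userID,itemID,prefCount' records so that
    an item-similarity job run on the output compares users instead of items."""
    swapped = []
    for line in lines:
        user, item, count = line.strip().split(",")
        swapped.append(f"{item},{user},{count}")
    return swapped
```

For example, `swap_user_item(["u1,i9,3"])` returns `["i9,u1,3"]`.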


Following are the stats for the datasets:

Coauthor dataset:

users = 29189
items = 140091
averageItemsClicked = 15.808660796875536

Conference dataset:

users = 29189
items = 2393
averageItemsClicked = 7.265099866388023

Reference dataset:

users = 29189
items = 201570
averageItemsClicked = 61.08564870327863

By scale, did you mean rating scale? If so, I am using preference counts,
not ratings.

Thanks,
Rohit
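
A quick back-of-the-envelope on the stats above (plain arithmetic on the
quoted numbers) shows the total interaction counts and how sparse each
user-item matrix is, which is the context for the "scale" question:

```python
datasets = {
    # name: (users, items, average items clicked per user), as quoted above
    "coauthor":   (29189, 140091, 15.808660796875536),
    "conference": (29189, 2393, 7.265099866388023),
    "reference":  (29189, 201570, 61.08564870327863),
}

for name, (users, items, avg) in datasets.items():
    interactions = users * avg   # total observed preferences
    density = avg / items        # fraction of items an average user touched
    print(f"{name}: ~{interactions:,.0f} interactions, density {density:.4%}")
```

Even the densest of the three (conference, around 0.3%) is very sparse, and
coauthor and reference sit near 0.01-0.03%.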



Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Ted Dunning <te...@gmail.com>.
How are you using LLR to compute user similarity? It is normally used to
compute item similarity.

Also, what is your scale? How many users? How many items? How many
actions per user?


