Posted to user@mahout.apache.org by Parimi Rohit <ro...@gmail.com> on 2014/10/01 02:04:16 UTC

Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Ted,

I know LDA can be used to model text data but never used it in this
setting. Can you please give me some pointers on how I can apply it in this
setting?

Thanks,
Rohit

On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning <te...@gmail.com> wrote:

> This is an incredibly tiny dataset.  If you delete singletons, it is likely
> to get significantly smaller.
>
> I think that something like LDA might work much better for you. It was
> designed to work on small data like this.
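The LDA suggestion above can be sketched as follows: treat each user as a "document" whose "words" are the IDs of the items they interacted with, fit a topic model to the user-item count matrix, and compare users by their topic distributions instead of raw item overlap. scikit-learn's LatentDirichletAllocation stands in here for Mahout's LDA, and the tiny count matrix and parameter choices are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows = users, columns = items, values = preference counts (toy data).
counts = np.array([
    [3, 1, 0, 0],
    [2, 2, 0, 1],
    [0, 0, 4, 2],
])

# Fit LDA on the "users as documents of item tokens" matrix.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)            # per-user topic weights
theta = theta / theta.sum(axis=1, keepdims=True)  # normalize to distributions

def topic_sim(a, b):
    """User-user similarity as the cosine of their topic distributions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(topic_sim(theta[0], theta[1]))
print(topic_sim(theta[0], theta[2]))
```

Because similarity is computed in the low-dimensional topic space, two users can score as similar even with no items in common, which is the main advantage over direct overlap counting on sparse data like this.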
>
>
> On Tue, Sep 30, 2014 at 11:13 AM, Parimi Rohit <ro...@gmail.com>
> wrote:
>
> > Ted, Thanks for your response. Following is the information about the
> > approach and the datasets:
> >
> > I am using the ItemSimilarityJob and passing it "itemID, userID,
> > prefCount" tuples as input to compute user-user similarity using LLR. I
> > read about this approach in a response to one of the Stack Overflow
> > questions on calculating user similarity with Mahout.
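The transposed-input trick described above can be sketched in plain Python: an item-similarity routine scores the *first* column of its input against itself, so feeding it (itemID, userID) pairs instead of (userID, itemID) pairs yields user-user scores. The `llr` function below follows Dunning's 2x2 log-likelihood ratio (the same quantity behind Mahout's SIMILARITY_LOGLIKELIHOOD); the event data and the simplified `k22` (counted over distinct contexts rather than total interactions, as Mahout does) are illustrative assumptions, not Mahout's exact implementation.

```python
from collections import defaultdict
from itertools import combinations
from math import log

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.

    Zero when the two events are independent; large when they are strongly
    associated (in either direction).
    """
    def h(*counts):
        total = sum(counts)
        return sum(k * log(k / total) for k in counts if k > 0)
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)    # row sums
                  - h(k11 + k21, k12 + k22))   # column sums

def pairwise_llr(pairs):
    """pairs = (thing, context) tuples; scores thing-thing similarity."""
    contexts = defaultdict(set)
    for thing, ctx in pairs:
        contexts[thing].add(ctx)
    universe = {c for s in contexts.values() for c in s}
    scores = {}
    for a, b in combinations(sorted(contexts), 2):
        k11 = len(contexts[a] & contexts[b])   # contexts shared by both
        k12 = len(contexts[a] - contexts[b])   # only a
        k21 = len(contexts[b] - contexts[a])   # only b
        k22 = len(universe) - k11 - k12 - k21  # neither (simplified)
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores

# Toy events: (user, item) preferences.
events = [("u1", "i1"), ("u1", "i2"), ("u1", "i3"),
          ("u2", "i1"), ("u2", "i2"), ("u2", "i4"),
          ("u3", "i5")]

# Normal orientation, (item, user): item-item similarity over shared users.
item_sims = pairwise_llr([(i, u) for u, i in events])
# Transposed orientation, (user, item): user-user similarity over shared items.
user_sims = pairwise_llr(events)
print(user_sims[("u1", "u2")])  # score for the pair sharing items i1 and i2
```

One caveat worth noting: the LLR is a test of *association*, so a pair of users with strongly disjoint item sets can also score high; Mahout's job keeps only positively associated pairs.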
> >
> >
> > Following are the stats for the datasets:
> >
> > Coauthor dataset:
> >
> > users = 29189
> > items = 140091
> > averageItemsClicked = 15.808660796875536
> >
> > Conference Dataset:
> >
> > users = 29189
> > items = 2393
> > averageItemsClicked = 7.265099866388023
> >
> > Reference Dataset:
> >
> > users = 29189
> > items = 201570
> > averageItemsClicked = 61.08564870327863
> >
> > By scale, did you mean rating scale? If so, I am using preference counts,
> > not ratings.
> >
> > Thanks,
> > Rohit
> >
> >
> > On Tue, Sep 30, 2014 at 12:08 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > How are you using LLR to compute user similarity?  It is normally used
> > > to compute item similarity.
> > >
> > > Also, what is your scale?  how many users? how many items?  how many
> > > actions per user?
> > >
> > >
> > >
> > > On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit <ro...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am exploring a random-walk based algorithm for recommender systems
> > > which
> > > > works by propagating the item preferences for users on the user-user
> > > graph.
> > > > To do this, I have to compute user-user similarity and form a
> > > neighborhood.
> > > > I have tried the following three simple techniques to compute the
> score
> > > > between two users and find the neighborhood.
> > > >
> > > > 1. Score = (common items between users A and B) / (items preferred by
> > > > A + items preferred by B)
> > > > 2. Scoring based on Mahout's Cosine Similarity
> > > > 3. Scoring based on Mahout's LogLikelihood similarity.
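Technique 1 above can be written down next to a boolean cosine for comparison; both reduce to set operations on each user's preferred items. The user/item names are made-up toy data, and this is only a sketch of the two scores, not Mahout's implementation.

```python
def overlap_score(items_a, items_b):
    """Technique 1: |A intersect B| / (|A| + |B|)."""
    return len(items_a & items_b) / (len(items_a) + len(items_b))

def boolean_cosine(items_a, items_b):
    """Cosine similarity over boolean (0/1) preference vectors,
    which simplifies to |A intersect B| / sqrt(|A| * |B|)."""
    return len(items_a & items_b) / ((len(items_a) * len(items_b)) ** 0.5)

a = {"paper1", "paper2", "paper3"}
b = {"paper2", "paper3", "paper4"}
print(overlap_score(a, b))   # 2 / 6 ~= 0.333
print(boolean_cosine(a, b))  # 2 / 3 ~= 0.667
```

Note that technique 1 is half the Dice coefficient, so on boolean data it ranks pairs very similarly to cosine; the differences in results reported here are more likely to come from LLR's independence correction than from the ratio itself.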
> > > >
> > > > My understanding is that similarity based on LogLikelihood is more
> > > > robust; however, I get better results using the naive approach
> > > > (technique 1 from the above list). The problems I am addressing are
> > > > collaborator recommendation, conference recommendation, and reference
> > > > recommendation, and the data has implicit feedback.
> > > >
> > > > So, my question is: are there any cases where the cosine similarity
> > > > and loglikelihood metrics fail to capture similarity? For example, in
> > > > the problems stated above, users only collaborate with a few other
> > > > users (based on area of interest), publish in only a few conferences
> > > > (again based on area of interest), and refer to publications in a
> > > > specific domain, so the preference counts are fairly small compared
> > > > to other domains (music/video etc.).
> > > >
> > > > Secondly, for CosineSimilarity, should I treat the preferences as
> > > > boolean or use the counts? (I think the loglikelihood metric does not
> > > > take the preference counts into account... correct me if I am wrong.)
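The boolean-vs-counts question can be made concrete with a small sketch: the same two users scored by cosine on raw preference counts and on binarized preferences. With counts, a single heavily repeated item can dominate the score. The count vectors below are made up for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two preference vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Preference counts over five items; both users hit item 0 nine times
# but otherwise prefer disjoint items.
a = [9, 1, 1, 0, 0]
b = [9, 0, 0, 1, 1]

counts_sim = cosine(a, b)  # 81/83 ~= 0.976, dominated by the shared count-9 item
bool_sim = cosine([min(x, 1) for x in a],
                  [min(x, 1) for x in b])  # 1/3 ~= 0.333 on the boolean view
print(counts_sim, bool_sim)
```

On clicked-style implicit data, binarizing (or at least damping counts, e.g. with log) is a common choice precisely because of this effect.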
> > > >
> > > > Any insight into this is much appreciated.
> > > >
> > > > Thanks,
> > > > Rohit
> > > >
> > > > p.s. Ted, Pat: I am following the discussion on the thread
> > > > "LogLikelihoodSimilarity Calculation" and your answers helped me a lot
> > > > to understand how it works and made me wonder why things are different
> > > > in my case.
> > > >
> > >
> >
>

Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Parimi Rohit <ro...@gmail.com>.
Thanks Ted! Will look into it.

Rohit

On Wed, Oct 1, 2014 at 1:04 AM, Ted Dunning <te...@gmail.com> wrote:

> Here is a paper that includes an analysis of voting patterns using LDA.
>
> http://arxiv.org/pdf/math/0604410.pdf
>

Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

Posted by Ted Dunning <te...@gmail.com>.
Here is a paper that includes an analysis of voting patterns using LDA.

http://arxiv.org/pdf/math/0604410.pdf



On Tue, Sep 30, 2014 at 7:04 PM, Parimi Rohit <ro...@gmail.com>
wrote:

> Ted,
>
> I know LDA can be used to model text data but never used it in this
> setting. Can you please give me some pointers on how I can apply it in this
> setting?
>
> Thanks,
> Rohit