Posted to user@mahout.apache.org by Matthew Runo <ma...@gmail.com> on 2011/02/16 23:24:42 UTC

Sparse data & Item Similarity

Hello folks -

(I think that) I'm running into an issue with my user data being too
sparse with my item-item similarity calculations. A typical item_id in
my data might have about 2000 links to other items, but very few
"combinations" of users have viewed the same products.

For example we have two items, 1244 and 2319 - and there are only
three users in common between them.

So, there's only those three users who viewed both items. I'm
assigning preferences to different types of actions in my data.. and
since all three users did the same action towards the item, they have
the same preference value. Maybe I just need to start with a bigger
set of data to get more links between items in different "actions" in
order to spread out the generated similarities? I'm using the
EuclideanDistanceSimilarity to do the final computation.

I think this is leading to a huge number of "1" values being returned.
Nearly 72% of my item-item similarities are 1.0. I feel that this is
invalid, but I'm not quite sure of the best way to attack it.

There are some similarities of 1 where the items do not appear to be
similar at all, and the best explanation I've been able to come up with
for how the 1 came about is that only one user had a link between both
items, so that single user's identical preference values make the pair
look perfectly similar.

How many item-user-item combinations per item pair does it take to get
good output?

Sorry if I'm not quite describing my problem in the proper terms..

--Matthew Runo

Re: Sparse data & Item Similarity

Posted by Ted Dunning <te...@gmail.com>.
When you have small counts, lots of item similarity measures fall apart by giving high scores to hapax phenomena (co-occurrences seen only once).

Log-likelihood ratio scores can help with this by letting you filter away uninteresting coincidences.
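Roughly, as a sketch (my own illustration of the G^2 form, not a Mahout class; the counts are made up): build the 2x2 co-occurrence table for an item pair and score it.

public final class LlrSketch {

  // x * ln(x), with 0 * ln(0) taken as 0
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // unnormalized entropy of a set of counts
  private static double entropy(long... counts) {
    long sum = 0;
    double logged = 0.0;
    for (long c : counts) {
      logged += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - logged;
  }

  // k11 = users who touched both items, k12/k21 = only one, k22 = neither
  public static double llr(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + colEntropy - matEntropy);
  }

  public static void main(String[] args) {
    // made-up counts: 3 users in common between two lightly-viewed items,
    // out of 100,000 users total
    System.out.println(llr(3, 17, 22, 99958));
  }
}

Pairs whose co-occurrence looks like chance score low and can be dropped; pairs that co-occur far more often than chance would predict score high.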

Sent from my iPhone

On Feb 16, 2011, at 2:24 PM, Matthew Runo <ma...@gmail.com> wrote:

> Hello folks -
> 
> (I think that) I'm running into an issue with my user data being too
> sparse with my item-item similarity calculations. A typical item_id in
> my data might have about 2000 links to other items, but very few
> "combinations" of users have viewed the same products.
> 
> For example we have two items, 1244 and 2319 - and there are only
> three users in common between them.
> 
> So, there's only those three users who viewed both items. I'm
> assigning preferences to different types of actions in my data.. and
> since all three users did the same action towards the item, they have
> the same preference value. Maybe I just need to start with a bigger
> set of data to get more links between items in different "actions" in
> order to spread out the generated similarities? I'm using the
> EuclideanDistanceSimilarity to do the final computation.
> 
> I think this is leading to a huge number of "1" values being returned.
> Nearly 72% of my item-item similarities are 1.0. I feel that this is
> invalid, but I'm not quite sure of the best way to attack it.
> 
> There are some similarities of 1 where the items do not appear to be
> similar at all, and the best I've been able to come up with as to how
> the 1 came around was that there was only one user who had a link
> between them and so that one user.
> 
> How many item-user-item combinations per item pair does it take to get
> good output?
> 
> Sorry if I'm not quite describing my problem in the proper terms..
> 
> --Matthew Runo

Re: Sparse data & Item Similarity

Posted by Ted Dunning <te...@gmail.com>.
Another option here is to use a biased ratio.  This is more common in
computing popularity.  If you take the average popularity as k_0 / n_0, then
you can estimate the popularity of a thing that has been liked/viewed/rated
k times out of n opportunities as (k + k_0) / (n + n_0).  Pick n_0 to
determine the degree of skepticism the system has and how much data it takes
to overcome its preconceived estimate of popularity.  Picking n_0 will fix
k_0 because the ratio has to match the average rate.

This trick has surprisingly deep mathematical roots and works pretty darned
well.
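In code the smoothing is only a couple of lines. A made-up sketch (names and numbers are mine, this isn't anything in Mahout):

public final class SmoothedPopularity {

  private final double k0; // pseudo-count of positive events
  private final double n0; // pseudo-count of opportunities

  // averageRate = k_0 / n_0 is the global average; n_0 sets the skepticism
  public SmoothedPopularity(double averageRate, double n0) {
    this.n0 = n0;
    this.k0 = averageRate * n0; // choosing n_0 fixes k_0
  }

  // estimated popularity of a thing seen k times out of n opportunities
  public double estimate(long k, long n) {
    return (k + k0) / (n + n0);
  }

  public static void main(String[] args) {
    SmoothedPopularity pop = new SmoothedPopularity(0.01, 50.0);
    System.out.println(pop.estimate(3, 10));     // little data: pulled toward the 1% prior
    System.out.println(pop.estimate(300, 1000)); // lots of data: close to the raw 30%
  }
}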

On Wed, Feb 16, 2011 at 10:59 PM, Sean Owen <sr...@gmail.com> wrote:

> > Second: I agree that the likelihood approach (i.e. boolean preferences)
> helps a lot with sparse data.  So, my question is given a simple
> log-likelihood log(r/m+n) where r is the number of prefs in common and m+n
> is the total number of prefs in the two vectors, and the Pearson correlation
> of the intersection, wouldn't the product of these two approximate the true
> cosine similarity taking into account the ratings?
>
> (That's not quite log-likelihood -- looks more like the Tanimoto
> coefficient of r/m+n-r. LL is something a bit more subtle.)
>
> I'd hesitate to call what you have in mind the "true" cosine
> similarity for the reason above. It's really the result of inferring 0
> for missing data, which is less true to the data.

Re: Sparse data & Item Similarity

Posted by Sean Owen <sr...@gmail.com>.
On Thu, Feb 17, 2011 at 5:58 AM, Chris Schilling <ch...@cellixis.com> wrote:
> First.  It is apparent that when dealing with sparse data, which most CF systems seem to, the Pearson/cosine/Euclidean similarity metrics are not extremely useful.  They do seem to be very useful, however, when dealing with dense vectors/matrices.

I would say that in any data set you have some pockets of dense-ness
and a large long tail of sparse-ness -- items which co-occur once for
example. And these metrics don't take much account of the chance of
coincidence or anomalous-ness. So, your results are more frequently
skewed by anomalous pairs.

This is an example of how ignoring some data can help when that data is
noise, and a surprising amount of data is noise. Ignoring the preference
values entirely (boolean data) helped here. That's not always true, but it
is true more often than you'd imagine.

(But see below about how to make Pearson more usable in practice.)


> One question I have regarding the cosine similarity: it seems this is calculated with respect to the intersection of the two vectors.  What would happen if we actually divided the dot product by the total magnitudes (i.e. not just the magnitude of the intersection)?  Wouldn't that place more weight on the vectors which have more ratings in common?

This would have the effect of assuming that the "missing" values
(where one user has a rating and other doesn't) are 0. If "0" in your
rating system happens to correspond to a very neutral value, that's
pretty valid. If you're on a 1-5 rating system, it probably isn't. It
would be like assuming that anything you haven't watched is utterly
hated.

You could fill in the user's average rating for missing values. This
is what PreferenceInferrer does for you in the API if you like. It'll
slow things down but you can try it to combat this effect.

There's already an option in the code to weight the result by the
count of data points (items in common) -- use Weighting.WEIGHTED. It
simply pushes the result closer to 1 or -1 in a reasonable way.
There's nothing magical or particularly valid or invalid about the
math there, but it does what you want without some unwanted assumptions.
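For concreteness, a rough sketch of both options (constructor and class names from the Taste API as I remember them, over a user,item,pref CSV -- double-check against your version):

import java.io.File;

import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class PearsonOptions {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));

    // option 1: weight the correlation by how much data it was based on
    PearsonCorrelationSimilarity weighted =
        new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);
    System.out.println(weighted.itemSimilarity(1244L, 2319L));

    // option 2: infer the user's average rating for missing values
    // (slower, but counters the intersection-only effect described above)
    PearsonCorrelationSimilarity inferring = new PearsonCorrelationSimilarity(model);
    inferring.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
    System.out.println(inferring.userSimilarity(1L, 2L)); // made-up user IDs
  }
}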


> Second: I agree that the likelihood approach (i.e. boolean preferences) helps a lot with sparse data.  So, my question is given a simple log-likelihood log(r/m+n) where r is the number of prefs in common and m+n is the total number of prefs in the two vectors, and the Pearson correlation of the intersection, wouldn't the product of these two approximate the true cosine similarity taking into account the ratings?

(That's not quite log-likelihood -- looks more like the Tanimoto
coefficient, r/(m+n-r). LL is something a bit more subtle.)

I'd hesitate to call what you have in mind the "true" cosine
similarity for the reason above. It's really the result of inferring 0
for missing data, which is less true to the data.

The product of these two similarities may be useful, but I don't know
that it has any particular mathematical interpretation.

>
> The main problem with the likelihood is that it does not take into account one user disliking and another user liking the same item.  This seems to be more important in dealing with very sparse data.  However, I do understand the motivation, especially given that users more generally rate what they like and less what they dislike.

One common approach is to segment the data into, effectively, two
boolean models: things I hate and things I like. Then recommend from
both, and use the output of both to determine how relatively likely it
is that you love vs. hate an item.
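As a toy sketch of the splitting step (the thresholds and plain maps are made up; each resulting set could then back its own boolean-preference model):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LikeHateSplit {
  public static void main(String[] args) {
    // userID -> (itemID -> rating); tiny made-up data set
    Map<Long, Map<Long, Double>> ratings = new HashMap<>();
    ratings.put(1L, Map.of(1244L, 5.0, 2319L, 1.0));
    ratings.put(2L, Map.of(1244L, 4.0, 2319L, 3.0));

    Map<Long, Set<Long>> likes = new HashMap<>();    // ratings >= 4
    Map<Long, Set<Long>> dislikes = new HashMap<>(); // ratings <= 2

    ratings.forEach((user, prefs) -> prefs.forEach((item, r) -> {
      if (r >= 4.0) {
        likes.computeIfAbsent(user, u -> new HashSet<>()).add(item);
      } else if (r <= 2.0) {
        dislikes.computeIfAbsent(user, u -> new HashSet<>()).add(item);
      } // middling ratings are simply dropped in this toy version
    }));

    System.out.println("likes: " + likes);
    System.out.println("dislikes: " + dislikes);
  }
}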

Re: Sparse data & Item Similarity

Posted by Lance Norskog <go...@gmail.com>.
Another approach is to say that the distance measures are only
interesting close up, but farther measurements are dubious.  Assign a
usefulness factor to each distance, maybe the log of the distance
normalized to 0->1. You can then apply fuzzy math algebra.
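One possible reading of that, as a toy sketch (the exact mapping is my guess at "the log of the distance normalized to 0->1"):

public final class DistanceUsefulness {
  // weight in [0,1]: about 1 for near distances, decaying toward 0 as the
  // distance approaches a chosen maximum -- a guess, not a standard formula
  static double usefulness(double distance, double maxDistance) {
    if (distance <= 0.0) {
      return 1.0;
    }
    double w = 1.0 - Math.log1p(distance) / Math.log1p(maxDistance);
    return Math.max(0.0, Math.min(1.0, w));
  }

  public static void main(String[] args) {
    System.out.println(usefulness(0.1, 10.0)); // close: weight near 1
    System.out.println(usefulness(9.0, 10.0)); // far: weight near 0
  }
}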

On Thu, Feb 17, 2011 at 1:03 AM, Dinesh B Vadhia
<di...@hotmail.com> wrote:
>>>
>>> The main problem with the likelihood is that it does not take into account
>>> one user disliking and another user liking the same item.  This seems to be
>>> more important in dealing with very sparse data.  However, I do understand
>>> the motivation, especially given that users more generally rate what they
>>> like and less what they dislike.
>>>
>
>> This is unfortunately really complicated.  If you are talking about ratings,
>> then negative ratings tell you more about what somebody likes than about
>> what they dislike.  If you are talking about implicit data, then negative
>> ratings are all items with which the user *might* have interacted (i.e.
>> roughly a skazillion things).  Mostly they don't interact with these things
>> because something else caught their eye or they are in a bad mood.  That
>> doesn't mean much.
>
>
> Its not really complicated to take into account one user liking and another user disliking the same item if an appropriate model is used.  The problem with CF is that its support of sparsity is poor particualrly for big data.
>
>
>
> From: Ted Dunning
> Sent: Wednesday, February 16, 2011 11:33 PM
> To: user@mahout.apache.org
> Cc: Chris Schilling
> Subject: Re: Sparse data & Item Similarity
>
>
> On Wed, Feb 16, 2011 at 9:58 PM, Chris Schilling <ch...@cellixis.com> wrote:
>
>> First.  It is apparent that when dealing with sparse data, which most CF
>> systems seem to, the Pearson/cosine/Euclidean similarity metrics are not
>> extremely useful.  They do seem to be very useful, however, when dealing
>> with dense vectors/matrices.
>>
>
> Seems about right.
>
>
>> One question I have regarding the cosine similarity: it seems this is
>> calculated with respect to the intersection of the two vectors.  What would
>> happen if we actually divided the dot product by the total magnitudes (i.e.
>> not just the magnitude of the intersection)?  Wouldn't that place more
>> weight on the vectors which have more ratings in common?
>>
>
> Cosine is defined as the dot product over the product of the L_2 magnitudes
> so it is normalized to the -1 to 1 range.
>
> That isn't really the problem.  The problem is cases where you have two
> users to rated (interacted with) exactly one item and that happens to be the
> same item.
>
> You can divide by the product of the L_1 or L_0 norms, but that doesn't
> change the situation much.
>
> Second: I agree that the likelihood approach (i.e. boolean preferences)
>> helps a lot with sparse data.  So, my question is given a simple
>> log-likelihood log(r/m+n) where r is the number of prefs in common and m+n
>> is the total number of prefs in the two vectors, and the Pearson correlation
>> of the intersection, wouldn't the product of these two approximate the true
>> cosine similarity taking into account the ratings?
>>
>
> That isn't log-likelihood.
>
> It is reasonable to use something like  (LLR > 10) * pearson as a measure.
>  What this does is sparsify the pearson measure to only contain interesting
> values.
>
>
>>
>> The main problem with the likelihood is that it does not take into account
>> one user disliking and another user liking the same item.  This seems to be
>> more important in dealing with very sparse data.  However, I do understand
>> the motivation, especially given that users more generally rate what they
>> like and less what they dislike.
>>
>
> This is unfortunately really complicated.  If you are talking about ratings,
> then negative ratings tell you more about what somebody likes than about
> what they dislike.  If you are talking about implicit data, then negative
> ratings are all items with which the user *might* have interacted (i.e.
> roughly a skazillion things).  Mostly they don't interact with these things
> because something else caught their eye or they are in a bad mood.  That
> doesn't mean much.
>
>
>>
>> Just trying to get a more intuitive feel for CF.  Hopefully these questions
>> are not way off base...
>>
>
> They are good.
>



-- 
Lance Norskog
goksron@gmail.com

Re: Sparse data & Item Similarity

Posted by Dinesh B Vadhia <di...@hotmail.com>.
>> 
>> The main problem with the likelihood is that it does not take into account
>> one user disliking and another user liking the same item.  This seems to be
>> more important in dealing with very sparse data.  However, I do understand
>> the motivation, especially given that users more generally rate what they
>> like and less what they dislike.
>>

> This is unfortunately really complicated.  If you are talking about ratings,
> then negative ratings tell you more about what somebody likes than about
> what they dislike.  If you are talking about implicit data, then negative
> ratings are all items with which the user *might* have interacted (i.e.
> roughly a skazillion things).  Mostly they don't interact with these things
> because something else caught their eye or they are in a bad mood.  That
> doesn't mean much.


It's not really complicated to take into account one user liking and another user disliking the same item if an appropriate model is used.  The problem with CF is that its support for sparsity is poor, particularly for big data.



From: Ted Dunning 
Sent: Wednesday, February 16, 2011 11:33 PM
To: user@mahout.apache.org 
Cc: Chris Schilling 
Subject: Re: Sparse data & Item Similarity


On Wed, Feb 16, 2011 at 9:58 PM, Chris Schilling <ch...@cellixis.com> wrote:

> First.  It is apparent that when dealing with sparse data, which most CF
> systems seem to, the Pearson/cosine/Euclidean similarity metrics are not
> extremely useful.  They do seem to be very useful, however, when dealing
> with dense vectors/matrices.
>

Seems about right.


> One question I have regarding the cosine similarity: it seems this is
> calculated with respect to the intersection of the two vectors.  What would
> happen if we actually divided the dot product by the total magnitudes (i.e.
> not just the magnitude of the intersection)?  Wouldn't that place more
> weight on the vectors which have more ratings in common?
>

Cosine is defined as the dot product over the product of the L_2 magnitudes
so it is normalized to the -1 to 1 range.

That isn't really the problem.  The problem is cases where you have two
users to rated (interacted with) exactly one item and that happens to be the
same item.

You can divide by the product of the L_1 or L_0 norms, but that doesn't
change the situation much.

Second: I agree that the likelihood approach (i.e. boolean preferences)
> helps a lot with sparse data.  So, my question is given a simple
> log-likelihood log(r/m+n) where r is the number of prefs in common and m+n
> is the total number of prefs in the two vectors, and the Pearson correlation
> of the intersection, wouldn't the product of these two approximate the true
> cosine similarity taking into account the ratings?
>

That isn't log-likelihood.

It is reasonable to use something like  (LLR > 10) * pearson as a measure.
 What this does is sparsify the pearson measure to only contain interesting
values.


>
> The main problem with the likelihood is that it does not take into account
> one user disliking and another user liking the same item.  This seems to be
> more important in dealing with very sparse data.  However, I do understand
> the motivation, especially given that users more generally rate what they
> like and less what they dislike.
>

This is unfortunately really complicated.  If you are talking about ratings,
then negative ratings tell you more about what somebody likes than about
what they dislike.  If you are talking about implicit data, then negative
ratings are all items with which the user *might* have interacted (i.e.
roughly a skazillion things).  Mostly they don't interact with these things
because something else caught their eye or they are in a bad mood.  That
doesn't mean much.


>
> Just trying to get a more intuitive feel for CF.  Hopefully these questions
> are not way off base...
>

They are good.

Re: Sparse data & Item Similarity

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Feb 16, 2011 at 9:58 PM, Chris Schilling <ch...@cellixis.com> wrote:

> First.  It is apparent that when dealing with sparse data, which most CF
> systems seem to, the Pearson/cosine/Euclidean similarity metrics are not
> extremely useful.  They do seem to be very useful, however, when dealing
> with dense vectors/matrices.
>

Seems about right.


> One question I have regarding the cosine similarity: it seems this is
> calculated with respect to the intersection of the two vectors.  What would
> happen if we actually divided the dot product by the total magnitudes (i.e.
> not just the magnitude of the intersection)?  Wouldn't that place more
> weight on the vectors which have more ratings in common?
>

Cosine is defined as the dot product over the product of the L_2 magnitudes
so it is normalized to the -1 to 1 range.

That isn't really the problem.  The problem is cases where you have two
users who rated (interacted with) exactly one item, and that happens to be
the same item.

You can divide by the product of the L_1 or L_0 norms, but that doesn't
change the situation much.

Second: I agree that the likelihood approach (i.e. boolean preferences)
> helps a lot with sparse data.  So, my question is given a simple
> log-likelihood log(r/m+n) where r is the number of prefs in common and m+n
> is the total number of prefs in the two vectors, and the Pearson correlation
> of the intersection, wouldn't the product of these two approximate the true
> cosine similarity taking into account the ratings?
>

That isn't log-likelihood.

It is reasonable to use something like  (LLR > 10) * pearson as a measure.
 What this does is sparsify the pearson measure to only contain interesting
values.
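In code it's nothing more than this sketch (the threshold of 10 is just the number mentioned above):

public final class SparsifiedPearson {
  // keep the Pearson value only where the pair's LLR score clears the
  // threshold; otherwise treat the pair as having no useful relationship
  static double similarity(double llr, double pearson) {
    return llr > 10.0 ? pearson : 0.0;
  }
}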


>
> The main problem with the likelihood is that it does not take into account
> one user disliking and another user liking the same item.  This seems to be
> more important in dealing with very sparse data.  However, I do understand
> the motivation, especially given that users more generally rate what they
> like and less what they dislike.
>

This is unfortunately really complicated.  If you are talking about ratings,
then negative ratings tell you more about what somebody likes than about
what they dislike.  If you are talking about implicit data, then negative
ratings are all items with which the user *might* have interacted (i.e.
roughly a skazillion things).  Mostly they don't interact with these things
because something else caught their eye or they are in a bad mood.  That
doesn't mean much.


>
> Just trying to get a more intuitive feel for CF.  Hopefully these questions
> are not way off base...
>

They are good.

Re: Sparse data & Item Similarity

Posted by Chris Schilling <ch...@cellixis.com>.
So, 

I am currently enthralled by this discussion.  I just have a few questions regarding the use of similarity metrics in CF.  

First.  It is apparent that when dealing with sparse data, which most CF systems seem to, the Pearson/cosine/Euclidean similarity metrics are not extremely useful.  They do seem to be very useful, however, when dealing with dense vectors/matrices.  

One question I have regarding the cosine similarity: it seems this is calculated with respect to the intersection of the two vectors.  What would happen if we actually divided the dot product by the total magnitudes (i.e. not just the magnitude of the intersection)?  Wouldn't that place more weight on the vectors which have more ratings in common?

Second: I agree that the likelihood approach (i.e. boolean preferences) helps a lot with sparse data.  So, my question is given a simple log-likelihood log(r/m+n) where r is the number of prefs in common and m+n is the total number of prefs in the two vectors, and the Pearson correlation of the intersection, wouldn't the product of these two approximate the true cosine similarity taking into account the ratings? 

The main problem with the likelihood is that it does not take into account one user disliking and another user liking the same item.  This seems to be more important in dealing with very sparse data.  However, I do understand the motivation, especially given that users more generally rate what they like and less what they dislike. 

Just trying to get a more intuitive feel for CF.  Hopefully these questions are not way off base...

Thanks for all the help, great work!
Chris



On Feb 16, 2011, at 9:28 PM, Lance Norskog wrote:

> If I was the business, I would analyze the "put in cart but did not
> buy" list. Negative ratings are just as useful as positive ratings.
> Possibly this gives a +1/-1 ternary value?
> 
> On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <te...@gmail.com> wrote:
>> My experience is that there is a very small number of events that indicates real engagement. Using them in the form of Boolean preferences helps results. A lot.
>> 
>> Using all of the other events that do not indicate engagement is a total waste of resources because you are simply teaching the machine about things you don't care about.
>> 
>> Moreover there are probably some kinds of events that vastly outnumber others. Events that are less than 1% of your can matter bit often not.
>> 
>> The valuable secret sauce you will gain is which events are which. Which make your system sing and which ones just clog up the drains.
>> 
>> Matthew wrote:
>> users can do.. "view", "add to cart", and "buy" which I've assigned
>> different preference values to. Perhaps it would be better to simply
>> use boolean yes/no in my case?
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com


Re: Sparse data & Item Similarity

Posted by Lance Norskog <go...@gmail.com>.
The data in this instance is: "the customer decides to buy it and then
changes his mind". For recommendation this is probably not useful. For
business intelligence it's gold: "what is it about this item that
makes someone say 'hhmmmmmmm.... no'?".

In the case of Netflix ratings, there were these wonderful vectors
with chick flicks at one end and Star Trek at the other, or slasher
movies vs. Harry Potter. What is a technique for finding those?

On Thu, Feb 17, 2011 at 8:18 AM, Ted Dunning <te...@gmail.com> wrote:
> Negative relevance judgements can only be very powerful if the things that
> you indicate you don't want are very close to the things you do want and are
> selected based on some evidence.
>
> Uniformly selected negative relevance judgements have very close to zero
> value.
>
> On Thu, Feb 17, 2011 at 12:57 AM, Dinesh B Vadhia <dineshbvadhia@hotmail.com
>> wrote:
>
>> Not necessarily.  Ordering breakfast by indicating all the things I don't
>> want to eat is "negative relevance feedback" and can be very powerful.
>>
>>
>>
>> From: Ted Dunning
>> Sent: Wednesday, February 16, 2011 11:27 PM
>> To: user@mahout.apache.org
>> Cc: Lance Norskog
>> Subject: Re: Sparse data & Item Similarity
>>
>>
>> Actually, almost all implicit negative ratings are very close to useless.
>>
>> The analogy would be ordering breakfast in a diner by saying all the things
>> you don't want to eat to a waitress.  The waitress will shortly yearn for a
>> positive rating.
>>
>> On Wed, Feb 16, 2011 at 9:28 PM, Lance Norskog <go...@gmail.com> wrote:
>>
>> > If I was the business, I would analyze the "put in cart but did not
>> > buy" list. Negative ratings are just as useful as positive ratings.
>> > Possibly this gives a +1/-1 ternary value?
>> >
>> > On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> > > My experience is that there is a very small number of events that
>> > indicates real engagement. Using them in the form of Boolean preferences
>> > helps results. A lot.
>> > >
>> > > Using all of the other events that do not indicate engagement is a
>> total
>> > waste of resources because you are simply teaching the machine about
>> things
>> > you don't care about.
>> > >
>> > > Moreover there are probably some kinds of events that vastly outnumber
>> > others. Events that are less than 1% of your can matter bit often not.
>> > >
>> > > The valuable secret sauce you will gain is which events are which.
>> Which
>> > make your system sing and which ones just clog up the drains.
>> > >
>> > > Matthew wrote:
>> > > users can do.. "view", "add to cart", and "buy" which I've assigned
>> > > different preference values to. Perhaps it would be better to simply
>> > > use boolean yes/no in my case?
>> > >
>> >
>> >
>> >
>> > --
>> > Lance Norskog
>> > goksron@gmail.com
>> >
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Sparse data & Item Similarity

Posted by Ted Dunning <te...@gmail.com>.
Negative relevance judgements can only be very powerful if the things that
you indicate you don't want are very close to the things you do want and are
selected based on some evidence.

Uniformly selected negative relevance judgements have very close to zero
value.

On Thu, Feb 17, 2011 at 12:57 AM, Dinesh B Vadhia <dineshbvadhia@hotmail.com
> wrote:

> Not necessarily.  Ordering breakfast by indicating all the things I don't
> want to eat is "negative relevance feedback" and can be very powerful.
>
>
>
> From: Ted Dunning
> Sent: Wednesday, February 16, 2011 11:27 PM
> To: user@mahout.apache.org
> Cc: Lance Norskog
> Subject: Re: Sparse data & Item Similarity
>
>
> Actually, almost all implicit negative ratings are very close to useless.
>
> The analogy would be ordering breakfast in a diner by saying all the things
> you don't want to eat to a waitress.  The waitress will shortly yearn for a
> positive rating.
>
> On Wed, Feb 16, 2011 at 9:28 PM, Lance Norskog <go...@gmail.com> wrote:
>
> > If I was the business, I would analyze the "put in cart but did not
> > buy" list. Negative ratings are just as useful as positive ratings.
> > Possibly this gives a +1/-1 ternary value?
> >
> > On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > My experience is that there is a very small number of events that
> > indicates real engagement. Using them in the form of Boolean preferences
> > helps results. A lot.
> > >
> > > Using all of the other events that do not indicate engagement is a
> total
> > waste of resources because you are simply teaching the machine about
> things
> > you don't care about.
> > >
> > > Moreover there are probably some kinds of events that vastly outnumber
> > others. Events that are less than 1% of your can matter bit often not.
> > >
> > > The valuable secret sauce you will gain is which events are which.
> Which
> > make your system sing and which ones just clog up the drains.
> > >
> > > Matthew wrote:
> > > users can do.. "view", "add to cart", and "buy" which I've assigned
> > > different preference values to. Perhaps it would be better to simply
> > > use boolean yes/no in my case?
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>

Re: Sparse data & Item Similarity

Posted by Dinesh B Vadhia <di...@hotmail.com>.
Not necessarily.  Ordering breakfast by indicating all the things I don't want to eat is "negative relevance feedback" and can be very powerful.



From: Ted Dunning 
Sent: Wednesday, February 16, 2011 11:27 PM
To: user@mahout.apache.org 
Cc: Lance Norskog 
Subject: Re: Sparse data & Item Similarity


Actually, almost all implicit negative ratings are very close to useless.

The analogy would be ordering breakfast in a diner by saying all the things
you don't want to eat to a waitress.  The waitress will shortly yearn for a
positive rating.

On Wed, Feb 16, 2011 at 9:28 PM, Lance Norskog <go...@gmail.com> wrote:

> If I was the business, I would analyze the "put in cart but did not
> buy" list. Negative ratings are just as useful as positive ratings.
> Possibly this gives a +1/-1 ternary value?
>
> On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > My experience is that there is a very small number of events that
> indicates real engagement. Using them in the form of Boolean preferences
> helps results. A lot.
> >
> > Using all of the other events that do not indicate engagement is a total
> waste of resources because you are simply teaching the machine about things
> you don't care about.
> >
> > Moreover there are probably some kinds of events that vastly outnumber
> others. Events that are less than 1% of your can matter bit often not.
> >
> > The valuable secret sauce you will gain is which events are which. Which
> make your system sing and which ones just clog up the drains.
> >
> > Matthew wrote:
> > users can do.. "view", "add to cart", and "buy" which I've assigned
> > different preference values to. Perhaps it would be better to simply
> > use boolean yes/no in my case?
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Sparse data & Item Similarity

Posted by Sean Owen <sr...@gmail.com>.
Yeah, I personally had in mind counting explicit 1-star ratings, for instance.

On 17 Feb 2011 07:28, "Ted Dunning" <te...@gmail.com> wrote:
> Actually, almost all implicit negative ratings are very close to useless.
>
> The analogy would be ordering breakfast in a diner by saying all the
things
> you don't want to eat to a waitress. The waitress will shortly yearn for a
> positive rating.
>
> On Wed, Feb 16, 2011 at 9:28 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> If I was the business, I would analyze the "put in cart but did not
>> buy" list. Negative ratings are just as useful as positive ratings.
>> Possibly this gives a +1/-1 ternary value?
>>
>> On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > My experience is that there is a very small number of events that
>> indicates real engagement. Using them in the form of Boolean preferences
>> helps results. A lot.
>> >
>> > Using all of the other events that do not indicate engagement is a
total
>> waste of resources because you are simply teaching the machine about
things
>> you don't care about.
>> >
>> > Moreover there are probably some kinds of events that vastly outnumber
>> others. Events that are less than 1% of your can matter bit often not.
>> >
>> > The valuable secret sauce you will gain is which events are which.
Which
>> make your system sing and which ones just clog up the drains.
>> >
>> > Matthew wrote:
>> > users can do.. "view", "add to cart", and "buy" which I've assigned
>> > different preference values to. Perhaps it would be better to simply
>> > use boolean yes/no in my case?
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>

Re: Sparse data & Item Similarity

Posted by Ted Dunning <te...@gmail.com>.
Actually, almost all implicit negative ratings are very close to useless.

The analogy would be ordering breakfast in a diner by saying all the things
you don't want to eat to a waitress.  The waitress will shortly yearn for a
positive rating.

On Wed, Feb 16, 2011 at 9:28 PM, Lance Norskog <go...@gmail.com> wrote:

> If I was the business, I would analyze the "put in cart but did not
> buy" list. Negative ratings are just as useful as positive ratings.
> Possibly this gives a +1/-1 ternary value?
>
> On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > My experience is that there is a very small number of events that
> indicates real engagement. Using them in the form of Boolean preferences
> helps results. A lot.
> >
> > Using all of the other events that do not indicate engagement is a total
> waste of resources because you are simply teaching the machine about things
> you don't care about.
> >
> > Moreover there are probably some kinds of events that vastly outnumber
> others. Events that are less than 1% of your can matter bit often not.
> >
> > The valuable secret sauce you will gain is which events are which. Which
> make your system sing and which ones just clog up the drains.
> >
> > Matthew wrote:
> > users can do.. "view", "add to cart", and "buy" which I've assigned
> > different preference values to. Perhaps it would be better to simply
> > use boolean yes/no in my case?
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Sparse data & Item Similarity

Posted by Lance Norskog <go...@gmail.com>.
If I was the business, I would analyze the "put in cart but did not
buy" list. Negative ratings are just as useful as positive ratings.
Possibly this gives a +1/-1 ternary value?

On Wed, Feb 16, 2011 at 8:07 PM, Ted Dunning <te...@gmail.com> wrote:
> My experience is that there is a very small number of events that indicates real engagement. Using them in the form of Boolean preferences helps results. A lot.
>
> Using all of the other events that do not indicate engagement is a total waste of resources because you are simply teaching the machine about things you don't care about.
>
> Moreover there are probably some kinds of events that vastly outnumber others. Events that are less than 1% of your can matter bit often not.
>
> The valuable secret sauce you will gain is which events are which. Which make your system sing and which ones just clog up the drains.
>
> Matthew wrote:
> users can do.. "view", "add to cart", and "buy" which I've assigned
> different preference values to. Perhaps it would be better to simply
> use boolean yes/no in my case?
>



-- 
Lance Norskog
goksron@gmail.com

Re: Sparse data & Item Similarity

Posted by Matthew Runo <ma...@gmail.com>.
That's a very good point - one that I've thought about but not done anything about as of yet. A "product view" could be negative as much as it could be positive from a user reco standpoint - but from an item similarity standpoint, if you're shopping for plumbing widgets then everything you looked at is probably related to plumbing widgets somehow.

One thing I also have not done is remove items that were ordered and then returned.

Is anyone willing to share some thoughts on other web-store inputs that might be good? Or maybe things that might appear good but backfire?

--Matthew Runo

On Feb 16, 2011, at 8:07 PM, Ted Dunning wrote:

> My experience is that there is a very small number of events that indicates real engagement. Using them in the form of Boolean preferences helps results. A lot. 
> 
> Using all of the other events that do not indicate engagement is a total waste of resources because you are simply teaching the machine about things you don't care about. 
> 
> Moreover there are probably some kinds of events that vastly outnumber others. Events that are less than 1% of your can matter bit often not. 
> 
> The valuable secret sauce you will gain is which events are which. Which make your system sing and which ones just clog up the drains.  
> 
> Matthew wrote:
> users can do.. "view", "add to cart", and "buy" which I've assigned
> different preference values to. Perhaps it would be better to simply
> use boolean yes/no in my case?


Re: Sparse data & Item Similarity

Posted by Ted Dunning <te...@gmail.com>.
My experience is that there is a very small number of events that indicates real engagement. Using them in the form of Boolean preferences helps results. A lot. 

Using all of the other events that do not indicate engagement is a total waste of resources because you are simply teaching the machine about things you don't care about. 

Moreover there are probably some kinds of events that vastly outnumber others. Events that are less than 1% of your data can matter, but often do not.

The valuable secret sauce you will gain is which events are which. Which make your system sing and which ones just clog up the drains.  

Matthew wrote:
users can do.. "view", "add to cart", and "buy" which I've assigned
different preference values to. Perhaps it would be better to simply
use boolean yes/no in my case?

Re: Sparse data & Item Similarity

Posted by Matthew Runo <ma...@gmail.com>.
So I've only processed a tiny fraction of my data with the
LogLikelihoodSimilarity but already the output looks a lot better.

Do you think there's any benefit to storing things with small
similarities? For example, would it make sense to just filter out
things that are - say - less than 0.5? I would probably not recommend
items that are so dissimilar.

-Matthew Runo

On Wed, Feb 16, 2011 at 2:39 PM, Matthew Runo <ma...@gmail.com> wrote:
> Thank you for that suggestion. I have a few different actions that
> users can do.. "view", "add to cart", and "buy" which I've assigned
> different preference values to. Perhaps it would be better to simply
> use boolean yes/no in my case?
>
> I'll give the log likelihood stuff a try tonight and I'll report back
> in case anyone else runs into this issue.
>
> -Matthew Runo
>
> On Wed, Feb 16, 2011 at 2:31 PM, Chris Schilling <ch...@cellixis.com> wrote:
>> Mathew,
>>
>> I was running into a similar issue with my data.  I discussed it with Sean Owen offline and his advice was, in a nutshell, to use the log-likelihood similarity metric.  Since you describe your users as having only links, I assume you are not dealing with preference data.  So, with the boolean data, the log-likelihood metric works very well (in my case, which I am also dealing with very sparse data).   How do your results look if you try the likelihood approach?
>>
>> Hope this helps,
>> Chris
>>
>>
>> On Feb 16, 2011, at 2:24 PM, Matthew Runo wrote:
>>
>>> Hello folks -
>>>
>>> (I think that) I'm running into an issue with my user data being too
>>> sparse with my item-item similarity calculations. A typical item_id in
>>> my data might have about 2000 links to other items, but very few
>>> "combinations" of users have viewed the same products.
>>>
>>> For example we have two items, 1244 and 2319 - and there are only
>>> three users in common between them.
>>>
>>> So, there's only those three users who viewed both items. I'm
>>> assigning preferences to different types of actions in my data.. and
>>> since all three users did the same action towards the item, they have
>>> the same preference value. Maybe I just need to start with a bigger
>>> set of data to get more links between items in different "actions" in
>>> order to spread out the generated similarities? I'm using the
>>> EuclideanDistanceSimilarity to do the final computation.
>>>
>>> I think this is leading to a huge number of "1" values being returned.
>>> Nearly 72% of my item-item similarities are 1.0. I feel that this is
>>> invalid, but I'm not quite sure of the best way to attack it.
>>>
>>> There are some similarities of 1 where the items do not appear to be
>>> similar at all, and the best I've been able to come up with as to how
>>> the 1 came around was that there was only one user who had a link
>>> between them and so that one user.
>>>
>>> How many item-user-item combinations per item pair does it take to get
>>> good output?
>>>
>>> Sorry if I'm not quite describing my problem in the proper terms..
>>>
>>> --Matthew Runo
>>
>>
>

Re: Sparse data & Item Similarity

Posted by Matthew Runo <ma...@gmail.com>.
Thank you for that suggestion. I have a few different actions that
users can do.. "view", "add to cart", and "buy" which I've assigned
different preference values to. Perhaps it would be better to simply
use boolean yes/no in my case?

I'll give the log likelihood stuff a try tonight and I'll report back
in case anyone else runs into this issue.

-Matthew Runo

On Wed, Feb 16, 2011 at 2:31 PM, Chris Schilling <ch...@cellixis.com> wrote:
> Mathew,
>
> I was running into a similar issue with my data.  I discussed it with Sean Owen offline and his advice was, in a nutshell, to use the log-likelihood similarity metric.  Since you describe your users as having only links, I assume you are not dealing with preference data.  So, with the boolean data, the log-likelihood metric works very well (in my case, which I am also dealing with very sparse data).   How do your results look if you try the likelihood approach?
>
> Hope this helps,
> Chris
>
>
> On Feb 16, 2011, at 2:24 PM, Matthew Runo wrote:
>
>> Hello folks -
>>
>> (I think that) I'm running into an issue with my user data being too
>> sparse with my item-item similarity calculations. A typical item_id in
>> my data might have about 2000 links to other items, but very few
>> "combinations" of users have viewed the same products.
>>
>> For example we have two items, 1244 and 2319 - and there are only
>> three users in common between them.
>>
>> So, there's only those three users who viewed both items. I'm
>> assigning preferences to different types of actions in my data.. and
>> since all three users did the same action towards the item, they have
>> the same preference value. Maybe I just need to start with a bigger
>> set of data to get more links between items in different "actions" in
>> order to spread out the generated similarities? I'm using the
>> EuclideanDistanceSimilarity to do the final computation.
>>
>> I think this is leading to a huge number of "1" values being returned.
>> Nearly 72% of my item-item similarities are 1.0. I feel that this is
>> invalid, but I'm not quite sure of the best way to attack it.
>>
>> There are some similarities of 1 where the items do not appear to be
>> similar at all, and the best I've been able to come up with as to how
>> the 1 came around was that there was only one user who had a link
>> between them and so that one user.
>>
>> How many item-user-item combinations per item pair does it take to get
>> good output?
>>
>> Sorry if I'm not quite describing my problem in the proper terms..
>>
>> --Matthew Runo
>
>

Re: Sparse data & Item Similarity

Posted by Chris Schilling <ch...@cellixis.com>.
Matthew,

I was running into a similar issue with my data.  I discussed it with Sean Owen offline and his advice was, in a nutshell, to use the log-likelihood similarity metric.  Since you describe your users as having only links, I assume you are not dealing with preference data.  So, with the boolean data, the log-likelihood metric works very well (in my case, where I am also dealing with very sparse data).  How do your results look if you try the likelihood approach?
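In case it helps, here is roughly what the setup looks like (class names from the Taste API; the input file is just userID,itemID lines with no preference column -- adjust for your version):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class BooleanLlrExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("links.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    // the two item IDs from your example
    System.out.println(similarity.itemSimilarity(1244L, 2319L));
  }
}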

Hope this helps,
Chris

 
On Feb 16, 2011, at 2:24 PM, Matthew Runo wrote:

> Hello folks -
> 
> (I think that) I'm running into an issue with my user data being too
> sparse with my item-item similarity calculations. A typical item_id in
> my data might have about 2000 links to other items, but very few
> "combinations" of users have viewed the same products.
> 
> For example we have two items, 1244 and 2319 - and there are only
> three users in common between them.
> 
> So, there's only those three users who viewed both items. I'm
> assigning preferences to different types of actions in my data.. and
> since all three users did the same action towards the item, they have
> the same preference value. Maybe I just need to start with a bigger
> set of data to get more links between items in different "actions" in
> order to spread out the generated similarities? I'm using the
> EuclideanDistanceSimilarity to do the final computation.
> 
> I think this is leading to a huge number of "1" values being returned.
> Nearly 72% of my item-item similarities are 1.0. I feel that this is
> invalid, but I'm not quite sure of the best way to attack it.
> 
> There are some similarities of 1 where the items do not appear to be
> similar at all, and the best I've been able to come up with as to how
> the 1 came around was that there was only one user who had a link
> between them and so that one user.
> 
> How many item-user-item combinations per item pair does it take to get
> good output?
> 
> Sorry if I'm not quite describing my problem in the proper terms..
> 
> --Matthew Runo