Posted to user@mahout.apache.org by Ahmed Abdeen Hamed <ah...@gmail.com> on 2012/03/22 22:18:38 UTC

Merging similarities from two different approaches

Hello,

I developed a recommender that computes the distance between two items
based on contents. However, I also need to include the user-item
association. But when I do that, I end up having a similarity score from
the item-item content-based approach and another similarity score based on
the item-user association (loglikelihood). I am now designing some
experiments to consider different weights for each approach before I add
them together. Here is the mathematical model I have in mind:

LOGLIKELIHOOD_WEIGHT*(1.0 - 1.0 / (1.0 + logLikelihood)) +
(CONTENT_WEIGHT* content-proximity) such that

[1] LOGLIKELIHOOD_WEIGHT (weight between 0, 1 e.g., 0.6)

[2] CONTENT_WEIGHT (weight between 0, 1 e.g., 0.4)

[3] logLikelihood is a variable that gets populated by a logLikelihood
similarity metric based on the user-item association

[4] content-proximity is a variable that gets populated by
a content-based similarity algorithm (TFIDF).
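
For concreteness, a minimal sketch of that blend in code (the class name, the
method name, and the 0.6/0.4 example weights below are mine, purely for
illustration):

public final class BlendedSimilarity {

  // Example weights only; these are the knobs the experiments would tune.
  private static final double LOGLIKELIHOOD_WEIGHT = 0.6;
  private static final double CONTENT_WEIGHT = 0.4;

  // logLikelihood: raw log-likelihood ratio for the user-item association.
  // contentProximity: TF-IDF content similarity, assumed to lie in [0, 1].
  public static double blend(double logLikelihood, double contentProximity) {
    // 1.0 - 1.0 / (1.0 + llr) squashes the unbounded ratio into [0, 1).
    double logLikelihoodSimilarity = 1.0 - 1.0 / (1.0 + logLikelihood);
    return LOGLIKELIHOOD_WEIGHT * logLikelihoodSimilarity
        + CONTENT_WEIGHT * contentProximity;
  }

  private BlendedSimilarity() {
  }
}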

My question now is: does this mathematical model make sense? Can we add the
two scores the way I did above even though they come from two different
distributions, or will the outcome be skewed?

Please let me know if you have an answer for me.

Thanks very much,

-Ahmed

Re: Merging similarities from two different approaches

Posted by Ahmed Abdeen Hamed <ah...@gmail.com>.
Hi Ted,

Thanks for the help below. I ended up using the binary representation
approach. First, I converted the similarity scores to percentages by
multiplying by 100. After I computed the binary representation of the two
scores, I took the OR and then divided by 100 to convert back to a
similarity. This seems to be working fine.
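
Roughly, the combination reduces to something like the sketch below, where
each score is first binarized against its own cutoff and the two decisions
are OR-ed (the 0.9 cutoffs and the method name are placeholders; in practice
the cutoffs should come from a sound statistical criterion, as Ted suggested):

// Sketch of the binarize-then-OR combination.
public static double orOfBinarizedScores(double logLikelihoodSim, double contentSim) {
  boolean similarBySales = logLikelihoodSim >= 0.9;   // placeholder cutoff
  boolean similarByContent = contentSim >= 0.9;       // placeholder cutoff
  return (similarBySales || similarByContent) ? 1.0 : 0.0;
}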

I also learned over the weekend that there is a theory for computing such a
score from scores that come from different domains. The theory is called
"Utility Theory", and it uses a method called the Kappa statistic. I thought
I would share it with everyone here.

Thanks again for your help with this. It is very much appreciated.

-Ahmed

On Fri, Mar 23, 2012 at 6:32 PM, Ted Dunning <te...@gmail.com> wrote:

> My own recommendation is to reduce both scores to binary form using
> whatever sound statistical method you care to adopt and then use OR.
>
> A viable alternative that is relatively good is to convert both scores to
> percentiles with the same polarity (i.e. 99-th %-ile is very close or very
> similar).  Then transform both percentiles using the logit function to get
> unbounded real numbers.  The logit of p is just log(p / (1-p)) where p is
> in the range (0,1).  These transformed percentiles can be added with
> reasonable impunity and the result can be interpreted pretty easily.  The time
> that this doesn't work so well is when one of the values is heavily
> quantized near the interesting end of the scale, but that problem is
> inherent in the data, not in the method.
>
> A similar result can be had by using -log(1-p) where p is the percentile
> in question.  For values of p near 1, this is approximately the same as
> using the logit function.  For values far from 1, we don't care what it
> means.
>
>
> On Fri, Mar 23, 2012 at 1:52 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> On Fri, Mar 23, 2012 at 8:33 PM, Ahmed Abdeen Hamed
>> <ah...@gmail.com> wrote:
>> > As for merging the scores, I need an OR rule, which translates to
>> > addition. If I used AND, that would make the likelihood smaller because
>> > the probabilities would be multiplied. This would restrict the clusters
>> > to items that appear in the intersection of content-based similarity
>> > AND sales correlations. Does this sound right to you?
>>
>> Not really, because of course you multiply probabilities in all cases.
>> Yes, all similarities would be smaller in absolute terms, but that's
>> fine -- the absolute value does not matter.
>>
>> The problem with adding is that again it assumes the two terms are in
>> the same "units" and that is not clear here. The product doesn't
>> contain that assumption, at least.
>>
>> >
>> > A very important issue I am having now is about evaluation. How do we
>> > evaluate these clusters resulting from a TreeClusteringRecommender?
>> >
>>
>> In the context of recommenders, you don't. The clusters are not the
>> output, just a possible implementation by-product. You could compute
>> metrics like intra-cluster distance vs inter-cluster distance but I
>> don't know what it says about the quality of the recs.
>>
>> You should start with the standard rec eval code if you can.
>>
>
>

Re: Merging similarities from two different approaches

Posted by Ted Dunning <te...@gmail.com>.
My own recommendation is to reduce both scores to binary form using
whatever sound statistical method you care to adopt and then use OR.

A viable alternative that is relatively good is to convert both scores to
percentiles with the same polarity (i.e. 99-th %-ile is very close or very
similar).  Then transform both percentiles using the logit function to get
unbounded real numbers.  The logit of p is just log(p / (1-p)) where p is
in the range (0,1).  These transformed percentiles can be added with
reasonable impunity and the result can be interpreted pretty easily.  The time
that this doesn't work so well is when one of the values is heavily
quantized near the interesting end of the scale, but that problem is
inherent in the data, not in the method.

A similar result can be had by using -log(1-p) where p is the percentile in
question.  For values of p near 1, this is approximately the same as using
the logit function.  For values far from 1, we don't care what it means.
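
A rough sketch of that percentile-plus-logit combination (the percentile
inputs are assumed to be looked up beforehand from the empirical distribution
of each score; the method names here are illustrative):

// logit(p) = log(p / (1 - p)), defined for p in (0, 1).
public static double logit(double p) {
  return Math.log(p / (1.0 - p));
}

// Both inputs are percentiles in (0, 1) with the same polarity (values near 1
// mean "very similar"). The transformed values are unbounded reals, so adding
// them is reasonable.
public static double combinePercentiles(double logLikelihoodPercentile,
                                         double contentPercentile) {
  return logit(logLikelihoodPercentile) + logit(contentPercentile);
}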

On Fri, Mar 23, 2012 at 1:52 PM, Sean Owen <sr...@gmail.com> wrote:

> On Fri, Mar 23, 2012 at 8:33 PM, Ahmed Abdeen Hamed
> <ah...@gmail.com> wrote:
> > As for merging the scores, I need an OR rule, which translates to
> > addition. If I used AND, that would make the likelihood smaller because
> > the probabilities would be multiplied. This would restrict the clusters
> > to items that appear in the intersection of content-based similarity
> > AND sales correlations. Does this sound right to you?
>
> Not really, because of course you multiply probabilities in all cases.
> Yes, all similarities would be smaller in absolute terms, but that's
> fine -- the absolute value does not matter.
>
> The problem with adding is that again it assumes the two terms are in
> the same "units" and that is not clear here. The product doesn't
> contain that assumption, at least.
>
> >
> > A very important issue I am having now is about evaluation. How do we
> > evaluate these clusters resulting from a TreeClusteringRecommender?
> >
>
> In the context of recommenders, you don't. The clusters are not the
> output, just a possible implementation by-product. You could compute
> metrics like intra-cluster distance vs inter-cluster distance but I
> don't know what it says about the quality of the recs.
>
> You should start with the standard rec eval code if you can.
>

Re: Merging similarities from two different approaches

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Mar 23, 2012 at 8:33 PM, Ahmed Abdeen Hamed
<ah...@gmail.com> wrote:
> As for merging the scores, I need an OR rule, which translates to
> addition. If I used AND, that would make the likelihood smaller because the
> probabilities would be multiplied. This would restrict the clusters to items
> that appear in the intersection of content-based similarity AND sales
> correlations. Does this sound right to you?

Not really, because of course you multiply probabilities in all cases.
Yes, all similarities would be smaller in absolute terms, but that's
fine -- the absolute value does not matter.

The problem with adding is that again it assumes the two terms are in
the same "units" and that is not clear here. The product doesn't
contain that assumption, at least.

>
> A very important issue I am having now is about evaluation. How do we
> evaluate these clusters resulting from a TreeClusteringRecommender?
>

In the context of recommenders, you don't. The clusters are not the
output, just a possible implementation by-product. You could compute
metrics like intra-cluster distance vs inter-cluster distance but I
don't know what it says about the quality of the recs.

You should start with the standard rec eval code if you can.
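
As a starting point, a minimal sketch of the standard Taste evaluation loop
looks roughly like this (the data file name is a placeholder, and the plain
user-based recommender inside the builder is just a stand-in for whatever
recommender you actually want to test):

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class EvalSketch {

  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path

    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        // Swap in the recommender under test here (e.g. the clustering setup
        // with the custom similarity); this user-based one is only a stand-in.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };

    // Train on 70% of each user's preferences, evaluate on the rest, for all users.
    double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
    System.out.println("Average absolute difference: " + score);
  }
}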

Re: Merging similarities from two different approaches

Posted by Ahmed Abdeen Hamed <ah...@gmail.com>.
Hello Sean,

Thanks very much for the detailed response! The proximity() method is
actually a similarity metric, not a distance. In my earlier implementations
I used tfidf.distance(), hence the comment you saw in the code that says
distance.

I am working on decomposing the content-based implementation from the
sales-based implementation. So, thank you for that.

As for merging the scores, I need an OR rule, which translates to
addition. If I used AND, that would make the likelihood smaller because the
probabilities would be multiplied. This would restrict the clusters to items
that appear in the intersection of content-based similarity AND sales
correlations. Does this sound right to you?

A very important issue I am having now is about evaluation. How do we
evaluate these clusters resulting from a TreeClusteringRecommender?

I would appreciate any insight.

Thanks so much for this lively discussion!

-Ahmed

On Thu, Mar 22, 2012 at 6:17 PM, Sean Owen <sr...@gmail.com> wrote:

> Yes, but you can't use it as both things at once. I meant that you
> swap them at the broadest level -- at your original input. So all
> "items" are really users and vice versa. At the least you need two
> separate implementations, encapsulating two different notions of
> similarity.
>
> Similarity is item-item or user-user, not item-user. It makes some
> sense to implement item-item similarity based on tags, so the first
> half of the method looks OK (except that I'd expect you to implement
> itemSimilarity()).
>
> I think the other half makes more sense if you are calling
> getUsersForItem() -- input is item, output are users.
>
> As for the final line -- my original comment stands, though it's right
> for a wrong reason. You are not combining two distances here. You're
> combining a similarity value and a distance (right? proximity is a
> distance function?) and that's definitely not right. They go opposite
> ways: big distance means small similarity.
>
> If you handle two similarities, the simple thing that is in the
> ballpark of theoretically sound is to take their product.
>
>
> On Thu, Mar 22, 2012 at 9:48 PM, Ahmed Abdeen Hamed
> <ah...@gmail.com> wrote:
> > You are correct. In a previous post, I inquired about the use of
> > TreeClusteringRecommender, which is based upon a UserSimilarity metric. My
> > question was whether I can use it for ItemSimilarity, and your answer was
> > yes, just feed the itemID as a userID and vice versa, and that's what I am
> > doing in the method. This is what this code is doing.
> >
> > The purpose of this method is to derive a similarity that is based on
> > item attributes (name, brand, category) in addition to what the
> > loglikelihood offers, so I am guaranteed to get recommendations for items
> > such as ("The Matrix" and "The Matrix Reloaded") even if they never
> > co-occur in the data model. This is why I need to merge the two scores
> > somehow.
> >
> > Thanks again!
> > Ahmed
>

Re: Merging similarities from two different approaches

Posted by Sean Owen <sr...@gmail.com>.
Yes, but you can't use it as both things at once. I meant that you
swap them at the broadest level -- at your original input. So all
"items" are really users and vice versa. At the least you need two
separate implementations, encapsulating two different notions of
similarity.

Similarity is item-item or user-user, not item-user. It makes some
sense to implement item-item similarity based on tags, so the first
half of the method looks OK (except that I'd expect you to implement
itemSimilarity()).

I think the other half makes more sense if you are calling
getUsersForItem() -- input is item, output are users.

As for the final line -- my original comment stands, though it's right
for a wrong reason. You are not combining two distances here. You're
combining a similarity value and a distance (right? proximity is a
distance function?) and that's definitely not right. They go opposite
ways: big distance means small similarity.

If you handle two similarities, the simple thing that is in the
ballpark of theoretically sound is to take their product.
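
A minimal sketch of that product combination, assuming both inputs are
already similarities in [0, 1] with the same polarity (the method name is
illustrative):

// Both signals have to agree for the combined score to be high, which is the
// AND-like behaviour discussed above; unlike addition, no assumption about
// shared "units" is needed beyond both values lying in [0, 1].
public static double productOfSimilarities(double contentSimilarity,
                                           double logLikelihoodSimilarity) {
  return contentSimilarity * logLikelihoodSimilarity;
}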


On Thu, Mar 22, 2012 at 9:48 PM, Ahmed Abdeen Hamed
<ah...@gmail.com> wrote:
> You are correct. In a previous post, I inquired about the use of
> TreeClusteringRecommender, which is based upon a UserSimilarity metric. My
> question was whether I can use it for ItemSimilarity, and your answer was
> yes, just feed the itemID as a userID and vice versa, and that's what I am
> doing in the method. This is what this code is doing.
>
> The purpose of this method is to derive a similarity that is based on item
> attributes (name, brand, category) in addition to what the loglikelihood
> offers, so I am guaranteed to get recommendations for items such as
> ("The Matrix" and "The Matrix Reloaded") even if they never co-occur in the
> data model. This is why I need to merge the two scores somehow.
>
> Thanks again!
> Ahmed

Re: Merging similarities from two different approaches

Posted by Ahmed Abdeen Hamed <ah...@gmail.com>.
You are correct. In a previous post, I inquired about the use of
TreeClusteringRecommender, which is based upon a UserSimilarity metric. My
question was whether I can use it for ItemSimilarity, and your answer was
yes, just feed the itemID as a userID and vice versa, and that's what I am
doing in the method. This is what this code is doing.

The purpose of this method is to derive a similarity that is based on item
attributes (name, brand, category) in addition to what the loglikelihood
offers, so I am guaranteed to get recommendations for items such as
("The Matrix" and "The Matrix Reloaded") even if they never co-occur in the
data model. This is why I need to merge the two scores somehow.

Thanks again!
Ahmed




On Thu, Mar 22, 2012 at 5:38 PM, Sean Owen <sr...@gmail.com> wrote:

> You're implementing userSimilarity(), but appear to be computing
> item-item similarity. Halfway through, you use the item IDs as user
> IDs. I can't see what this is intending to do as a result?
>
>

Re: Merging similarities from two different approaches

Posted by Sean Owen <sr...@gmail.com>.
You're implementing userSimilarity(), but appear to be computing
item-item similarity. Halfway through, you use the item IDs as user
IDs. I can't see what this is intending to do as a result?

On Thu, Mar 22, 2012 at 9:33 PM, Ahmed Abdeen Hamed
<ah...@gmail.com> wrote:
> Hello Sean,
>
> I am trying to cluster not only based on the sales data in a data model,
> but also based on content. Below is the function that does that:
>
> @Override
> public double userSimilarity(long itemID1, long itemID2) throws TasteException {
>
>   // converting the item ids from long to String
>   String itemOneID = String.valueOf(itemID1);
>   String itemTwoID = String.valueOf(itemID2);
>
>   // looking up the ids in the hashmap
>   String itemOneValue = productIdAttributesMap.get(itemOneID);
>   String itemTwoValue = productIdAttributesMap.get(itemTwoID);
>
>   // load the tfidf object with many documents
>   for (String s : productIdAttributesMap.values()) {
>     tfIdf.handle(s);
>   }
>
>   // compute the distance and return it...
>   double proximity = 0;
>   if (itemOneValue != null && itemTwoValue != null) {
>     proximity = tfIdf.proximity(itemOneValue, itemTwoValue);
>   }
>
>   // now computing similarity between items from sales data
>   DataModel dataModel = getDataModel();
>   FastIDSet prefs1 = dataModel.getItemIDsFromUser(itemID1);
>   FastIDSet prefs2 = dataModel.getItemIDsFromUser(itemID2);
>
>   long prefs1Size = prefs1.size();
>   long prefs2Size = prefs2.size();
>   long intersectionSize =
>       prefs1Size < prefs2Size ? prefs2.intersectionSize(prefs1)
>                               : prefs1.intersectionSize(prefs2);
>   if (intersectionSize == 0) {
>     return Double.NaN;
>   }
>
>   long numItems = dataModel.getNumItems();
>   double logLikelihood =
>       LogLikelihood.logLikelihoodRatio(intersectionSize,
>                                        prefs2Size - intersectionSize,
>                                        prefs1Size - intersectionSize,
>                                        numItems - prefs1Size - prefs2Size + intersectionSize);
>
>   // merging the distance and the loglikelihood similarity
>   return ExperimentParams.LOGLIKELIHOOD_WEIGHT * (1.0 - 1.0 / (1.0 + logLikelihood))
>       + (ExperimentParams.PROXIMITY_WEIGHT * proximity);
> }
>
>
> Please let me know if this is clearer now.
>
> Thanks very much,
>
> -Ahmed
>
>
> On Thu, Mar 22, 2012 at 5:26 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>> What do you mean that you have a user-item association from a
>> log-likelihood metric?
>>
>> Combining two values is easy in the sense that you can average them or
>> something, but only if they are in the same "units". Log likelihood
>> may be viewed as a probability. The distance function you derive from
>> it -- and your own TFIDF distance -- it's not clear if these are
>> comparable.
>>
>> Rather than get into this, I wonder whether you need any of this at
>> all, since I'm not sure what the user-item value is to begin with.
>> That's your output, not an input.
>>
>> On Thu, Mar 22, 2012 at 9:18 PM, Ahmed Abdeen Hamed
>> <ah...@gmail.com> wrote:
>> > Hello,
>> >
>> > I developed a recommender that computes the distance between two items
>> > based on contents. However, I also need to include the user-item
>> > association. But when I do that, I end up having a similarity score from
>> > the item-item content-based approach and another similarity score based
>> > on the item-user association (loglikelihood). I am now designing some
>> > experiments to consider different weights for each approach before I add
>> > them together. Here is the mathematical model I have in mind:
>> >
>> > LOGLIKELIHOOD_WEIGHT*(1.0 - 1.0 / (1.0 + logLikelihood)) +
>> > (CONTENT_WEIGHT* content-proximity) such that
>> >
>> > [1] LOGLIKELIHOOD_WEIGHT (weight between 0, 1 e.g., 0.6)
>> >
>> > [2] CONTENT_WEIGHT (weight between 0, 1 e.g., 0.4)
>> >
>> > [3] logLikelihood is a variable that gets populated by a logLikelihood
>> > similarity metric based on the user-item association
>> >
>> > [4] content-proximity is a variable that gets populated by
>> > a content-based similarity algorithm (TFIDF).
>> >
>> > My question now is: does this mathematical model make sense? Can we add
>> > the two scores the way I did above even though they come from two
>> > different distributions, or will the outcome be skewed?
>> >
>> > Please let me know if you have an answer for me.
>> >
>> > Thanks very much,
>> >
>> > -Ahmed
>
>

Re: Merging similarities from two different approaches

Posted by Ahmed Abdeen Hamed <ah...@gmail.com>.
Hello Sean,

I am trying to cluster not only based on the sales data in a data model,
but also based on content. Below is the function that does that:

@Override
public double userSimilarity(long itemID1, long itemID2) throws TasteException {

  // converting the item ids from long to String
  String itemOneID = String.valueOf(itemID1);
  String itemTwoID = String.valueOf(itemID2);

  // looking up the ids in the hashmap
  String itemOneValue = productIdAttributesMap.get(itemOneID);
  String itemTwoValue = productIdAttributesMap.get(itemTwoID);

  // load the tfidf object with many documents
  for (String s : productIdAttributesMap.values()) {
    tfIdf.handle(s);
  }

  // compute the distance and return it...
  double proximity = 0;
  if (itemOneValue != null && itemTwoValue != null) {
    proximity = tfIdf.proximity(itemOneValue, itemTwoValue);
  }

  // now computing similarity between items from sales data
  DataModel dataModel = getDataModel();
  FastIDSet prefs1 = dataModel.getItemIDsFromUser(itemID1);
  FastIDSet prefs2 = dataModel.getItemIDsFromUser(itemID2);

  long prefs1Size = prefs1.size();
  long prefs2Size = prefs2.size();
  long intersectionSize =
      prefs1Size < prefs2Size ? prefs2.intersectionSize(prefs1)
                              : prefs1.intersectionSize(prefs2);
  if (intersectionSize == 0) {
    return Double.NaN;
  }

  long numItems = dataModel.getNumItems();
  double logLikelihood =
      LogLikelihood.logLikelihoodRatio(intersectionSize,
                                       prefs2Size - intersectionSize,
                                       prefs1Size - intersectionSize,
                                       numItems - prefs1Size - prefs2Size + intersectionSize);

  // merging the distance and the loglikelihood similarity
  return ExperimentParams.LOGLIKELIHOOD_WEIGHT * (1.0 - 1.0 / (1.0 + logLikelihood))
      + (ExperimentParams.PROXIMITY_WEIGHT * proximity);
}

Please let me know if this is clearer now.

Thanks very much,

-Ahmed


On Thu, Mar 22, 2012 at 5:26 PM, Sean Owen <sr...@gmail.com> wrote:

> What do you mean that you have a user-item association from a
> log-likelihood metric?
>
> Combining two values is easy in the sense that you can average them or
> something, but only if they are in the same "units". Log likelihood
> may be viewed as a probability. The distance function you derive from
> it -- and your own TFIDF distance -- it's not clear if these are
> comparable.
>
> Rather than get into this, I wonder whether you need any of this at
> all, since I'm not sure what the user-item value is to begin with.
> That's your output, not an input.
>
> On Thu, Mar 22, 2012 at 9:18 PM, Ahmed Abdeen Hamed
> <ah...@gmail.com> wrote:
> > Hello,
> >
> > I developed a recommender that computes the distance between two items
> > based on contents. However, I also need to include the user-item
> > association. But when I do that, I end up having a similarity score from
> > the item-item content-based approach and another similarity score based
> > on the item-user association (loglikelihood). I am now designing some
> > experiments to consider different weights for each approach before I add
> > them together. Here is the mathematical model I have in mind:
> >
> > LOGLIKELIHOOD_WEIGHT*(1.0 - 1.0 / (1.0 + logLikelihood)) +
> > (CONTENT_WEIGHT* content-proximity) such that
> >
> > [1] LOGLIKELIHOOD_WEIGHT (weight between 0, 1 e.g., 0.6)
> >
> > [2] CONTENT_WEIGHT (weight between 0, 1 e.g., 0.4)
> >
> > [3] logLikelihood is a variable that gets populated by a logLikelihood
> > similarity metric based on the user-item association
> >
> > [4] content-proximity is a variable that gets populated by
> > a content-based similarity algorithm (TFIDF).
> >
> > My question now is: does this mathematical model make sense? Can we add
> > the two scores the way I did above even though they come from two
> > different distributions, or will the outcome be skewed?
> >
> > Please let me know if you have an answer for me.
> >
> > Thanks very much,
> >
> > -Ahmed
>

Re: Merging similarities from two different approaches

Posted by Sean Owen <sr...@gmail.com>.
What do you mean that you have a user-item association from a
log-likelihood metric?

Combining two values is easy in the sense that you can average them or
something, but only if they are in the same "units". Log likelihood
may be viewed as a probability. The distance function you derive from
it -- and your own TFIDF distance -- it's not clear if these are
comparable.

Rather than get into this, I wonder whether you need any of this at
all, since I'm not sure what the user-item value is to begin with.
That's your output, not an input.

On Thu, Mar 22, 2012 at 9:18 PM, Ahmed Abdeen Hamed
<ah...@gmail.com> wrote:
> Hello,
>
> I developed a recommender that computes the distance between two items
> based on contents. However, I also need to include the user-item
> association. But when I do that, I end up having a similarity score from
> the item-item content-based approach and another similarity score based on
> the item-user association (loglikelihood). I am now designing some
> experiments to consider different weights for each approach before I add
> them together. Here is the mathematical model I have in mind:
>
> LOGLIKELIHOOD_WEIGHT*(1.0 - 1.0 / (1.0 + logLikelihood)) +
> (CONTENT_WEIGHT* content-proximity) such that
>
> [1] LOGLIKELIHOOD_WEIGHT (weight between 0, 1 e.g., 0.6)
>
> [2] CONTENT_WEIGHT (weight between 0, 1 e.g., 0.4)
>
> [3] logLikelihood is a variable that gets populated by a logLikelihood
> similarity metric based on the user-item association
>
> [4] content-proximity is a variable that gets populated by
> a content-based similarity algorithm (TFIDF).
>
> My question now is: does this mathematical model make sense? Can we add
> the two scores the way I did above even though they come from two
> different distributions, or will the outcome be skewed?
>
> Please let me know if you have an answer for me.
>
> Thanks very much,
>
> -Ahmed