Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/08/29 20:27:54 UTC

Re: LLR

Great thanks,

Calculating TF-IDF on the input would seem to imply that the preference strength is being used as a proxy for term count, right? As you say there is some question about whether this should be done. There certainly are ways to preserve these “counts” when indexing happens in the search engine.

This issue seems to pop up a lot for two reasons:
1) people are still trying to use ratings as preference strengths; I tend to be dubious of this as a ranking method since so much effort has gone into it here with so little effect.
2) people want to mix actions; we have a good way to handle this now and weights often don’t work—mixing actions may even make things worse.

Until your statements below I was strongly leaning towards ignoring preference strength and treating it all as boolean/binary. Can you think of a case that I missed? When I mine for preference data I get ratings and keep only 4 or 5 out of 5 as unambiguous positive preferences, tossing the rest as ambiguous.
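Roughly what I mean, as an R sketch with a made-up ratings matrix and the 4-or-5 cutoff described above (just a toy, not the actual pipeline):

    # toy ratings matrix: users in rows, items in columns, 0 = no rating
    ratings <- matrix(c(5, 3, 0, 4,
                        0, 2, 5, 1,
                        4, 0, 4, 5), nrow = 3, byrow = TRUE)
    # keep only unambiguous positives (4 or 5), toss the rest
    A <- (ratings >= 4) * 1
    A
    #      [,1] [,2] [,3] [,4]
    # [1,]    1    0    0    1
    # [2,]    0    0    1    0
    # [3,]    1    0    1    1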

As to normalizing the indicators, I have been turning that off for that field in Solr. Have to re-think that. 


On Aug 29, 2014, at 10:50 AM, Ted Dunning <te...@gmail.com> wrote:




On Fri, Aug 29, 2014 at 9:48 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
A[A’B] leaves users in rows, so my mistake, and sparsifyWithLLR(A’B) is the item-similarity output.

Not sure what you mean by A diag(itemAweights) 

Intuitively speaking, is this taking A and replacing any values (preference strengths) with IDF weights? Not TF-IDF, because TF = 1. If so, wouldn’t it be binarize(A) diag(itemAweights)? R noob, so bear with me.

binarize(A) diag(itemAweights) gives x.idf scores with no normalization

A   diag(itemAweights) gives tf.idf scores with no normalization

Whether we binarize or not is an open question.  The search engine would not do so, but we might choose to do so before passing the user history to the search engine.
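In R terms, with a toy A (users in rows, as above) and a plausible definition of binarize(), the two options look like this; data and names are only illustrative:

    binarize <- function(M) (M > 0) * 1                 # any positive preference -> 1

    A <- matrix(c(2, 0, 1,
                  0, 3, 1,
                  1, 1, 0), nrow = 3, byrow = TRUE)     # users x items, preference "counts"

    # IDF per item, counting how many users touched each item
    itemAweights <- log(nrow(A) + 1) - log(colSums(binarize(A)) + 1)

    binarize(A) %*% diag(itemAweights)   # "x.idf": presence times idf
    A %*% diag(itemAweights)             # "tf.idf": preference strength times idf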



Then you do an actual multiply of idfWeightedA by sparsifyWithLLR(A’B)?

If you want to score, yes.  The only gotcha here is that you may want to normalize columns of  sparsifyWithLLR(A’B) the way that the search engine does.
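Sketched in R (idfWeightedA and S are toy placeholders here, and the L2 column normalization is just one plausible stand-in for what the search engine does):

    set.seed(1)
    idfWeightedA <- matrix(runif(6), nrow = 2)        # users x itemsA, e.g. A %*% diag(itemAweights)
    S            <- matrix(rbinom(9, 1, 0.5), 3, 3)   # itemsA x itemsB, e.g. sparsifyWithLLR(A'B)

    normalizeColumns <- function(M) {
      norms <- sqrt(colSums(M^2))
      norms[norms == 0] <- 1                          # guard against empty columns
      M %*% diag(1 / norms)
    }

    scores <- idfWeightedA %*% normalizeColumns(S)    # users x itemsB, one score per (user, item)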

 
If so, this is interesting, since it takes two matrices whose values are calculated with different methods, A (IDF) and [A’B] (LLR), and multiplies them. The multiply is, of course, valid, but a similarity calc would either not be valid because of the different weighting methods, or would have to discard the (LLR) weights?

The LLR is only useful as a filter, not for weighting.

Yes.  These come from different methods, but this provides complementarity, not contradiction.
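For reference, the LLR score itself (the G^2 statistic on the 2x2 contingency counts) and its use purely as a filter could be sketched in R like this; the threshold is arbitrary, and the point is that only a 0/1 indicator survives, not the LLR value:

    xlogx   <- function(x) ifelse(x == 0, 0, x * log(x))
    entropy <- function(...) { k <- c(...); xlogx(sum(k)) - sum(xlogx(k)) }

    # k11: users who did both items, k12/k21: one but not the other, k22: neither
    llr <- function(k11, k12, k21, k22)
      2 * (entropy(k11 + k12, k21 + k22) + entropy(k11 + k21, k12 + k22) -
           entropy(k11, k12, k21, k22))

    # use as a filter: keep the cooccurrence only if LLR clears a threshold, then drop the score
    keep      <- llr(k11 = 20, k12 = 5, k21 = 7, k22 = 968) > 10
    indicator <- as.numeric(keep)          # 1 if kept, 0 otherwise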
 

Have I got this close to right?

Yes. Very close to right.
 

Using a search engine, the sparsifyWithLLR(A’B) weights are discarded and indexing re-weights A’B with IDF (TF = 1). Then cosine is used in place of the multiply.

Cosine is merely a normalization on either side.  Normalizing the query doesn't matter since it doesn't change ordering.  Normalizing the LLR side is probably useful.
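(A quick numeric check of the ordering point, with made-up numbers:)

    q <- c(2, 0, 5, 1)                               # one user's history row
    M <- matrix(c(1, 0, 1, 0, 1,
                  0, 1, 0, 1, 1,
                  1, 1, 0, 0, 1,
                  0, 0, 1, 1, 1), nrow = 4, byrow = TRUE)   # toy indicator matrix

    s1 <- q %*% M                                    # raw query
    s2 <- (q / sqrt(sum(q^2))) %*% M                 # cosine-normalized query
    identical(order(s1), order(s2))                  # TRUE: the item ranking is unchanged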
 
There would seem to be some fidelity loss here over the R method you describe.

I view the R method as the somewhat lossy one, rather than the search engine.  I think that the search engine is the one with proven performance.
 

I ask because if even I can understand it I can probably blog about it.


:-)

True.  Too true.

 


On Aug 28, 2014, at 5:45 PM, Ted Dunning <te...@gmail.com> wrote:


What I would do for recs for all users is more explicitly thus:

    A diag(itemAweights) sparsifyWithLLR(A'B) 

(gives a matrix with rows for users and columns for items)  
(each row contains scores for each item)

itemAweights = log(nrow(A)+1) - log(colSums(binarize(A))+1)

This is just IDF weighting, of course.
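A small end-to-end sketch of this in R, under the same conventions (users in rows of A and B). The sparsifyWithLLR below is a toy stand-in that thresholds the LLR of each cooccurrence cell and keeps a 1; it only illustrates the shape of the computation, not Mahout's actual implementation:

    set.seed(1)
    binarize <- function(M) (M > 0) * 1

    # toy data: 6 users, 4 items with action A, 3 items with action B
    A <- matrix(rbinom(24, 1, 0.4), nrow = 6)        # users x itemsA (primary action)
    B <- matrix(rbinom(18, 1, 0.5), nrow = 6)        # users x itemsB (secondary action)

    xlogx   <- function(x) ifelse(x == 0, 0, x * log(x))
    entropy <- function(...) { k <- c(...); xlogx(sum(k)) - sum(xlogx(k)) }
    llr <- function(k11, k12, k21, k22)
      2 * (entropy(k11 + k12, k21 + k22) + entropy(k11 + k21, k12 + k22) -
           entropy(k11, k12, k21, k22))

    # toy stand-in for sparsifyWithLLR(A'B): keep a cell only if its LLR clears a threshold
    sparsifyWithLLR <- function(A, B, threshold = 1) {
      C <- t(binarize(A)) %*% binarize(B)            # cooccurrence counts, itemsA x itemsB
      nUsers  <- nrow(A)
      aCounts <- colSums(binarize(A))                # users per A-item
      bCounts <- colSums(binarize(B))                # users per B-item
      S <- matrix(0, nrow(C), ncol(C))
      for (i in seq_len(nrow(C))) for (j in seq_len(ncol(C))) {
        k11 <- C[i, j]
        k12 <- aCounts[i] - k11
        k21 <- bCounts[j] - k11
        k22 <- nUsers - k11 - k12 - k21
        if (llr(k11, k12, k21, k22) > threshold) S[i, j] <- 1   # keep as 0/1 indicator
      }
      S
    }

    itemAweights <- log(nrow(A) + 1) - log(colSums(binarize(A)) + 1)   # IDF per A-item

    # rows for users, columns for B-items; each row holds that user's scores
    recs <- A %*% diag(itemAweights) %*% sparsifyWithLLR(A, B)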



On Thu, Aug 28, 2014 at 4:36 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
Just got rowsimilarity checked in. It looks to me like the difference is that the transposes are swapped and the LLR counts are either row-wise or column-wise.

itemsimilarity: [A’A] column-wise LLR counts
rowsimilarity: [AA’] row-wise counts
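(In R terms, the raw count matrices the LLR test is then applied to are roughly:)

    set.seed(2)
    binarize <- function(M) (M > 0) * 1
    A <- matrix(rbinom(20, 1, 0.4), nrow = 5)        # users x items

    t(binarize(A)) %*% binarize(A)   # [A'A]: items x items, the itemsimilarity counts
    binarize(A) %*% t(binarize(A))   # [AA']: rows x rows, the rowsimilarity counts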

To take the next step and do recs you do [A’A]A’ for all recs to all users. Actually + [A’B]B’ and so on. Should these use LLR too? If so it suggests an abstraction replacing matrix multiply for certain cases.