Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/07/03 17:39:25 UTC

Fwd: LLR

This is solved, with one question at the end. Basically there are at least three ways to calculate LLR, and Hadoop itemsimilarity is the odd one out.

Looking at Ted’s GitHub example, the counts seem to be taken from the cooccurrence matrix with the diagonal removed, so no self-cooccurrence.

         val AtAdNoSelfCooc = dense(
         (0, 1, 0, 1, 0),
         (1, 0, 0, 0, 0),
         (0, 0, 0, 1, 0),
         (1, 0, 1, 0, 0),
         (0, 0, 0, 0, 0))

        // again using (1,0)

        // Ted’s code
        for (MatrixSlice row : cooccurrence) {
            for (Vector.Element element : row.vector().nonZeroes()) {
                long k11 = (long) element.get();                        // = 1
                long k12 = (long) (rowSums.get(row.index()) - k11);     // = 0
                long k21 = (long) (colSums.get(element.index()) - k11); // = 1
                long k22 = (long) (total - k11 - k12 - k21);            // = 2
                // k =
                //       1, 0
                //       1, 2
                double score = LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22);
                element.set(score);
            }
        }
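
As a quick check on those numbers, here is a minimal, self-contained sketch (plain Java, not Mahout code; class and method names are just for illustration) that plugs the k table for entry (1,0) into the usual entropy form of the G^2 statistic, which is what LogLikelihood computes as far as I can tell:

    // Verifies the (1,0) entry: k11 = 1, k12 = 0, k21 = 1, k22 = 2
    public class LlrCheck {
      static double xLogX(double x) { return x == 0.0 ? 0.0 : x * Math.log(x); }

      // unnormalized entropy of a set of counts: xLogX(sum) - sum of xLogX(count)
      static double entropy(double... counts) {
        double sum = 0.0, parts = 0.0;
        for (double c : counts) { parts += xLogX(c); sum += c; }
        return xLogX(sum) - parts;
      }

      static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matEntropy);
      }

      public static void main(String[] args) {
        double score = llr(1, 0, 1, 2);
        System.out.println(score);            // ~1.7260924347106847 (plain LLR)
        System.out.println(Math.sqrt(score)); // ~1.3138083706198118 (root LLR; Mahout’s
                                              // rootLogLikelihoodRatio also flips the sign for
                                              // negatively associated pairs, not the case here)
      }
    }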

So the k matrix looks correct if the above assumptions are correct. But the Hadoop impl returns a slightly massaged value for LLR:

    // mrlegacy code for itemsimilarity
    // preferring1     = number of users who prefer item 1
    // preferring2     = number of users who prefer item 2
    // preferring1and2 = number of users who prefer both (the cooccurrence count)
    // numUsers        = total number of users
    double logLikelihood =
        LogLikelihood.logLikelihoodRatio(preferring1and2,
                                         preferring2 - preferring1and2,
                                         preferring1 - preferring1and2,
                                         numUsers - preferring1 - preferring2 + preferring1and2);
    return 1.0 - 1.0 / (1.0 + logLikelihood);

Notice there is no root LLR (the ranking is the same, so that seems fine), and I’m not sure why the 1.0 - 1.0 / (1.0 + LLR). But plugging the LLR from the R calculation into that expression yields 0.6331746, the same value as Hadoop itemsimilarity.
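
For completeness, running the plain LLR from above through that return expression (a trivial sketch, just the arithmetic):

    public class HadoopMapping {
      public static void main(String[] args) {
        double llr = 1.7260924347106847;              // plain (non-root) LLR for the (1,0) pair
        System.out.println(1.0 - 1.0 / (1.0 + llr));  // 0.6331745808516107, the Hadoop value
      }
    }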

So the mystery is solved; now the question is why the return is "1.0 - 1.0 / (1.0 + logLikelihood);".

I will assume that, at least for comparison with the legacy code, we want to do this, but I’d like to know why.
 


Begin forwarded message:

From: Pat Ferrel <pa...@occamsmachete.com>
Subject: LLR
Date: July 2, 2014 at 11:56:44 AM PDT
To: Ted Dunning <te...@gmail.com>
Cc: Sebastian Schelter <ss...@apache.org>

Might as well add myself to the list of people asking for an LLR explanation. Hadoop itemsimilarity is returning different values than the Spark version on the small matrix below. I’m having a hard time sorting this out, so please bear with me.

Let’s take the A’A case for simplicity. It looks like we want to calculate the LLR for each non-zero entry in the AtA matrix using counts we got from A. For example, let’s take item 1 = itemA and item 0 = itemB, i.e. entry (1,0).

    //input matrix rows = users, columns = items
    val A = dense(
        (1, 1, 0, 0, 0),
        (0, 0, 1, 1, 0),
        (0, 0, 0, 0, 1),
        (1, 0, 0, 1, 0))

    val AtA = A.transpose().times(A)

    // AtA == AtAd:
    val AtAd = dense(
         (2, 1, 0, 1, 0),
         (1, 1, 0, 0, 0),
         (0, 0, 1, 1, 0),
         (1, 0, 1, 2, 0),
         (0, 0, 0, 0, 1))


It looks like Spark cooccurrence calculates for itemA = 1, itemB = 0, k =
<mahout] 2014-07-02 11-29-11 2014-07-02 11-32-00.jpg>  (attached screenshot of the k table; the image does not survive in the plain-text archive)

Using hadoop itemsimilarity I get 0.6331745808516107; using the above k and rootLogLikelihoodRatio I get 1.3138083706198118; using logLikelihoodRatio it comes out (not surprisingly) 1.7260924347106847, which agrees with the R version from Ted’s blog. So either k is wrong or I’ve missed some other difference between the Hadoop and Spark versions. I assume root or not doesn’t matter since the ranking is the same.

It would help if you could tell me what k11 … k22 are for entry (1,0) of AtA and how you calculated them.


Re: LLR

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The Hadoop return value maps LLR into the 0..1 range, so all questions are answered. None of this was a bug, just a different way to return values.
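
To see why that works as a similarity score, a tiny sketch: 1 - 1/(1 + llr) is monotone in llr and bounded in [0, 1), so it squashes the LLR into 0..1 without changing any rankings.

    public class LlrSquash {
      public static void main(String[] args) {
        // 1 - 1/(1 + llr) maps [0, inf) into [0, 1), preserving order
        for (double llr : new double[] {0.0, 0.5, 1.7260924347106847, 10.0, 100.0}) {
          System.out.printf("llr = %8.4f -> %.4f%n", llr, 1.0 - 1.0 / (1.0 + llr));
        }
      }
    }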
 