Posted to user@mahout.apache.org by Jason Smith <ja...@gmail.com> on 2011/06/02 05:00:53 UTC

PearsonCorrelationSimilarity returning NaN for user similarity with perfect match

What is the reasoning behind PearsonCorrelationSimilarity returning
NaN for userSimilarity when the two users' overlapping reviews match
up perfectly?
With a limited set of rating values (1 to 5 stars), it seems
quite possible that a user with only a few ratings will have
identical ratings on the items shared with other users. Am I missing
something here?

    // Note that sum of X and sum of Y don't appear here since they are
    // assumed to be 0; the data is assumed to be centered.
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      // One or both parties has -all- the same ratings;
      // can't really say much similarity under this measure
      return Double.NaN;
    }
    return sumXY / denominator;

Re: PearsonCorrelationSimilarity returning NaN for user similarity with perfect match

Posted by Sean Owen <sr...@gmail.com>.
I assume one or both has all the same ratings, at least in the overlapping
items. This means the standard deviation of their ratings is zero, and
since the formula divides by it, the correlation is undefined. I think the
answer is, that's just how it's defined.

This tends to happen when the users have little overlap -- 1-2 items. And
ignoring that as a similarity is generally good.

But yes this is a reason you might not choose this metric.
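
To make the zero-denominator case concrete, here is a minimal standalone
sketch. It is not Mahout's implementation: the class PearsonSketch and the
method pearson are hypothetical names, and the sketch centers the data itself
rather than assuming it is already centered. It only mirrors the denominator
check from the snippet quoted in the question.

```java
// Hypothetical illustration only; not part of Mahout.
class PearsonSketch {

    /**
     * Pearson correlation over two equal-length arrays holding the ratings
     * each user gave to the items they have in common.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int n = x.length;

        // Center the data: compute each user's mean over the overlap.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Accumulate the centered sums used by the formula.
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumX2 += dx * dx;
            sumY2 += dy * dy;
        }

        // Same check as in the quoted snippet: if either user gave the
        // same rating to every shared item, the denominator is zero.
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            return Double.NaN;
        }
        return sumXY / denominator;
    }

    public static void main(String[] args) {
        // Both users rated the two shared items 4 stars: a "perfect match",
        // but neither has any variance, so the result is NaN.
        System.out.println(pearson(new double[] {4, 4}, new double[] {4, 4}));

        // Only one user is constant: still NaN.
        System.out.println(pearson(new double[] {4, 4}, new double[] {1, 5}));

        // Identical ratings that do vary: correlation is (approximately) 1.
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {1, 2, 3}));
    }
}
```

Note that a single shared item always falls into the NaN case, since one
rating can never vary around its own mean.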
