You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Emerson Murphy-HIll (JIRA)" <ji...@apache.org> on 2010/06/28 03:23:49 UTC

[jira] Created: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

AbstractSimilarity improperly computes vector metrics
-----------------------------------------------------

                 Key: MAHOUT-430
                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
    Affects Versions: 0.4
            Reporter: Emerson Murphy-HIll


Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:

if (compare <= 0) {
  if (++xPrefIndex >= xLength) {
    break;
  }
...

The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:

if (compare <= 0) {
  if (++xPrefIndex >= xLength) {
    sumY2 += squareSumRest(yPrefs,yPrefIndex);
    break;
  }
...

private double squareSumRest(Preference[] preferences, int startingFrom) {
  double squareSum = 0;
  for(int i = startingFrom; i < preferences.length; i++){
    double val = preferences[i].getValue();
    squareSum += val*val;
  }
  return squareSum;
}

I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.

A couple of comments about these two methods:
1) They're really hard to reason about. Isn't there a simpler implementation?
2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883479#action_12883479 ] 

Hudson commented on MAHOUT-430:
-------------------------------

Integrated in Mahout-Quality #106 (See [http://hudson.zones.apache.org/hudson/job/Mahout-Quality/106/])
    MAHOUT-430


> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Emerson Murphy-HIll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883194#action_12883194 ] 

Emerson Murphy-HIll commented on MAHOUT-430:
--------------------------------------------

I see what you mean; I didn't make the distinction between zero values and missing values.

But I am actually using a PreferenceInferrer that fills in 0 for missing values. When I do that, the break still happens before the other dimensions are included in the calculations of sumX or sumY.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Emerson Murphy-HIll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883576#action_12883576 ] 

Emerson Murphy-HIll commented on MAHOUT-430:
--------------------------------------------

Does the itemSimilarity method have the same issue? I realize that it doesn't use an inferrer, but it does treat missing values as zero, so I think that it also needs to consider the remainder of not-finished PreferenceArray.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Reopened: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reopened MAHOUT-430:
------------------------------


> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>            Priority: Minor
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-430:
-----------------------------

    Priority: Minor  (was: Major)
    Due Date: 30/Jun/10

Hmm yeah that doesn't look right, in the case where you have the inferrer. Let me look at it again tonight and put in a fix if needed or remember why it's done that way.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>            Priority: Minor
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-430.
------------------------------

      Assignee: Sean Owen
    Resolution: Not A Problem

But once you fall off the end of one, the remaining data points in the other array are irrelevant, since they have no counterpart in the one that ended. These metrics are only computed over the item/users that are common to both.

You can write a test to double-check but after looking at it again I'm certain it's correct.

It's probably the fastest implementation possible, and because of that, not the shortest. I don't think it's too messy, but have a crack at improving it while not comprising speed if you like. This is the best I could do after years of looking at this method.

The duplication is unfortunate but hard to refactor because they're doing slightly different things actually. Again open to suggestions but I have stared at this for a long time and, given the priority for speed, not been able to do better.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883135#action_12883135 ] 

Sean Owen commented on MAHOUT-430:
----------------------------------

All of these are only defined over dimensions with a value in both vectors. Other dimensions effectively don't exist. This is just how it is defined and why the sum of those other x or y values does not exist in the computation.

I think you are viewing missing values as 0 - but existing. While I'd suggest that's not a good idea for recommendations, this idea is implemented via PreferenceInferrer. Supply one that fills in 0 and you should get what you expect. 

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883578#action_12883578 ] 

Sean Owen commented on MAHOUT-430:
----------------------------------

Shouldn't be the same issue, no. Without an inferrer those preferences really don't exist, so their counterparts don't exist in the computation.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Emerson Murphy-HIll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883130#action_12883130 ] 

Emerson Murphy-HIll commented on MAHOUT-430:
--------------------------------------------

You are right in that the remaining data points are irrelevant for sumXY. But I believe that the data points *are* relevant for sumX and sumY, because these values should be computed for the entirety of both feature arrays.

This may not be the case for some similarity metrics, but it is for UncenteredCosineSimilarity, which is what I'm working with. For this metric, the full sumX and sumY are necessary, for instance, for computing the magnitude of each preference array (http://en.wikipedia.org/wiki/Cosine_similarity : here the ||A|| and ||B||).

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-430.
------------------------------

    Fix Version/s: 0.4
       Resolution: Fixed

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-HIll
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefLength<yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for(int i = startingFrom; i < preferences.length; i++){
>     double val = preferences[i].getValue();
>     squareSum += val*val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.