Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2010/06/28 18:02:50 UTC

[jira] Updated: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics

     [ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-430:
-----------------------------

    Priority: Minor  (was: Major)
    Due Date: 30/Jun/10

Hmm, yeah, that doesn't look right in the case where you have the inferrer. Let me look at it again tonight and either put in a fix, if needed, or remember why it's done that way.

> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
>                 Key: MAHOUT-430
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-430
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Emerson Murphy-Hill
>            Assignee: Sean Owen
>            Priority: Minor
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics over each User's/Tool's PreferenceArrays, metrics like 'sumX' and 'sumY'. The algorithms go through each PreferenceArray in a single loop, comparing indexes to make sure we don't fall off the end. Eventually, we get to the end of an array, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
> ...
> The problem is, the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefIndex < yLength when we break, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs,yPrefIndex);
>     break;
>   }
> ...
> // Sums the squares of all preference values from index startingFrom to the end.
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for (int i = startingFrom; i < preferences.length; i++) {
>     double val = preferences[i].getValue();
>     squareSum += val * val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?
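
For illustration only, here is a minimal, self-contained sketch of the tail fix proposed in the description above. It does not use Mahout's classes: the class TailSumSketch, the method squareSums, the addTail flag, and the parallel long[]/double[] arrays standing in for ID-sorted preference lists are all hypothetical. It also assumes that every preference on each side should count toward that side's sum of squares, which may not match what AbstractSimilarity intends in every configuration.

```java
class TailSumSketch {

    // Sum of squares of values[startingFrom .. end); same idea as squareSumRest above.
    static double squareSumRest(double[] values, int startingFrom) {
        double squareSum = 0.0;
        for (int i = startingFrom; i < values.length; i++) {
            squareSum += values[i] * values[i];
        }
        return squareSum;
    }

    // Walks two ID-sorted lists in a single loop and returns {sumX2, sumY2}.
    // The loop stops as soon as either list is exhausted. With addTail == false
    // the unvisited entries of the other list are never counted (the reported
    // problem); with addTail == true they are added after the loop.
    static double[] squareSums(long[] xIds, double[] xVals,
                               long[] yIds, double[] yVals,
                               boolean addTail) {
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        int xi = 0;
        int yi = 0;
        while (xi < xIds.length && yi < yIds.length) {
            int compare = Long.compare(xIds[xi], yIds[yi]);
            if (compare <= 0) {
                sumX2 += xVals[xi] * xVals[xi];
                xi++;
            }
            if (compare >= 0) {
                sumY2 += yVals[yi] * yVals[yi];
                yi++;
            }
        }
        if (addTail) {
            // At most one of these two calls adds anything, since one list is exhausted.
            sumX2 += squareSumRest(xVals, xi);
            sumY2 += squareSumRest(yVals, yi);
        }
        return new double[] {sumX2, sumY2};
    }

    public static void main(String[] args) {
        long[] xIds = {1, 2};
        double[] xVals = {1.0, 2.0};
        long[] yIds = {2, 3, 4};
        double[] yVals = {2.0, 3.0, 4.0};

        double[] truncated = squareSums(xIds, xVals, yIds, yVals, false);
        double[] complete = squareSums(xIds, xVals, yIds, yVals, true);
        System.out.println("sumY2 without tail: " + truncated[1]);
        System.out.println("sumY2 with tail:    " + complete[1]);
    }
}
```

In this toy input the x list runs out after ID 2, so the y values for IDs 3 and 4 are only counted when the tail is added. Handling the tail once after the loop, rather than at each break, is one possible way to keep the loop body simpler.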
