Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2010/06/28 18:02:50 UTC
[jira] Updated: (MAHOUT-430) AbstractSimilarity improperly computes vector metrics
[ https://issues.apache.org/jira/browse/MAHOUT-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-430:
-----------------------------
Priority: Minor (was: Major)
Due Date: 30/Jun/10
Hmm yeah that doesn't look right, in the case where you have the inferrer. Let me look at it again tonight and put in a fix if needed or remember why it's done that way.
> AbstractSimilarity improperly computes vector metrics
> -----------------------------------------------------
>
> Key: MAHOUT-430
> URL: https://issues.apache.org/jira/browse/MAHOUT-430
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Emerson Murphy-Hill
> Assignee: Sean Owen
> Priority: Minor
>
> Looking at the userSimilarity and itemSimilarity methods in AbstractSimilarity, both compute metrics such as 'sumX' and 'sumY' over each User's/Tool's PreferenceArray. The algorithms walk both PreferenceArrays in a single loop, comparing indexes so that neither walk runs past the end of its array. Eventually one array is exhausted, which is caught here:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     break;
>   }
>   ...
> The problem is that the metrics may not be correct when the break occurs. Specifically, for the other array, the one that we *didn't* fall off the end of, the metrics don't reflect the preferences we have not yet visited. In the example above, if yPrefIndex < yLength, then sumY2 is too low. One fix is to do something like this:
> if (compare <= 0) {
>   if (++xPrefIndex >= xLength) {
>     sumY2 += squareSumRest(yPrefs, yPrefIndex);
>     break;
>   }
>   ...
> private double squareSumRest(Preference[] preferences, int startingFrom) {
>   double squareSum = 0;
>   for (int i = startingFrom; i < preferences.length; i++) {
>     double val = preferences[i].getValue();
>     squareSum += val * val;
>   }
>   return squareSum;
> }
> I believe that the problem affects the sumX and sumY variables (and probably sumXYdiff2), but not the sumXY, sumX2, or sumY2 variables.
> A couple of comments about these two methods:
> 1) They're really hard to reason about. Isn't there a simpler implementation?
> 2) The two methods are very similar. Can't they be combined somehow?
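
To make the reported behavior concrete, here is a minimal, self-contained Java sketch of the single-loop walk and the proposed tail correction. It is not Mahout's code: the class MergeWalkTailDemo, the method sumY2, the addTail flag, and the plain long[]/double[] arrays standing in for PreferenceArrays are all hypothetical names made up for illustration. It also assumes that every y preference the loop visits is added to sumY2, which may only hold in the inferrer case that Sean mentions. Only squareSumRest follows the helper proposed in the description.

```java
// Illustrative stand-in only; not the real AbstractSimilarity.
class MergeWalkTailDemo {

    /** Sum of squares of values[from..end), like the proposed squareSumRest helper. */
    static double squareSumRest(double[] values, int from) {
        double squareSum = 0.0;
        for (int i = from; i < values.length; i++) {
            squareSum += values[i] * values[i];
        }
        return squareSum;
    }

    /**
     * Walks two ID-sorted preference lists in one loop and accumulates the
     * squared value of every y preference that the loop visits (an assumption
     * of this sketch). With addTail == false, any y preferences left over once
     * either list is exhausted are silently dropped, which is the reported
     * problem. With addTail == true, they are added afterwards.
     */
    static double sumY2(long[] xIds, long[] yIds, double[] yVals, boolean addTail) {
        int xIndex = 0;
        int yIndex = 0;
        double sumY2 = 0.0;
        while (xIndex < xIds.length && yIndex < yIds.length) {
            int compare = Long.compare(xIds[xIndex], yIds[yIndex]);
            if (compare >= 0) {
                // y's current preference is consumed on this step, so count it
                sumY2 += yVals[yIndex] * yVals[yIndex];
                yIndex++;
            }
            if (compare <= 0) {
                xIndex++;
            }
        }
        if (addTail) {
            // yIndex now points at the first y preference that was never counted
            sumY2 += squareSumRest(yVals, yIndex);
        }
        return sumY2;
    }

    public static void main(String[] args) {
        long[] xIds = {1, 2};
        long[] yIds = {1, 2, 3, 4};
        double[] yVals = {1, 2, 3, 4};
        System.out.println(sumY2(xIds, yIds, yVals, false)); // prints 5.0
        System.out.println(sumY2(xIds, yIds, yVals, true));  // prints 30.0
    }
}
```

In this toy run the x list ends after two entries, so the uncorrected walk reports 5.0 while the full sum of squares of the y values is 30.0. Moving the end-of-array handling out of the loop body, as done here, is one possible way to make the walk easier to reason about; whether that fits the real methods would need checking against the actual source.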