Posted to dev@mahout.apache.org by "Paulo Villegas (Created) (JIRA)" <ji...@apache.org> on 2011/11/27 22:13:39 UTC

[jira] [Created] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Error in formula for preference estimation in GenericItemBasedRecommender
-------------------------------------------------------------------------

                 Key: MAHOUT-898
                 URL: https://issues.apache.org/jira/browse/MAHOUT-898
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
         Environment: mahout-core
            Reporter: Paulo Villegas
            Assignee: Sean Owen
            Priority: Minor
             Fix For: 0.6


The formula to estimate the preference for an item in the Taste item-based recommender normalizes by the sum of similarities for the items used in the estimation. But the terms in that normalizing sum should be taken in absolute value, since they can be negative (e.g. when using Pearson correlation, similarity is in [-1,1]). Currently they are not, and as a result negative and positive values cancel out, giving a small denominator and incorrectly boosting the preference for the item (symptom: it is easy for a predicted preference to take the maximum value, since the quotient becomes large and is capped afterwards).

The patch is rather trivial (a one-liner, actually) for src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java
Note: the same error occurs in GenericUserBasedRecommender, and the same fix applies there.
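
For reference, the two variants under discussion can be written as follows. The notation is an assumption introduced here, not taken from the Mahout source: p_j is the user's preference for an already-rated item j, and s_ij is the similarity between j and the item i being estimated.

```latex
\hat{p}_i = \frac{\sum_{j} s_{ij}\, p_j}{\sum_{j} s_{ij}}
\quad\text{(behaviour described as current)}
\qquad
\hat{p}_i = \frac{\sum_{j} s_{ij}\, p_j}{\sum_{j} \lvert s_{ij} \rvert}
\quad\text{(proposed)}
```

In both cases the result is capped to the valid preference range afterwards. The two expressions agree whenever all similarities are positive.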

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158654#comment-13158654 ] 

Ted Dunning commented on MAHOUT-898:
------------------------------------

{quote}
Incidentally, we also tried log-likelihood as similarity metric (and a few other ones); we settled on Pearson because it worked best. This was before I measured precision/recall; I'll probably now repeat the experiments to get those metrics with log-likelihood and see what comes up.
{quote}

As you correctly noted earlier, precision@10 or precision@20 is a much better measure of quality.  It will be good to hear your results.

When you do test with log-likelihood, make sure you try with two strategies.  First with only positive votes as interactions and secondly with any vote as an interaction.
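
The two strategies can be sketched as a preprocessing step that turns ratings into the boolean interactions a log-likelihood similarity would consume. This is a minimal illustration, not Mahout code: the function name `to_interactions`, the `(user, item, rating)` tuple layout, and the threshold of 4 for a "positive" vote are all assumptions.

```python
def to_interactions(ratings, positive_only, threshold=4.0):
    """Return the set of (user, item) pairs that count as interactions.

    ratings: iterable of (user, item, rating) tuples (assumed layout).
    positive_only: if True, keep only votes at or above `threshold`
        (an assumed cut-off for a "positive" vote); if False, any vote counts.
    """
    return {
        (user, item)
        for user, item, rating in ratings
        if not positive_only or rating >= threshold
    }


# Made-up sample data, for illustration only.
ratings = [("u1", "A", 5.0), ("u1", "B", 2.0), ("u2", "A", 4.0), ("u2", "C", 1.0)]
positive = to_interactions(ratings, positive_only=True)   # strategy 1
any_vote = to_interactions(ratings, positive_only=False)  # strategy 2
```

Under the first strategy a low rating is treated the same as no rating at all; under the second, the act of rating is itself the signal.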

                

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158101#comment-13158101 ] 

Sean Owen commented on MAHOUT-898:
----------------------------------

(Pearson is often mentioned in early literature but it's hardly the fastest or generally best metric -- log-likelihood is a better default. But that's a separate question.)

Now of course if negative similarity rarely comes up, it doesn't matter as much what we do with it. It would not be a big deal to change it. I think the key questions are both what is least surprising, and what is most effective? I tend to want to implement the simple logical "base case" thing and leave hooks to modify. So what's the simple, logical thing here?

Stick to a 1-5 rating range for simplicity. Say you've rated item A a 4, and it has similarity s to item B. (Ignore any other items.) I think it would be surprising if the weighted average here were not 4, regardless of s -- because it is for all positive s. But what if s is negative? Your change would mean any such item has an estimated rating of -4, which is then capped to 1. But this is a very dissimilar item. Maybe 1 makes more sense as an estimate than 4. But it's going to be 1 for any such item regardless of the rating too, not just the similarity. That somehow feels funny.

Really, a weight represents a strength of vote for a certain answer in the weighted average. Higher weights push the answer towards the value they weight. A 0 weight does nothing. A negative weight is therefore a vote for the answer to be far from the value it weights. 1 is a vote to make the answer exactly X; -1 is a vote to make it infinitely far from X.

So say you have a rating of 3 with similarity -1, and a 4 with similarity 1. Right now the implementation would estimate "infinity" and cap it to 5. This change would cause it to estimate 0.5 and cap to 1. 5 seems right-er than 1, in balancing being "exactly 4" and "extremely far from 3". I *think* you'd find other scenarios work out this way. (It assumes you accept the premise of what a negative weight should mean.)
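
The scenarios above can be checked with a small sketch. It is not the Mahout implementation: the function name `estimate`, the `(rating, similarity)` pair layout, and the handling of a zero denominator (treated as plus or minus infinity before capping, NaN if the numerator is also zero) are assumptions made for illustration.

```python
import math


def estimate(rated, abs_denominator=False, min_pref=1.0, max_pref=5.0):
    """Similarity-weighted average of ratings, capped to [min_pref, max_pref].

    rated: list of (rating, similarity) pairs for items the user has rated.
    abs_denominator: False sums the raw similarities (the behaviour described
        as current in this thread); True sums their absolute values (the
        proposed change).
    """
    numerator = sum(sim * rating for rating, sim in rated)
    denominator = sum(abs(sim) if abs_denominator else sim for _, sim in rated)
    if denominator == 0.0:
        if numerator == 0.0:
            return math.nan  # no information either way (assumed convention)
        raw = math.copysign(math.inf, numerator)
    else:
        raw = numerator / denominator
    return max(min_pref, min(max_pref, raw))


two_items = [(3.0, -1.0), (4.0, 1.0)]
current = estimate(two_items)                         # infinite, capped to 5.0
proposed = estimate(two_items, abs_denominator=True)  # 0.5, capped to 1.0
```

With a single item rated 4 and a negative similarity, the same function gives 4.0 with the raw denominator and 1.0 (that is, -4 capped) with the absolute-value denominator, matching the earlier paragraph.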


Of course your point stands that this often leads to behavior that's intuitively undesirable. While at the moment I still feel like the handling of negative weights is as logical as it can be, I think negative weights are problematic. I suppose I'd say "don't use them" is the real solution. You could modify / wrap Pearson to add 1 to the similarity value (accepting that this has its own logic issues, but may be better in practice). Or, probably better, use another metric.


Did you happen to test both ways to see if it consistently makes better recommendations on a data set? Not suggesting you need to, just curious. That would be an interesting empirical test.


I think it's an interesting question and good to open it up again. I liked Tom/Tamas's logic, which I've remembered and cribbed above, from last time. What do you think?
                

        

[jira] [Updated] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Paulo Villegas (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paulo Villegas updated MAHOUT-898:
----------------------------------

    Attachment: GenericItemBasedRecommender.diff

The patch for GenericItemBasedRecommender.java
                

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Paulo Villegas (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158645#comment-13158645 ] 

Paulo Villegas commented on MAHOUT-898:
---------------------------------------

Hi, thanks for the long reply!

I did test both approaches on a dataset: I took the MovieLens 1M dataset, did the usual training/test split and measured the output. Interestingly, my proposed modification actually increased the prediction error (MAE from 0.78 to 1.01, for one of the runs). But the precision/recall measures that I also took give a totally different picture. For instance, Precision@10 is 0.05% (i.e. negligible) for the original version, while the modified version gets 5%, two orders of magnitude greater. Recall behaves similarly (0.04% against 1.6% for recall@10). You really can see that in the recommendation lists: the modified version usually produces recommendations that are much more recognizable and "semantically" related to the training set (though as I said, a degree of surprise is also good).
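
For readers unfamiliar with the metrics, here is a minimal sketch of how precision@k, recall@k and MAE are commonly computed. The exact evaluation protocol used for the figures above is not stated in the thread, so the definitions below (hits divided by k, hits divided by the number of held-out items) and the function names are assumptions.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """recommended: ranked list of item ids; relevant: set of held-out items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up data, for illustration only.
p_at_2, r_at_2 = precision_recall_at_k(["a", "b", "c", "d"], {"b", "x"}, k=2)
mae = mean_absolute_error([4.0, 2.0], [5.0, 2.0])
```

MAE only looks at how close each predicted value is to the true rating, while precision and recall look at which items reach the top of the list, which is why the two kinds of metric can move in opposite directions.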

What I believe is happening is that the original version pushes items to the top (capped to the maximum) based on negative correlations, and they totally block the opportunity for the items in the test set to get recommended, hence the low precision/recall. But the modified version, which works much better in this context, increases prediction error because it tends to produce lower prediction values; given that the rating statistics are biased towards higher values (there are fewer low ratings), this increases the overall error. But this is just an unconfirmed guess.

In addition to the cases you mention, one potential caveat in my proposed modification is the asymmetry in how negative correlations affect positive and negative ratings. Negative correlations imply the behaviour should be reversed, and this works to convert high ratings into low ratings, but not vice versa (low ratings get converted into still lower ones). I believe this is something that mean-centering or the like could solve (in practical terms, if ratings went from -2 to 2, it would work more naturally). This is something I intend to do, but the modification to the software is not as straightforward as the abs.
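
A minimal sketch of the mean-centering idea, under stated assumptions: it is not the planned patch, the function name `estimate_mean_centered` is hypothetical, and which mean to subtract (the scale midpoint, the user's mean, the item's mean) is left open as a parameter.

```python
def estimate_mean_centered(rated, mean, min_pref=1.0, max_pref=5.0):
    """Weighted average of mean-centered ratings, shifted back by `mean`.

    rated: list of (rating, similarity) pairs.
    mean: the value subtracted before weighting (assumed to be supplied by
        the caller; the thread does not fix which mean to use).
    """
    numerator = sum(sim * (rating - mean) for rating, sim in rated)
    denominator = sum(abs(sim) for _, sim in rated)
    if denominator == 0.0:
        return mean  # no usable neighbours: fall back to the mean (assumption)
    return max(min_pref, min(max_pref, mean + numerator / denominator))


# With ratings centered on 3, a perfectly negative correlation mirrors
# the rating around the mean in both directions.
high_becomes_low = estimate_mean_centered([(5.0, -1.0)], mean=3.0)
low_becomes_high = estimate_mean_centered([(1.0, -1.0)], mean=3.0)
```

Without centering, the absolute-value variant maps both a 5 and a 1 with similarity -1 to a negative raw value that is capped to 1, which is the asymmetry described above.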

Incidentally, we also tried log-likelihood as similarity metric (and a few other ones); we settled on Pearson because it worked best. This was before I measured precision/recall; I'll probably now repeat the experiments to get those metrics with log-likelihood and see what comes up.

                

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Paulo Villegas (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158668#comment-13158668 ] 

Paulo Villegas commented on MAHOUT-898:
---------------------------------------

{quote}
When you do test with log-likelihood, make sure you try with two strategies. First with only positive votes as interactions and secondly with any vote as an interaction.
{quote}

I'll do that, yes. Can't work on it right now, but will try it later this week.

Sean: yes, that would be a good solution. Anyone would then be able to try both approaches for their use case (since I don't believe there is such a thing as a universal solution).
                

        

[jira] [Updated] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-898:
-----------------------------

    Fix Version/s:     (was: 0.6)

Happy to move this back into 0.6 when there's a patch.
                

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158028#comment-13158028 ] 

Sean Owen commented on MAHOUT-898:
----------------------------------

I understand the issue, but this doesn't fix it. Say your ratings are between 1 and 5. Say you have similarity -0.5 to an item rated 3 and -0.5 to an item rated 4. Using the absolute value in the denominator only would lead you to estimate a preference of -3.5, which is also not possible. It's not even reasonable to cap it to 1 here.
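
Written out with the numbers from this example (numerator: each similarity times its rating; denominator: the sum of similarities, with or without absolute values), the arithmetic is:

```latex
\frac{(-0.5)(3) + (-0.5)(4)}{\lvert -0.5 \rvert + \lvert -0.5 \rvert}
  = \frac{-3.5}{1} = -3.5
\qquad\text{versus}\qquad
\frac{(-0.5)(3) + (-0.5)(4)}{(-0.5) + (-0.5)}
  = \frac{-3.5}{-1} = 3.5
```

The first expression uses the absolute-value denominator; the second uses the raw sum, where the signs cancel and the result is the plain average of the two ratings.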

Really... negative weights are just a problem since they don't make sense. In practice, in the framework, the *only* metric with this problem is Pearson, since it's the only one that actually returns values < 0. In retrospect it would have been nicer to define this as returning a value between 0 and 1.

You could use (1+similarity) as a weight, since that's at least nonnegative. I feel like I did it this way in the beginning... and took it out as it caused another problem. I'd have to think about just why that was. We could go back to that; it has non-trivial implications.

I don't want to make this exact change, but I'll leave the issue open for other ideas.
                

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Paulo Villegas (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163539#comment-13163539 ] 

Paulo Villegas commented on MAHOUT-898:
---------------------------------------

I sent the 'trivial' patch (taking absolute value) as an attachment above. I could do a similar quick fix for the GenericUserBasedRecommender.

The not-so-trivial patch (mean centering of ratings before applying the formula) will take a little longer, since I'm still coming to grips with the code and how to insert that.

BTW I hope to provide later today the promised precision & recall values for log-likelihood.
                

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Paulo Villegas (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158051#comment-13158051 ] 

Paulo Villegas commented on MAHOUT-898:
---------------------------------------

I think that your example is actually desired behaviour :-) Pearson correlation measures linear dependency between two variables; if it's 0 it means they are independent of each other (at least linearly), so that item shouldn't influence your preference, and it does work that way. But if it has a negative value, it means that there is a linear dependence with negative slope. That is, my preferences for the item being estimated are negatively correlated with those other items: when they have a high rating, mine for the new item should be low. So, if the items have 3 & 4, giving a 1 (capping to the minimum) is not totally unreasonable, though perhaps a bit extreme (having only items with negative correlations shouldn't happen too often anyway, though I have indeed seen it).

Even though Pearson is the only metric producing negative values, it is not a fringe case, since it is probably the most used metric for neighborhood CF (and for good reason -- it tends to produce the best results and it costs much less than rank-based metrics such as Spearman). Hence ensuring it behaves reasonably is good.

I saw the (1+similarity) variant when looking at previous versions; it comes from issue MAHOUT-321. But the problem, when it comes to Pearson, is that it lets items with a correlation of 0 influence the final result (and they shouldn't, since they are uncorrelated with the item being computed).
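
The effect can be seen in a small sketch. It is an illustration only, not the MAHOUT-321 code: the function names and the `(rating, similarity)` pair layout are assumptions.

```python
def estimate_raw(rated):
    """Weighted average using the similarity itself as the weight."""
    total = sum(sim for _, sim in rated)
    if total == 0.0:
        return float("nan")
    return sum(sim * rating for rating, sim in rated) / total


def estimate_shifted(rated):
    """Weighted average using (1 + similarity) as the weight."""
    total = sum(1.0 + sim for _, sim in rated)
    if total == 0.0:
        return float("nan")
    return sum((1.0 + sim) * rating for rating, sim in rated) / total


# One uncorrelated item rated 5 and one perfectly correlated item rated 2.
rated = [(5.0, 0.0), (2.0, 1.0)]
raw = estimate_raw(rated)          # the uncorrelated item has no say
shifted = estimate_shifted(rated)  # the uncorrelated item now gets weight 1
```

Here the raw weights give 2.0, driven entirely by the correlated item, while the shifted weights give (1*5 + 2*2) / 3 = 3.0, so the item with zero correlation pulls the estimate upwards.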

The estimation would probably work better if ratings were mean-centered (i.e. the mean is removed before the preference estimation), which is also standard practice. I'm trying to do something along these lines, but in the meantime I proposed the 'abs' solution to at least avoid bizarre outputs (the current behaviour produces 'surprising' recommendations, and while some serendipity is desirable in a recommender, it would be better to have a way of controlling it).
                

        

[jira] [Commented] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158657#comment-13158657 ] 

Sean Owen commented on MAHOUT-898:
----------------------------------

Yes, I could imagine this improves metrics in some cases. I ran a little test and actually saw a small RMSE decrease over the existing implementation, for example. I truly don't know whether it's overall going to help or hurt things.

I would actually phrase your suggestion differently: instead of construing a negative weight as a vote against a value in the weighted average, it's construing it as a *positive* vote for the *opposite* value. Here opposite means the negative of the rating. And that's the only bit I have a problem with, conceptually. If the opposite of 4 on a scale of 5 were 2, instead of -4, it would seem complete. (Really, should be as far below the user's mean rating as 4 is above it -- and it happens to do that automatically if the mean is already 0, yes. It won't be 0 in general.)

I think that's a perfectly coherent strategy, one I hadn't thought of before. It is different from what's in the literature and what's been in the code. I still hesitate to change the simple weighted average here. At the same time I think it would be fine to incorporate this other strategy.

We could make this pluggable, with a default implementation that does what the algorithm does today. It adds yet another hook and pluggable module to worry about, but I don't think it's so bad.
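
As a rough idea of what such a hook might look like (purely a sketch; the class name, constructor, and `estimate` method are hypothetical and not part of Mahout's API), the denominator weighting could be passed in as a function, with the identity as the default so that existing behaviour is unchanged:

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of a pluggable normalization hook; not Mahout's API.
final class PluggableEstimatorSketch {

  // Maps each similarity to its contribution to the denominator.
  private final DoubleUnaryOperator denominatorWeight;

  // Default: signed sum of similarities, i.e. what the issue describes as today's behaviour.
  PluggableEstimatorSketch() {
    this(w -> w);
  }

  PluggableEstimatorSketch(DoubleUnaryOperator denominatorWeight) {
    this.denominatorWeight = denominatorWeight;
  }

  // Weighted average of preferences, normalized by the plugged-in weighting.
  double estimate(double[] sims, double[] prefs) {
    double num = 0.0;
    double den = 0.0;
    for (int i = 0; i < sims.length; i++) {
      num += sims[i] * prefs[i];
      den += denominatorWeight.applyAsDouble(sims[i]);
    }
    return den == 0.0 ? Double.NaN : num / den;
  }
}
```

Under this sketch, `new PluggableEstimatorSketch()` keeps the current normalization, while `new PluggableEstimatorSketch(Math::abs)` opts in to the absolute-value normalization from the patch.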

Am I missing anything easier? Looking for a way to balance the many issues in this thread as best we can.

                
> Error in formula for preference estimation in GenericItemBasedRecommender
> -------------------------------------------------------------------------
>
>                 Key: MAHOUT-898
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-898
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>         Environment: mahout-core
>            Reporter: Paulo Villegas
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.6
>
>         Attachments: GenericItemBasedRecommender.diff
>
>
> The formula to estimate the preference for an item in the Taste item-based recommender normalizes by the sum of similarities for items used in estimation. But the terms in the sum taken to normalize should be in absolute value, since they can be negative (e.g. when using Pearson correlation, similarity is in [-1,1]). Now they are not, and as a result when there are negative and positive values they cancel out, giving a small denominator and incorrectly boosting the preference for the item (symptom: it is easy for a predicted preference to take the maximum value, since the quotient becomes large and it is capped afterwards)
> The patch is rather trivial (a one-liner, actually) for src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java
> Note: the same error & suggested fix happens in GenericUserBasedRecommender


        

[jira] [Resolved] (MAHOUT-898) Error in formula for preference estimation in GenericItemBasedRecommender

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-898.
------------------------------

    Resolution: Won't Fix

I think this particular discussion ended up as a WontFix, at least for purposes of this release.
                
