Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/02/25 18:19:42 UTC

Asymmetric training and query in recommenders

Recommenders need user preference data. The more the better, right? Well, yes and no…

Assume you have a catalog where new items are added often but older items also remain in stock for some time. Training on user preference data collected over a fairly long time period will likely be a good thing. But this history of everything a user has done may not be the best query to use for returning recs.

Using an offline precision metric (MAP@n) and real ecommerce data, we built Mahout recommender models on 3, 6, 9, and 12 months of data. We held out the most recent 10% for testing the recommender’s predictions. As one would expect, the more data the better. But I think there is a hidden problem in this.
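
For readers who have not used MAP@n, here is a minimal sketch of one common way to compute it. This is not the evaluation code used for the experiment above. The function names, the dict-based inputs, and the choice to normalize by min(n, number of held-out items) are all assumptions for illustration; other definitions of the denominator exist.

```python
def average_precision_at_n(recommended, held_out, n):
    """Average precision over the top n recs for one user.

    recommended: ranked list of item ids (best first), assumed free of duplicates.
    held_out: set of item ids the user actually chose in the test period.
    """
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended[:n], start=1):
        if item in held_out:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumption: normalize by the best achievable number of hits in n slots.
    denom = min(n, len(held_out))
    return total / denom if denom else 0.0


def map_at_n(recs_by_user, held_out_by_user, n):
    """Mean of average_precision_at_n over users that have held-out items."""
    users = [u for u, items in held_out_by_user.items() if items]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_n(recs_by_user.get(u, []), held_out_by_user[u], n)
        for u in users
    )
    return total / len(users)


# Hypothetical toy data: u1 gets hits at ranks 1 and 3, u2 gets none.
recs = {"u1": ["a", "b", "c"], "u2": ["x"]}
held_out = {"u1": {"a", "c"}, "u2": {"y"}}
score = map_at_n(recs, held_out, n=3)
```

With this toy data u1 scores (1/1 + 2/3) / 2 and u2 scores 0, so the mean is 5/12.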

Using a user’s entire history may not lead to the best recs for today. The intuition is that the most recent n actions should be used for making recs, thereby capturing the user’s current intent.

Unfortunately Mahout’s recommenders use the same data to build the “indicator matrix” as they do to make the query for returning recs.

Current Mahout:
B = history of all preferences by all users
Mahout calculates recs by doing
[B'B]B' = R, where [B'B] is actually the output of the RowSimilarityJob (RSJ) and so is an “indicator matrix”, not just a cooccurrence matrix. I always use the log-likelihood ratio (LLR) as the similarity measure in the RSJ, so [B'B] should be read as shorthand for this.
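
To make the shapes concrete, here is a toy sketch of the calculation in plain Python. It is not Mahout code: raw cooccurrence counts stand in for the LLR-weighted indicator matrix that the RowSimilarityJob would produce, and the 4x4 matrix B is made-up data.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Made-up preference history: rows are users, columns are items,
# 1 means the user expressed a preference for the item.
B = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]

Bt = transpose(B)             # items x users
indicators = matmul(Bt, B)    # items x items; raw cooccurrence, NOT LLR
R = matmul(indicators, Bt)    # items x users; column u holds item scores for user u

# Item scores for user 0, whose history is items 0 and 1.
user0_scores = [row[0] for row in R]
```

Note that B appears on both sides: once to build the item-item matrix and once, as B', to supply every user's full history as the query.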

The problem with this approach is that B is the only input and is therefore used for the query as well as for the training.

Using the Solr+Mahout recommender--where the query is in realtime and the training occurs periodically in the background--solves this problem nicely. The indicator matrix is produced from as much data as possible but there is no requirement that all of that be used in the query. For the Solr+Mahout recommender I’d rather say:
[B'B]h = R, where h is a single user's history going back as far as you think is useful and B is as much data as makes sense for your catalog. Picking h is probably best done by taking the most recent n actions/prefs rather than using a point-in-time cutoff, because some people are more active than others.
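
Here is a small sketch of what querying with h could look like, in plain Python rather than Solr. Everything in it is hypothetical: the function names, the hand-written indicator matrix (zero diagonal, two item clusters), and the choice to filter out items from the user's whole history while scoring only on the most recent n actions.

```python
def recent_history_vector(actions, n_items, n):
    """Build h from the n most recent actions (actions are item ids, oldest first)."""
    h = [0] * n_items
    for item in actions[-n:]:
        h[item] = 1
    return h


def recommend(indicators, actions, n, top_k):
    """Score items as [B'B]h and return the top_k items the user has not seen."""
    h = recent_history_vector(actions, len(indicators), n)
    scores = [sum(w * x for w, x in zip(row, h)) for row in indicators]
    seen = set(actions)  # assumption: never re-recommend anything in the full history
    unseen = [i for i in range(len(scores)) if i not in seen]
    return sorted(unseen, key=lambda i: -scores[i])[:top_k]


# Hypothetical indicator matrix: items 0-2 are related, items 3-4 are related.
indicators = [
    [0, 2, 3, 0, 0],
    [2, 0, 3, 0, 0],
    [3, 3, 0, 0, 0],
    [0, 0, 0, 0, 4],
    [0, 0, 0, 4, 0],
]

# The user's older actions were on items 0 and 1; the latest was on item 4.
actions = [0, 1, 4]

full_history_recs = recommend(indicators, actions, n=len(actions), top_k=2)
recent_recs = recommend(indicators, actions, n=1, top_k=2)
```

In this toy case the full history ranks item 2 first, because the two older actions outvote the latest one, while n=1 ranks item 3 first, following the most recent action. The indicator matrix is the same in both queries; only h changes.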

I think this indicates an improvement that could be made to Mahout’s recommender. Either B and h could be supplied separately, or we could leave the query to Solr.

Anyone have an opinion?