Posted to user@mahout.apache.org by Chloe Guszo <ch...@gmail.com> on 2013/08/01 04:15:35 UTC

Data distribution guidance for recommendation engines

Hi all,

This question stems from my use of the alternating least squares method in
Mahout, but errs on the theoretical side. If this is the wrong place for
such a question, I apologize up front and would gladly redirect it to a
more appropriate forum, as per your suggestions.

I have been thinking about how the distribution of rating data can
influence a model built using ALS or any matrix factorization method for
that matter.

If I split my data into train and test sets, I can show good performance of
the model on the train set. What might I expect given an uneven
distribution of ratings? Imagine a situation where 50% of the ratings are
1s, and the rest 2-5. Will the model be biased towards rating items a 1? Do
people pre-process their data to avoid skewed rating distributions? What
about the rating scale itself? For example, given a [1:3] vs. a [1:10]
range: with the former you'd have a 1/3 chance of guessing the correct
rating at random, while with the latter it is 1/10. Or, when is sparse too
sparse? Or can these questions not really be answered because they are too
system/context specific?

Ultimately, I'm trying to figure out under what conditions one would look
at a model and say "that is crap", pardon my language. Do any more
experienced users have any advice to offer on when a factor model would
break down, or on any of the points above?

Thanks in advance,
-Chloe

Re: Data distribution guidance for recommendation engines

Posted by Sean Owen <sr...@gmail.com>.
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo <ch...@gmail.com> wrote:
> If I split my data into train and test sets, I can show good performance of

Good performance according to what metric? It makes a lot of
difference whether you are talking about precision/recall or RMSE.

> the model on the train set. What might I expect given an uneven
> distribution of ratings? Imagine a situation where 50% of the ratings are
> 1s, and the rest 2-5. Will the model be biased towards rating items a 1? Do

In the general case, recommenders don't rate items at all; they rank
them. So this may not be a question that matters.

> about the rating scale itself. For example, given [1:3] vs [1:10] ranges,
> in with the former, you've got a 1/3 chance of predicting the correct
> rating, say, while in the latter case it is a 1/10.  Or, when is sparse too

Why do you say that? The recommender is not choosing ratings randomly.


> Ultimately, I'm trying to figure out under what conditions one would look
> at a model and say "that is crap", pardon my language. Do any more

You use evaluation metrics, which are imperfect, but about the best
you can do in the lab. If you're already doing that, you're doing
fine. This is true no matter what your input is like.

If your input is something like click counts, then the values will
certainly be mostly 1 and follow a power-law distribution. This is no
problem, but you want to use the 'implicit feedback' variant of ALS, where
you are not trying to reconstruct the input values themselves but instead
use them as confidence weights.
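For reference, that weighting idea can be sketched in a few lines of NumPy
(this is not the Mahout implementation; the function name and the `alpha`
and `reg` values are illustrative choices). A raw count r becomes a
confidence c = 1 + alpha*r attached to a binary preference p = 1 if r > 0,
and each factor update solves a confidence-weighted ridge regression:

```python
import numpy as np

def implicit_als(R, n_factors=2, alpha=40.0, reg=0.1, n_iters=10, seed=0):
    """R: user x item matrix of raw counts (e.g. clicks)."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    P = (R > 0).astype(float)   # binary preference: did the user act at all?
    C = 1.0 + alpha * R         # confidence: bigger count -> bigger weight
    for _ in range(n_iters):
        for u in range(n_users):
            # weighted ridge regression for this user's factor vector
            Cu = np.diag(C[u])
            A = Y.T @ Cu @ Y + reg * np.eye(n_factors)
            X[u] = np.linalg.solve(A, Y.T @ Cu @ P[u])
        for i in range(n_items):
            Ci = np.diag(C[:, i])
            A = X.T @ Ci @ X + reg * np.eye(n_factors)
            Y[i] = np.linalg.solve(A, X.T @ Ci @ P[:, i])
    return X, Y

# Toy count data: mostly small values with a couple of larger ones,
# roughly the skewed shape click data takes.
R = np.array([[5., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 3.],
              [0., 0., 1., 1.]])
X, Y = implicit_als(R)
scores = X @ Y.T   # preference scores, used for ranking, not as ratings
```

Note that the model is fit to the binary P, not to R, so a skewed count
distribution only changes how strongly each cell is weighted, not what the
model is asked to predict.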