Posted to user@mahout.apache.org by gabeweb <ga...@htc.com> on 2010/10/18 11:32:37 UTC

0.0 as null versus number in recommender

I wanted to ask about the problem of whether to treat 0.0 as a null value or
a valid number in the Mahout recommenders.  Basically, Mahout treats 0.0 as
the null value for ratings, probably because this is compatible with the
sparse vector representations (most obviously with
RandomAccessSparseVector).  But in principle, there is no reason why 0.0
should not be a valid rating, and the null value should be represented by
"null".  Have other folks come up against this problem, and if so, how have
they solved it?  Modifying RandomAccessSparseVector looks like a lot of
work.  I could just add n to all of my ratings so that min(rating+n) > 0 and
then subtract n from the predicted ratings, but that's really, really ugly. 
Does anyone have a better idea?  Thanks.

-- 
View this message in context: http://lucene.472066.n3.nabble.com/0-0-as-null-versus-number-in-recommender-tp1723927p1723927.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: 0.0 as null versus number in recommender

Posted by gabeweb <ga...@htc.com>.
Thanks, I think the small value idea is probably on the right track.  I
should clarify that I've written a Recommender that uses the
clustering-on-Hadoop bits of Mahout, and in doing so I explicitly map the
DataModel preference vectors, which distinguish 0 from null as you say, to
RandomAccessSparseVectors, which don't.  So at that point I just extend the
mapping to go from 0.0 to a small number (MIN_VALUE is actually too small
for me, since my distance metric ends up calculating MIN_VALUE *
MIN_VALUE, which of course zeroes out, which I don't want).
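
To illustrate the underflow, here is a minimal standalone sketch in plain Java
(no Mahout classes). The class name and the 1e-9 placeholder are assumptions
made for the example, not values taken from my actual code:

```java
public class TinyValueUnderflow {
    // Hypothetical stand-in for a stored 0.0 rating; 1e-9 is an assumed choice.
    public static final double ZERO_PLACEHOLDER = 1e-9;

    // The kind of term a squared-distance metric computes per dimension.
    public static double squared(double x) {
        return x * x;
    }

    public static void main(String[] args) {
        // Double.MIN_VALUE is about 4.9e-324; its square is far below the
        // smallest representable double, so the product underflows to 0.0.
        System.out.println(squared(Double.MIN_VALUE) == 0.0);  // true

        // A larger placeholder survives squaring (1e-18 is representable).
        System.out.println(squared(ZERO_PLACEHOLDER) > 0.0);   // true
    }
}
```

Any placeholder works as long as it stays well below the smallest real rating
difference and its square is still representable.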

Re: 0.0 as null versus number in recommender

Posted by Sean Owen <sr...@gmail.com>.
The simplest and best answer is that 0 is not the same as null. The
framework does not treat them as the same, as a rule. A preference of 0 has
some effect on computations; a preference that does not exist has none.

The twist here is that there is no such thing as "null" for the mathematical
entity that is a vector. A vector's value is implicitly 0.0 in any dimension
that has not otherwise been set. This is its "null". In the context of
recommenders it should still be true that these "null" dimensions have no
effect on the computation, which means this principle is preserved.
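
As a rough illustration of that point, here is a toy sparse vector backed by a
hash map. This is only a sketch of the general idea, not Mahout's
RandomAccessSparseVector, and the class and method names are made up for the
example:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseVectorSketch {
    // Only explicitly set dimensions are stored.
    private final Map<Integer, Double> entries = new HashMap<>();

    public void set(int index, double value) {
        entries.put(index, value);
    }

    // Any dimension that was never set reads back as 0.0.
    public double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }

    public static void main(String[] args) {
        SparseVectorSketch v = new SparseVectorSketch();
        v.set(3, 0.0);  // an explicit 0.0 rating for item 3
        // Item 7 was never rated, yet both reads give the same value,
        // so a consumer of get() cannot tell the two cases apart.
        System.out.println(v.get(3) == v.get(7));  // true
    }
}
```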

I can't say I'm 100% sure that 0.0 preferences have no effect on
computations involving RandomAccessSparseVector in the way that non-existent
preferences have no effect on the non-distributed computations. It's a good
principle, but not guaranteed, I would say. It's an artifact of the fact
that things aren't 100% the same when projected entirely into the world of
vector math, useful as that projection is.

However it does mean that there's no way to actually express a 0.0
preference in the context of Hadoop-based computation that involves the
likes of RandomAccessSparseVector. This is a non-trivial difference and
problem indeed.

Don't add an epsilon to all ratings, no. That makes your vectors completely
un-sparse, which is a killer. Instead, if you really want to express 0.0, I
might set it to some very small value, perhaps Double.MIN_VALUE. This slight
distortion ought to make no practical difference.
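
Something along these lines, as a sketch only. The class and method names are
hypothetical, and the mapping would be applied at whatever point you copy
preferences into the vector and read them back out:

```java
public class ZeroRatingCodec {
    // Placeholder stored in the vector in place of a real 0.0 rating.
    // Double.MIN_VALUE is the suggestion here; a larger value may be needed
    // if the downstream math multiplies small values together.
    public static final double ZERO_PLACEHOLDER = Double.MIN_VALUE;

    // Apply when writing a preference into the sparse vector.
    public static double encode(double rating) {
        return rating == 0.0 ? ZERO_PLACEHOLDER : rating;
    }

    // Apply when reading a stored, unmodified value back out.
    public static double decode(double stored) {
        return stored == ZERO_PLACEHOLDER ? 0.0 : stored;
    }

    public static void main(String[] args) {
        System.out.println(encode(0.0) != 0.0);          // true: the entry is now non-zero
        System.out.println(decode(encode(0.0)) == 0.0);  // true: round trip restores 0.0
        System.out.println(decode(encode(3.5)) == 3.5);  // true: other ratings are unchanged
    }
}
```

Only ratings that are exactly 0.0 are touched, so the vectors stay as sparse
as they were.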

