Posted to user@mahout.apache.org by Mattias Hilliges <hi...@neofonie.de> on 2010/04/26 15:02:03 UTC

Problems with AbstractSimilarity

Hi,
I noticed the following behaviour, which seems a bit strange to me:
Let v=(v1, v2, ..., vn) and w=(w1, w2, ..., wm) be vectors that are
used to compute the similarity between two items/users. If all vi that
overlap with w (meaning vi!=0 and wi!=0) are equal, and all wj that
overlap with v are equal, neither the Euclidean nor the Pearson
similarity can be computed.

The attached test considers the following vectors: v=(0.2, 0.2, 0.4)
and w=(0.7, 0.7, 0). The overlapping components of v are all 0.2, and
the overlapping components of w are all 0.7.

The problem is that "double computeResult(int n, double sumXY, double
sumX2, double sumY2, double sumXYdiff2)" in the corresponding subclass
of AbstractSimilarity is called with parameters sumXY=sumX2=sumY2=0 and
therefore returns Double.NaN. This contradicts the behaviour described
in the book "Mahout in Action", p.49. The last complete sentence there
is: "Note that we were able compute some notion of similarity for all
pairs of users here, whereas the Pearson correlation couldn't produce an
answer for users 1 and 3." Because of the problem described above, the
Euclidean algorithm can't produce an answer either. That case is a
special case of the problem, in which there is only one overlap.

Regards,
Mattias

-- 
--------------------------------
Mattias Hilliges
Software Development
Research and Development

neofonie
Technologieentwicklung und
Informationsmanagement GmbH
Robert-Koch-Platz 4
10115 Berlin
fon: +49.30 24627 100
fax: +49.30 24627 120
mattias.hilliges@neofonie.de
http://www.neofonie.de 

Commercial register
Berlin-Charlottenburg: HRB 67460

Management
Helmut Hoffer von Ankershoffen
(Spokesman of the management)
Nurhan Yildirim
--------------------------------


Re: Problems with AbstractSimilarity

Posted by Mattias Hilliges <hi...@neofonie.de>.
Many thanks for the detailed answer. We probably won't use the
Euclidean distance in our project anyway. I just wanted to document the
behaviour because I thought it could be a bug.

Mattias



Re: Problems with AbstractSimilarity

Posted by Sean Owen <sr...@gmail.com>.
On Mon, Apr 26, 2010 at 2:02 PM, Mattias Hilliges <hi...@neofonie.de> wrote:
> Hi,
> i detected the following behaviour, that seems a bit strange to me:
> Be v=(v1, v2,...,vn) and w=(w1, w2, ...,wm) vectors, that are used to
> compute the similarity between two items/users. If all vi, that overlap
> with w (this means vi!=0 and wi!=0), are equal, and if all wj, that
> overlap with v, are equal, no euclidean or pearson similarity can be
> computed.

The Pearson correlation is undefined on two series if either one
consists entirely of identical values. In that case the standard
deviation of that series is 0, and the correlation computation involves
scaling by (dividing by) the standard deviations.
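
To make the Pearson case concrete, here is a minimal Java sketch. It is
not Mahout's code: the class name PearsonSketch and the method pearson
are made up for illustration, and it simply computes the textbook
correlation over the overlapping values, returning NaN when either
series has zero variance.

```java
// Illustrative sketch only: PearsonSketch and pearson() are hypothetical
// names and are not part of Mahout.
class PearsonSketch {

    // Textbook Pearson correlation of two equally long series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumX2 += dx * dx;
            sumY2 += dy * dy;
        }

        // Proportional to the product of the two standard deviations.
        double denominator = Math.sqrt(sumX2 * sumY2);
        if (denominator == 0.0) {
            // At least one series is constant: the correlation is undefined.
            return Double.NaN;
        }
        return sumXY / denominator;
    }

    public static void main(String[] args) {
        // Overlapping components from the original report.
        double[] v = {0.2, 0.2};
        double[] w = {0.7, 0.7};
        System.out.println(pearson(v, w)); // prints NaN
    }
}
```

Both series are constant over the overlap, so every deviation from the
mean is 0 and there is nothing to divide by.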

For Euclidean, the distance is normalized by the sum of the sizes
(norms) of the preference vectors. In this case both sizes are 0,
because the data is centered (mean 0), so these equal values all map
to 0 and both vectors become (0,0). It's quite a corner case.
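
The effect of centering plus normalization can be sketched as follows.
This is only a reading of the description above, not the actual Mahout
source: the class and method names are hypothetical, and the formula
1 - distance / (|x| + |y|) is an assumption chosen to match "normalized
by the sum of the sizes".

```java
// Illustrative sketch only: names and the exact formula are assumptions,
// not taken from Mahout.
class CenteredEuclideanSketch {

    // Subtracts the mean from every value of the series.
    static double[] center(double[] values) {
        double mean = 0.0;
        for (double value : values) {
            mean += value;
        }
        mean /= values.length;
        double[] centered = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            centered[i] = values[i] - mean;
        }
        return centered;
    }

    // Euclidean distance of the centered series, divided by the sum of
    // their norms and turned into a similarity.
    static double similarity(double[] x, double[] y) {
        double[] cx = center(x);
        double[] cy = center(y);
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        double sumXYdiff2 = 0.0;
        for (int i = 0; i < cx.length; i++) {
            sumX2 += cx[i] * cx[i];
            sumY2 += cy[i] * cy[i];
            double diff = cx[i] - cy[i];
            sumXYdiff2 += diff * diff;
        }
        double denominator = Math.sqrt(sumX2) + Math.sqrt(sumY2);
        if (denominator == 0.0) {
            // Both centered vectors are all zeros: nothing to normalize by.
            return Double.NaN;
        }
        return 1.0 - Math.sqrt(sumXYdiff2) / denominator;
    }

    public static void main(String[] args) {
        // Overlapping components from the original report.
        System.out.println(similarity(new double[]{0.2, 0.2},
                                      new double[]{0.7, 0.7})); // prints NaN
    }
}
```

With a single overlapping value the same thing happens, since one value
minus its own mean is always 0.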

This normalization step is a bit questionable, and old, and could be
removed. The idea is to not let one user's scale of preference values
affect the result -- whether I rate on a 1 to 5 or a 10 to 50 scale.
This is for consistency with Pearson's behavior. But I think you could
easily argue it's not necessary or even desirable to emulate this
property.

>
> The problem is, that "double computeResult(int n, double sumXY, double
> sumX2, double sumY2, double sumXYdiff2)" in the corresponding subclass
> of AbstractSimilarity is called with parameters sumXY=sumX2=sumY2=0 and
> therefore returns Double.NaN. This behaviour contradicts the behaviour
> described in the book "Mahout in Action", p.49. The last complete
> sentence here is: "Note that we were able compute some notion of
> similarity for all pairs of users here, whereas the Pearson correlation
> couldn't produce an answer for users 1 and 3." Because of the described
> problem, the euclidean algorithm can't produce an answer either. This is
> a special case of the described problem, where there is only one overlap.

True, well, the book is presenting a simplified version of the
Euclidean similarity, without anything else that happens in the real
code, like centering or normalizing for dimension. The book is correct
about the simplified version, but its point does not hold for the
actual implementation as it stands now.
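
For comparison, a simplified version without centering or normalization
might look like the sketch below. The form 1 / (1 + d) is an assumption
about what "simplified" means here, and the class and method names are
hypothetical.

```java
// Illustrative sketch only: the 1 / (1 + d) form and all names are
// assumptions, not Mahout code.
class SimpleEuclideanSketch {

    // Similarity from the plain Euclidean distance d over the overlap.
    static double similarity(double[] x, double[] y) {
        double sumXYdiff2 = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sumXYdiff2 += diff * diff;
        }
        double distance = Math.sqrt(sumXYdiff2);
        // The denominator is at least 1, so the result is always defined.
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        // Overlapping components from the original report.
        System.out.println(similarity(new double[]{0.2, 0.2},
                                      new double[]{0.7, 0.7}));
    }
}
```

Because nothing is centered, constant or single-value overlaps still
produce a finite answer, which is the behaviour the book describes.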

I don't think that really harms the point, that funny things happen
with sparse data, but it's not ideal. And, given that the cause is a
normalization which can arguably be removed, I'd be fine removing that
normalization (unless someone stops me for a good reason). Then it
would all be consistent.

Sean