Posted to dev@mahout.apache.org by Nicolas Maillot <nm...@gmail.com> on 2010/07/17 23:07:01 UTC

Mahalanobis Distance Implementation

Hi all,

I would like to contribute to Mahout by starting with a simple  task.

I had the idea of implementing the Mahalanobis distance:
http://en.wikipedia.org/wiki/Mahalanobis_distance
This should be quite easy and could be a useful feature.
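
For reference, the distance in question, between two vectors x and y drawn
from a distribution with covariance matrix S, is

  d(x, y) = sqrt( (x - y)' * inverse(S) * (x - y) )

so the implementation mainly comes down to handling the inverse covariance term.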

After looking at the Mahout code for a couple of hours, I have four
questions:

1) Generally speaking, what is the level of reliability of algorithms
implemented in matrix.linalg ?
2) As inverting a covariance matrix is required, is using the method
inverse(DoubleMatrix2D A) from class Algebra a good way to achieve this ?
3) Does it look reasonable if the computation of the distance is based on
dense matrices (inside the distance(Vector v1, Vector v2) method of the
future MahalanobisDistanceMeasure class)?
4) What is the best way to create a DoubleMatrix1D from a Vector ?

Thanks a lot for your help.

Cheers,

Nicolas

Re: Mahalanobis Distance Implementation

Posted by Robin Anil <ro...@gmail.com>.
FYI
public MahalanobisDistanceMeasure(Vector meanVector, Matrix inputMatrix,
                                  boolean inversionNeeded) {
  this.meanVector = meanVector;
  if (inversionNeeded) {
    setCovarianceMatrix(inputMatrix);
  } else {
    setInverseCovarianceMatrix(inputMatrix);
  }
}

@Override
public void configure(Configuration job) {
  // nothing to do
}

@Override
public Collection<Parameter<?>> getParameters() {
  return Collections.emptyList();
}

@Override
public void createParameters(String prefix, Configuration jobConf) {
  // nothing to do
}



This won't work in a clustering environment, where these classes are
instantiated by name via reflection. Modify the configure function to read
the deserialized vectors and matrices. See one of the weighted distance
measure classes for how the weight vector is handled.
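
For illustration, here is a rough sketch of what such a configure method
could look like, assuming the job driver writes the mean vector and the
inverse covariance matrix to HDFS as VectorWritable/MatrixWritable and puts
their paths into the job configuration. The property names and the overall
wiring here are hypothetical; the weighted distance measure classes mentioned
above are the pattern to copy:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.MatrixWritable;
import org.apache.mahout.math.VectorWritable;

// inside MahalanobisDistanceMeasure
@Override
public void configure(Configuration job) {
  // Hypothetical property names; the real keys would be set by the job driver.
  String meanPath = job.get("mahalanobis.meanVectorPath");
  String covPath = job.get("mahalanobis.inverseCovariancePath");
  if (meanPath == null || covPath == null) {
    return; // fall back to constructor-based configuration
  }
  try {
    FileSystem fs = FileSystem.get(job);
    FSDataInputStream meanIn = fs.open(new Path(meanPath));
    try {
      VectorWritable vw = new VectorWritable();
      vw.readFields(meanIn);   // deserialize the mean vector
      this.meanVector = vw.get();
    } finally {
      meanIn.close();
    }
    FSDataInputStream covIn = fs.open(new Path(covPath));
    try {
      MatrixWritable mw = new MatrixWritable();
      mw.readFields(covIn);    // deserialize the inverse covariance matrix
      setInverseCovarianceMatrix(mw.get());
    } finally {
      covIn.close();
    }
  } catch (IOException ioe) {
    throw new IllegalStateException(ioe);
  }
}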

Robin

Re: Mahalanobis Distance Implementation

Posted by Ted Dunning <te...@gmail.com>.
A patch or git URL is massively easier for me.

On Thu, Jul 22, 2010 at 4:11 AM, Nicolas Maillot <nm...@gmail.com> wrote:

> Can you integrate the code from the web source (there are only new
> files and nothing to merge), or do you prefer an "official" patch?
>

Re: Mahalanobis Distance Implementation

Posted by Nicolas Maillot <nm...@gmail.com>.
Hi Ted,

Thanks a lot for the trick.

I have just implemented it, and the SingularValueDecomposition class
now accepts matrices for which numRows() < numCols().
I have also implemented the svd covariance.

As a consequence, all tests are now ok.

There is no patch on JIRA yet.
Can you integrate the code from the web source (there are only new
files and nothing to merge), or do you prefer an "official" patch?

Cheers,

Nicolas



On Wed, Jul 21, 2010 at 7:19 PM, Ted Dunning <te...@gmail.com> wrote:
> Is this available as a patch on a JIRA in addition to your web hosted source
> code?  I think that this is very close to ready to commit.  The missing test
> can be handled pretty easily because svd(A') = V D U' so all that is needed
> is to detect the problematic shape, decompose the transposed matrix and
> reverse the roles of U and V.
>
> On Wed, Jul 21, 2010 at 1:09 AM, Nicolas Maillot <nm...@gmail.com> wrote:
>
>> Ted, Robin,
>>
>> Thanks for your remarks. I have taken them into account.
>>
>> I have added a set of unit tests for the SVD (based on
>> org.apache.commons.math.linear.SingularValueDecompositionImplTest):
>>
>> http://www.nicolas-maillot.net/MahoutContribs/TestSingularValueDecomposition.java
>> Note that I had to deactivate some tests due to limitations of the
>> CERN implementation:
>> -Tests where the input matrix has strictly fewer rows than columns
>> -Tests on svd covariance which is not (yet) implemented
>>
>> Otherwise all the "valid"  tests are running fine.
>>
>> I have also fixed the parameter management.
>>
>> Nicolas
>>
>>
>>
>>
>>
>> On Mon, Jul 19, 2010 at 11:08 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Nicolas,
>> >
>> > This was quick work.  Very nice to see.
>> >
>> > To finish the migration, we need to add test cases as well.
>> >
>> > A good source of test cases for the SingularValueDecomposition would
>> > be package
>> org.apache.commons.math.linear.SingularValueDecompositionImplTest
>> >
>> > On Mon, Jul 19, 2010 at 1:08 PM, Nicolas Maillot <nm...@gmail.com>
>> wrote:
>> >
>> >> As you can see, I have migrated SingularValueDecomposition (useful to
>> >> tackle the inverse covariance problem)  to use the Matrix Class.
>> >>
>> >
>>
>

Re: Mahalanobis Distance Implementation

Posted by Ted Dunning <te...@gmail.com>.
Is this available as a patch on a JIRA in addition to your web hosted source
code?  I think that this is very close to ready to commit.  The missing test
can be handled pretty easily because svd(A') = V D U' so all that is needed
is to detect the problematic shape, decompose the transposed matrix and
reverse the roles of U and V.
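
A rough sketch of that workaround, assuming the ported class keeps the
Colt/JAMA-style accessors (getU(), getS(), getV()) and that Mahout's Matrix
offers numRows(), numCols() and transpose(); treat the exact API as an
assumption:

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SingularValueDecomposition;

// If A' = U D V', then A = V D U': decompose the transpose when
// numRows() < numCols() and swap the roles of U and V afterwards.
public final class SvdAnyShape {
  private final Matrix u;
  private final Matrix s;
  private final Matrix v;

  public SvdAnyShape(Matrix a) {
    boolean transposed = a.numRows() < a.numCols();
    SingularValueDecomposition svd =
        new SingularValueDecomposition(transposed ? a.transpose() : a);
    this.s = svd.getS();               // singular values are unchanged
    this.u = transposed ? svd.getV() : svd.getU();
    this.v = transposed ? svd.getU() : svd.getV();
  }

  public Matrix getU() { return u; }
  public Matrix getS() { return s; }
  public Matrix getV() { return v; }
}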

On Wed, Jul 21, 2010 at 1:09 AM, Nicolas Maillot <nm...@gmail.com> wrote:

> Ted, Robin,
>
> Thanks for your remarks. I have taken them into account.
>
> I have added a set of unit tests for the SVD (based on
> org.apache.commons.math.linear.SingularValueDecompositionImplTest):
>
> http://www.nicolas-maillot.net/MahoutContribs/TestSingularValueDecomposition.java
> Note that I had to deactivate some tests due to limitations of the
> CERN implementation:
> -Tests where the input matrix has strictly fewer rows than columns
> -Tests on svd covariance which is not (yet) implemented
>
> Otherwise all the "valid"  tests are running fine.
>
> I have also fixed the parameter management.
>
> Nicolas
>
>
>
>
>
> On Mon, Jul 19, 2010 at 11:08 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Nicolas,
> >
> > This was quick work.  Very nice to see.
> >
> > To finish the migration, we need to add test cases as well.
> >
> > A good source of test cases for the SingularValueDecomposition would
> > be package
> org.apache.commons.math.linear.SingularValueDecompositionImplTest
> >
> > On Mon, Jul 19, 2010 at 1:08 PM, Nicolas Maillot <nm...@gmail.com>
> wrote:
> >
> >> As you can see, I have migrated SingularValueDecomposition (useful to
> >> tackle the inverse covariance problem)  to use the Matrix Class.
> >>
> >
>

Re: Mahalanobis Distance Implementation

Posted by Ted Dunning <te...@gmail.com>.
Nice work, Nicolas.  It is comforting to know that our implementation works
reasonably well.  Jake was wishing for exactly what you have just done the
last time he and I saw each other.

On Wed, Jul 21, 2010 at 1:09 AM, Nicolas Maillot <nm...@gmail.com> wrote:

> I have added a set of unit tests for the SVD (based on
> org.apache.commons.math.linear.SingularValueDecompositionImplTest):
>
> http://www.nicolas-maillot.net/MahoutContribs/TestSingularValueDecomposition.java
> Note that I had to deactivate some tests due to limitations of the
> CERN implementation:
> -Tests where the input matrix has strictly fewer rows than columns
> -Tests on svd covariance which is not (yet) implemented
>

Re: Mahalanobis Distance Implementation

Posted by Nicolas Maillot <nm...@gmail.com>.
Ted, Robin,

Thanks for your remarks. I have taken them into account.

I have added a set of unit tests for the SVD (based on
org.apache.commons.math.linear.SingularValueDecompositionImplTest):
http://www.nicolas-maillot.net/MahoutContribs/TestSingularValueDecomposition.java
Note that I had to deactivate some tests due to limitations of the
CERN implementation:
-Tests where the input matrix has strictly fewer rows than columns
-Tests on svd covariance which is not (yet) implemented

Otherwise all the "valid"  tests are running fine.

I have also fixed the parameter management.

Nicolas





On Mon, Jul 19, 2010 at 11:08 PM, Ted Dunning <te...@gmail.com> wrote:
> Nicolas,
>
> This was quick work.  Very nice to see.
>
> To finish the migration, we need to add test cases as well.
>
> A good source of test cases for the SingularValueDecomposition would
> be package org.apache.commons.math.linear.SingularValueDecompositionImplTest
>
> On Mon, Jul 19, 2010 at 1:08 PM, Nicolas Maillot <nm...@gmail.com> wrote:
>
>> As you can see, I have migrated SingularValueDecomposition (useful to
>> tackle the inverse covariance problem)  to use the Matrix Class.
>>
>

Re: Mahalanobis Distance Implementation

Posted by Ted Dunning <te...@gmail.com>.
Nicolas,

This was quick work.  Very nice to see.

To finish the migration, we need to add test cases as well.

A good source of test cases for the SingularValueDecomposition would
be package org.apache.commons.math.linear.SingularValueDecompositionImplTest

On Mon, Jul 19, 2010 at 1:08 PM, Nicolas Maillot <nm...@gmail.com> wrote:

> As you can see, I have migrated SingularValueDecomposition (useful to
> tackle the inverse covariance problem)  to use the Matrix Class.
>

Re: Mahalanobis Distance Implementation

Posted by Nicolas Maillot <nm...@gmail.com>.
Hi Ted,

Thanks for the piece of advice.
I have put a first implementation of the Mahalanobis distance in
http://www.nicolas-maillot.net/MahoutContribs/
Those files should be put in three different places to be tested:

-in core/src/test/java/org/apache/mahout/common/distance :
http://www.nicolas-maillot.net/MahoutContribs/TestMahalanobisDistanceMeasure.java

- in  core/src/main/java/org/apache/mahout/common/distance :
http://www.nicolas-maillot.net/MahoutContribs/MahalanobisDistanceMeasure.java

-in math/src/main/java/org/apache/mahout/math :
http://www.nicolas-maillot.net/MahoutContribs/SingularValueDecomposition.java
http://www.nicolas-maillot.net/MahoutContribs/Algebra.java

Everything compiles and the unit tests pass locally with the latest
version of Mahout.
I only added new files, which limits risk.

As you can see, I have migrated SingularValueDecomposition (useful to
tackle the inverse covariance problem)  to use the Matrix Class.
I have just started on Algebra too (only what I needed for the
Mahalanobis work); I will continue ASAP.

One question: is putting those new implementations in
src/main/java/org/apache/mahout/math a good way to go?
Migrating the code into new files is easier; once we are done, we will
be able to remove the old implementation.
We just need to find the best place for that.

Any comment or feedback is welcome. I will then submit an official
patch with a few more comments.

Many thanks,

Cheers,

Nicolas






On Sat, Jul 17, 2010 at 11:40 PM, Ted Dunning <te...@gmail.com> wrote:
> On Sat, Jul 17, 2010 at 2:07 PM, Nicolas Maillot <nm...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to contribute to Mahout by starting with a simple  task.
>>
>> I had the idea of implementing the Mahalanobis distance:
>> http://en.wikipedia.org/wiki/Mahalanobis_distance
>> This should be quite easy and could be a useful feature.
>>
>> After looking at the Mahout code for a couple of hours, I have four
>> questions:
>>
>> 1) Generally speaking, what is the level of reliability of algorithms
>> implemented in matrix.linalg ?
>>
>
> They should be reasonably good, but they mostly lack unit tests.  Further,
> they have generally not been converted to use Vector and Matrix which is
> something that is probably necessary over time.  I have done a little bit in
> this line, but we need quite a bit more.
>
> As you find bits that you want to use, you should convert them over.  This
> will be a bit like pulling a thread on a sweater in that it will cause lots
> of additional bits to need conversion.  I will help with this.  As part of
> the conversion, we should add unit tests as well.
>
>
>> 2) As inverting a covariance matrix is required, is using the method
>> inverse(DoubleMatrix2D A) from class Algebra a good way to achieve this ?
>>
>
> Generally it is considered very bad practice to invert a matrix.  Instead,
> you should use some sort of easily solved matrix decomposition.  Commonly,
> QR decomposition is used, but SVD might be useful in certain situations.
>
> As I remember Mahalanobis, it uses moments to estimate principal components
> in the form of the inverse covariance matrix.  Mahalanobis distances are
> then computed in this reference frame.  This should be amenable to QR
> decomposition.
>
>
>> 3) Does it look reasonable if the computation of the distance is based on
>> dense matrices (inside the distance(Vector v1, Vector v2) method of the
>> future MahalanobisDistanceMeasure class)?
>>
>
> For use in recommendations, it should be assumed that the inputs are sparse,
> but internal products might well be dense if they are fixed in size and
> relatively small.  The computation of distance should accept sparse vectors,
> but projecting them down to small dense vectors to do the distance
> computation is just fine.
>
>
>> 4) What is the best way to create a DoubleMatrix1D from a Vector ?
>>
>
> Don't.
>
> Just convert the code using the DoubleMatrix1D as mentioned above.
>

Re: Mahalanobis Distance Implementation

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Jul 17, 2010 at 2:07 PM, Nicolas Maillot <nm...@gmail.com> wrote:

> Hi all,
>
> I would like to contribute to Mahout by starting with a simple  task.
>
> I had the idea of implementing the Mahalanobis distance:
> http://en.wikipedia.org/wiki/Mahalanobis_distance
> This should be quite easy and could be a useful feature.
>
> After looking at the Mahout code for a couple of hours, I have four
> questions:
>
> 1) Generally speaking, what is the level of reliability of algorithms
> implemented in matrix.linalg ?
>

They should be reasonably good, but they mostly lack unit tests.  Further,
they have generally not been converted to use Vector and Matrix which is
something that is probably necessary over time.  I have done a little bit in
this line, but we need quite a bit more.

As you find bits that you want to use, you should convert them over.  This
will be a bit like pulling a thread on a sweater in that it will cause lots
of additional bits to need conversion.  I will help with this.  As part of
the conversion, we should add unit tests as well.


> 2) As inverting a covariance matrix is required, is using the method
> inverse(DoubleMatrix2D A) from class Algebra a good way to achieve this ?
>

Generally it is considered very bad practice to invert a matrix.  Instead,
you should use some sort of easily solved matrix decomposition.  Commonly,
QR decomposition is used, but SVD might be useful in certain situations.

As I remember Mahalanobis, it uses moments to estimate principal components
in the form of the inverse covariance matrix.  Mahalanobis distances are
then computed in this reference frame.  This should be amenable to QR
decomposition.
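
To make the "don't invert" advice concrete, here is a hedged sketch of the
distance computed by solving against the covariance matrix rather than
inverting it. It assumes a Colt/JAMA-style QRDecomposition with a
solve(Matrix) method is available in org.apache.mahout.math alongside the
classes being migrated; any decomposition with a solver would do:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;
import org.apache.mahout.math.Vector;

// d(x, m) = sqrt((x - m)' * inverse(S) * (x - m)), computed without forming
// inverse(S): solve S * y = (x - m), then take the dot product with (x - m).
public final class MahalanobisSketch {
  public static double distance(Vector x, Vector mean, Matrix covariance) {
    Vector delta = x.minus(mean);
    int n = delta.size();
    // solve(...) takes a Matrix right-hand side, so wrap delta as an n x 1 column.
    Matrix rhs = new DenseMatrix(n, 1);
    for (int i = 0; i < n; i++) {
      rhs.set(i, 0, delta.get(i));
    }
    Matrix y = new QRDecomposition(covariance).solve(rhs);
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
      sum += delta.get(i) * y.get(i, 0);
    }
    return Math.sqrt(sum);
  }
}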


> 3) Does it look reasonable if the computation of the distance is based on
> dense matrices (inside the distance(Vector v1, Vector v2) method of the
> future MahalanobisDistanceMeasure class)?
>

For use in recommendations, it should be assumed that the inputs are sparse,
but internal products might well be dense if they are fixed in size and
relatively small.  The computation of distance should accept sparse vectors,
but projecting them down to small dense vectors to do the distance
computation is just fine.


> 4) What is the best way to create a DoubleMatrix1D from a Vector ?
>

Don't.

Just convert the code using the DoubleMatrix1D as mentioned above.