Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/09/02 22:45:32 UTC

Re: Using SVD with Canopy/KMeans

  (cross-posting to dev)

Hi Jake,

I'm on thin ice here, but a few more words on the math details would 
help me sort this out. I've run the DistributedLanczosSolver on the 
small testdata set in TestClusterDumper:

     Path output = getTestTempDirPath("output");
     Path tmp = getTestTempDirPath("tmp");
     Configuration config = new Configuration();
     Path eigenvectors = new Path(output, "eigenvectors");
     config.set("mapred.output.dir", eigenvectors.toString());
     DistributedLanczosSolver solver = new DistributedLanczosSolver();
     solver.setConf(config);
     Path testData = getTestTempDirPath("testdata");
     solver.run(testData, tmp, sampleData.size(),
         sampleData.get(0).get().size(), false, 8);

This produces 7 (not 8?) vectors in the eigenvectors file. If I then 
build DistributedRowMatrices out of these I get matrices that are 
ill-shaped to multiply directly. Clearly a literal translation of your 
text is incorrect:

     // now multiply the testdata matrix and the eigenvector matrix
     DistributedRowMatrix svd = new DistributedRowMatrix(eigenvectors,
         tmp, 8, 38);
     DistributedRowMatrix data = new DistributedRowMatrix(testData,
         tmp, 15, 38);
     DistributedRowMatrix sData = data.times(svd);

     // now run the Canopy job to prime kMeans canopies
     CanopyDriver.runJob(svd.getRowPath(), output, measure, 8, 4,
         false, false);

Reading up on eigendecomposition, it looks like (DATA ~= SVD D SVD') 
would be more like it. But the solver only outputs the eigenvectors and 
ignores the eigenvalues, so I cannot construct D. Can you point me back 
towards the right path? It has been so long since my grad school 
advanced matrices course.
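[Editorial note: the relationship Jeff is after can be sketched numerically. For any matrix C with SVD C = U D V', the eigenvalues of C'C are the squared singular values, so D can be rebuilt as diag(sqrt(lambda_i)) from the eigenvalues the solver computes internally. A toy plain-Python check (not Mahout code; the 2x2 matrix is made up for illustration):]

```python
import math

# Toy example: C = [[1, 2], [2, 1]] has singular values 3 and 1.
# The eigenvalues of C'C should therefore be 9 and 1.
C = [[1.0, 2.0], [2.0, 1.0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

CtC = matmul(transpose(C), C)  # [[5, 4], [4, 5]]

# Eigenvalues of a symmetric 2x2 [[a, b], [b, d]] via the quadratic formula
a, b, d = CtC[0][0], CtC[0][1], CtC[1][1]
disc = math.sqrt((a - d) ** 2 + 4 * b * b)
lam = sorted([(a + d + disc) / 2, (a + d - disc) / 2], reverse=True)

# D = diag(sqrt(lambda_i)) recovers the singular values
singular_values = [math.sqrt(l) for l in lam]
print(singular_values)  # [3.0, 1.0]
```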

Isn't this related to spectral clustering?


On 9/2/10 10:50 AM, Jake Mannix wrote:
> Derek,
>
>    The step Jeff's referring to is that the SVD output is a set of vectors in
> the "column space" of your original set of rows (your input matrix).  If you
> want to cluster your original data, projected onto this new SVD basis, you
> need to matrix multiply your SVD matrix by your original data.  Depending on
> how big your data is (number of rows and columns and rank of the reduction),
> you can do this in either one or two map-reduce passes.
>
>    If you need more detail, I can spell that out a little more directly.  It
> should actually be not just explained in words, but coded into the examples,
> now that I think of it... need. more. hours. in. day....
>
>    -jake
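[Editorial note: Jake's projection step — multiplying the original data by the SVD basis so each row is re-expressed in k dimensions — can be sketched in plain Python (illustrative only, not the Mahout API; V here is hand-picked for a toy matrix):]

```python
import math

s = 1.0 / math.sqrt(2.0)
# Columns of V are right singular vectors of A = [[1, 2], [2, 1]]
V = [[s, s], [s, -s]]
A = [[1.0, 2.0], [2.0, 1.0]]  # original data, one row per "document"

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Each original row, projected onto the SVD basis
sData = matmul(A, V)
for row in sData:
    print([round(x, 4) for x in row])
# [2.1213, -0.7071]
# [2.1213, 0.7071]
```

With a rank-k reduction of an m x n matrix this turns m rows of dimension n into m rows of dimension k, which is what the clustering jobs then consume.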


Re: Using SVD with Canopy/KMeans

Posted by Ted Dunning <te...@gmail.com>.
I think you were translating.  But the last multiply is still redundant, I
think.

On Sat, Sep 11, 2010 at 4:55 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Sep 11, 2010, at 5:50 PM, Ted Dunning wrote:
>
> > Should be close.  The matrixMult step may be redundant if you want to
> > cluster the same data that you decomposed.  That would make the second
> > transpose unnecessary as well.
>
> Hmm, I thought I was just translating what Jeff had done below,
> specifically:
>
> >>> DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >>>   sData.configure(conf);
> >>>
> >>>   // now run the Canopy job to prime kMeans canopies
> >>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
> >> false);
> >>>   // now run the KMeans job
> >>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>
>
>
> >
> > On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <gsingers@apache.org
> >wrote:
> >
> >> To put this in bin/mahout speak, this would look like, munging some
> names
> >> and taking liberties with the actual argument to be passed in:
> >>
> >> bin/mahout svd (original -> svdOut)
> >> bin/mahout cleansvd ...
> >> bin/mahout transpose svdOut -> svdT
> >> bin/mahout transpose original -> originalT
> >> bin/mahout matrixmult originalT svdT -> newMatrix
> >> bin/mahout kmeans newMatrix
> >>
> >> Is that about right?
> >>
> >>
> >> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
> >>
> >>> Ok, the transposed computation seems to work and the cast exception was
> >> caused by my unit test writing LongWritable keys to the testdata file.
> The
> >> following test produces a comparable answer to the non-distributed case.
> I
> >> still want to rename the method to transposeTimes for clarity. And
> better,
> >> implement timesTranspose to make this particular computation more
> efficient:
> >>>
> >>> public void testKmeansDSVD() throws Exception {
> >>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
> >>>   Path output = getTestTempDirPath("output");
> >>>   Path tmp = getTestTempDirPath("tmp");
> >>>   Path eigenvectors = new Path(output, "eigenvectors");
> >>>   int desiredRank = 13;
> >>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
> >>>   Configuration config = new Configuration();
> >>>   solver.setConf(config);
> >>>   Path testData = getTestTempDirPath("testdata");
> >>>   int sampleDimension = sampleData.get(0).get().size();
> >>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
> >> sampleDimension, false, desiredRank);
> >>>
> >>>   // now multiply the testdata matrix and the eigenvector matrix
> >>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
> >> tmp, desiredRank - 1, sampleDimension);
> >>>   JobConf conf = new JobConf(config);
> >>>   svdT.configure(conf);
> >>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
> >> sampleData.size(), sampleDimension);
> >>>   a.configure(conf);
> >>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >>>   sData.configure(conf);
> >>>
> >>>   // now run the Canopy job to prime kMeans canopies
> >>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
> >> false);
> >>>   // now run the KMeans job
> >>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
> >> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
> >>>   // run ClusterDumper
> >>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
> >> "clusters-2"), new Path(output, "clusteredPoints"));
> >>>   clusterDumper.printClusters(termDictionary);
> >>> }
> >>>
> >>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
> >>>> Looking at the single unit test of DMR.times() it seems to be
> >> implementing Matrix expected = inputA.transpose().times(inputB), and not
> >> inputA.times(inputB.transpose()), so the bounds checking is correct as
> >> implemented. But the method still has the wrong name and AFAICT is not
> >> useful for performing this particular computation. Should I use this
> >> instead?
> >>>>
> >>>> DistributedRowMatrix sData =
> >> a.transpose().t[ransposeT]imes(svdT.transpose())
> >>>>
> >>>> ugh! And it still fails with:
> >>>>
> >>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> >> be cast to org.apache.hadoop.io.IntWritable
> >>>>   at
> >>
> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>>>   at
> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
> 7-8
> >>
> >>
>
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
>

Re: Using SVD with Canopy/KMeans

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 11, 2010, at 5:50 PM, Ted Dunning wrote:

> Should be close.  The matrixMult step may be redundant if you want to
> cluster the same data that you decomposed.  That would make the second
> transpose unnecessary as well.

Hmm, I thought I was just translating what Jeff had done below, specifically:

>>> DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>   sData.configure(conf);
>>> 
>>>   // now run the Canopy job to prime kMeans canopies
>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>> false);
>>>   // now run the KMeans job
>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,



> 
> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <gs...@apache.org>wrote:
> 
>> To put this in bin/mahout speak, this would look like, munging some names
>> and taking liberties with the actual argument to be passed in:
>> 
>> bin/mahout svd (original -> svdOut)
>> bin/mahout cleansvd ...
>> bin/mahout transpose svdOut -> svdT
>> bin/mahout transpose original -> originalT
>> bin/mahout matrixmult originalT svdT -> newMatrix
>> bin/mahout kmeans newMatrix
>> 
>> Is that about right?
>> 
>> 
>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>> 
>>> Ok, the transposed computation seems to work and the cast exception was
>> caused by my unit test writing LongWritable keys to the testdata file. The
>> following test produces a comparable answer to the non-distributed case. I
>> still want to rename the method to transposeTimes for clarity. And better,
>> implement timesTranspose to make this particular computation more efficient:
>>> 
>>> public void testKmeansDSVD() throws Exception {
>>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>   Path output = getTestTempDirPath("output");
>>>   Path tmp = getTestTempDirPath("tmp");
>>>   Path eigenvectors = new Path(output, "eigenvectors");
>>>   int desiredRank = 13;
>>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>   Configuration config = new Configuration();
>>>   solver.setConf(config);
>>>   Path testData = getTestTempDirPath("testdata");
>>>   int sampleDimension = sampleData.get(0).get().size();
>>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
>> sampleDimension, false, desiredRank);
>>> 
>>>   // now multiply the testdata matrix and the eigenvector matrix
>>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>> tmp, desiredRank - 1, sampleDimension);
>>>   JobConf conf = new JobConf(config);
>>>   svdT.configure(conf);
>>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>> sampleData.size(), sampleDimension);
>>>   a.configure(conf);
>>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>   sData.configure(conf);
>>> 
>>>   // now run the Canopy job to prime kMeans canopies
>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>> false);
>>>   // now run the KMeans job
>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>   // run ClusterDumper
>>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>> "clusters-2"), new Path(output, "clusteredPoints"));
>>>   clusterDumper.printClusters(termDictionary);
>>> }
>>> 
>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>> Looking at the single unit test of DMR.times() it seems to be
>> implementing Matrix expected = inputA.transpose().times(inputB), and not
>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>> implemented. But the method still has the wrong name and AFAICT is not
>> useful for performing this particular computation. Should I use this
>> instead?
>>>> 
>>>> DistributedRowMatrix sData =
>> a.transpose().t[ransposeT]imes(svdT.transpose())
>>>> 
>>>> ugh! And it still fails with:
>>>> 
>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>> be cast to org.apache.hadoop.io.IntWritable
>>>>   at
>> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>   at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>> 
>> --------------------------
>> Grant Ingersoll
>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> 
>> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8


Re: Using SVD with Canopy/KMeans

Posted by Jake Mannix <ja...@gmail.com>.
Well, I'm not sure we need a "DiagonalMatrix" just yet - currently
DRM.times() only takes other DRMs as input.  If we extended that to, say,
VectorIterable (which DRM, SparseMatrix, and DenseMatrix are all
implementations of), then we could switch based on class type,
serializing the non-DRM matrix instances, sending them out via the
DistributedCache, and using a different Mapper/Reducer set, which would
have an in-memory reconstituted Matrix to work with however was most
efficient.

  -jake
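[Editorial note: the reason a times-diagonal (or more generally times-small-in-memory-matrix) job needs no shuffle is that every row can be scaled independently, so a mapper holding the diagonal in memory suffices. A plain-Python sketch of the per-row operation (not Mahout code; names are illustrative):]

```python
def times_diagonal(rows, diag):
    # Multiplying a row matrix by diag(d) on the right just scales
    # column j of every row by diag[j] - embarrassingly parallel,
    # hence a map-only job with the diagonal in the DistributedCache.
    return [[x * d for x, d in zip(row, diag)] for row in rows]

rows = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
diag = [2.0, 0.5, 1.0]

print(times_diagonal(rows, diag))  # [[2.0, 1.0, 3.0], [8.0, 2.5, 6.0]]
```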

On Sun, Sep 12, 2010 at 12:18 PM, Ted Dunning <te...@gmail.com> wrote:

> Just an alert that this is heading us towards having a DiagonalMatrix
> implementation.
>
> It might be just as well to store the diagonal matrix as a sparse matrix
> and use our existing times method.  As Jake points out, we need some
> specialization for multiplying a DRM by an in-memory matrix.  That would
> give more generality and just as much speed as a specialized diagonal
> implementation.
>
> On Sun, Sep 12, 2010 at 12:12 PM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
> > +1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also
> > timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?
> >
>

Re: Using SVD with Canopy/KMeans

Posted by Ted Dunning <te...@gmail.com>.
Just an alert that this is heading us towards having a DiagonalMatrix
implementation.

It might be just as well to store the diagonal matrix as a sparse matrix and
use our existing times method.  As Jake points out, we need some
specialization for multiplying a DRM by an in-memory matrix.  That would give
more generality and just as much speed as a specialized diagonal
implementation.

On Sun, Sep 12, 2010 at 12:12 PM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

> +1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also
> timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?
>

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  On 9/12/10 12:22 PM, Grant Ingersoll wrote:
> On Sep 12, 2010, at 3:12 PM, Jeff Eastman wrote:
>
>> Wow, thanks Jake, that is really definitive. There are a lot of gears whirring silently under the hood here. So Grant's command line script would need to add a "mahout cleaneigens <...>" line
> I believe Jake is referring to the cleansvd command, which is in my list.
Ah, ok, I wondered what that was doing. Much clearer now. I'm going to 
factor a job() method out of its run() method and call it from my SVD 
cluster dumper tests. I will look into calling it from Lanczos with an 
option too.

Re: Using SVD with Canopy/KMeans

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 12, 2010, at 3:12 PM, Jeff Eastman wrote:

> Wow, thanks Jake, that is really definitive. There's a lot of gears whirring silently under the hood here. So Grant's command line script would need to add a "mahout cleaneigens <...>" line

I believe Jake is referring to the cleansvd command, which is in my list.

> to really get the valid eigenvectors (or use the second approach). And too, that step will yield the eigenvalues so that either decomposition approach may be used. The problem I'm seeing with doing this programmatically is that there is no ready Java method (e.g. job() called by run() in the clustering stuff) which I could use.
> 
> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at least implement a job() method on EVJ.
> 
> +1 on renaming DistributedMatrix.times() to transposeTimes() to avoid confusion.
> 
> +1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?
> 
> On the multiplication correctness, you are right: the unit test compares the results of a.transpose().times(b).
> 
> Thanks for the post,
> Jeff
> 
> 
> On 9/11/10 8:21 PM, Jake Mannix wrote:
>> I really should have chimed in in this thread earlier... sorry folks.
>> 
>>   Regarding the relation between eigenvector decomposition and SVD:  The
>> Lanczos algorithm is specifically for eigenvector decomposition (and hence
>> works on symmetric square matrices), and operates by looking at the set of
>> vectors { v, Av, A(Av), A(A(Av))), ... } and doing some clever recombination
>> on them.  To adapt this to do SVD, you just substitute A = C' * C, for any
>> (non-square) matrix C.  You get a set of eigenvectors of C' * C, which in
>> turn are the right-singular vectors of C (what is called V, if C = U D V').
>>  In particular, if you are representing C as rows of vectors of
>> dimensionality numColumns, then the singular vectors you get out are also of
>> dimensionality numColumns (i.e. they live in the same vector space as your
>> "documents", because they are mixtures of them).
>> 
>>   Regarding DistributedMatrix.times(DistributedMatrix other), yes, it should
>> be called transposeTimes(), but unfortunately, timesTranspose is *not* that
>> efficient, in terms of Map-Reduce passes.  If you can come up with a way to
>> do a single MR pass to compute timesTranspose(), or simply times(), by all
>> means, add it!  But the only matrix multiplication I could do on two large
>> sparse matrices in one pass over the data was what is currently in there.
>> 
>>   Mahout DistributedRowMatrix does *not* currently have the very easy
>> "multiply this matrix by a diagonal matrix", or more generally "multiply by
>> this matrix which is so small as to be able to fit in memory in all of the
>> mappers".  That would be a very helpful addition.
>> 
>>   Regarding correctness of the matrix multiplication: isn't there, right
>> now, a unit test comparing DistributedRowMatrix.times() to
>> DenseMatrix.transpose().times(), and checks that the results are the same
>> for small matrices?  It should be in TestDistributedRowMatrix or
>> something...
>> 
>>   As for outputting eigenvalues (or singular values),
>> DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
>> eigenvalues aren't valid, and also scales all of the eigenvalues down by the
>> max eigenvalue (to avoid floating point overflow), and so the step which
>> spits out the nice correctly scaled (and non-spurious) eigenvector/value
>> pairs is that done by the "cleaneigens" shell script step (c.f.
>> EigenVerificationJob).  This is when EigenVector instances are created.
>> 
>>   It would make sense for EigenVerificationJob to be folded into
>> DistributedLanczosSolver, to reduce confusion.  Maybe add a solver flag on
>> whether you care about the "cleaning" step.
>> 
>>   One more note:  if you really want to get the projection of your original
>> matrix's rows onto the SVD basis (a common desire for dimensionality
>> reduction), then what you want is not V, but U' : U is a set of k vectors,
>> each of which having numRows (i.e. "numDocuments" / "numUsers")
>> dimensionality, but what you want is numRows vectors, each of dimension k
>> (these numRows rows have had their dimensionality reduced from sparse
>> numColumns size down to dense k), which is U'.
>> 
>>   To get U', instead of V, just do DistributedRowMatrix.transpose() on your
>> *original* matrix before feeding it to the DistributedLanczosSolver.  You'll
>> get U, and to get U', just take one more transpose of the
>> DistributedRowMatrix represented by the k-eigenvectors (possibly you want to
>> do the "multiply by D^-1 or D" step here too):
>> 
>>   C = U D V'  -->  C' C = V D U' U D V' = V D^2 V', and
>>   C' = V D U' -->  C C' = U D V' V D U' = U D^2 U'
>> 
>>   Overall, however, this only saves you one map reduce pass, because the
>> original way of getting U' was to: take SVD (k MR passes), then transpose
>> both the output (1 MR pass) and input (1 MR pass), and then multiply them
>> together (1 MR pass) = k+3 MR passes.  The new way is: take transpose (1 MR
>> pass), then take SVD (k MR passes), then transpose the output (1 MR pass) =
>> k + 2 MR passes.
>> 
>>   Of course, it all depends on what you need.  I've often wanted to hang
>> onto the V-vectors, because they are useful for "folding-in" new data into a
>> larger approximate SVD, and they already have the form of "pseudo-documents"
>> and can give you a feel for how you've clustered your input features.
>> 
>>   -jake
>> 
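[Editorial note: the Krylov sequence Jake describes - { v, Av, A(Av), ... } with A = C'C - is easiest to see in its simplest member, power iteration, which keeps only the latest vector and converges to the top eigenvector of C'C, i.e. the top right singular vector of C. This is an illustrative plain-Python sketch, not the Lanczos algorithm itself, which reorthogonalizes the whole sequence to extract many vectors at once:]

```python
import math

C = [[1.0, 2.0], [2.0, 1.0]]
A = [[5.0, 4.0], [4.0, 5.0]]  # A = C'C, computed by hand for this toy C

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Repeatedly apply A and renormalize: the component along the top
# eigenvector (eigenvalue 9) dominates the other (eigenvalue 1).
v = [1.0, 0.0]
for _ in range(50):
    v = normalize(matvec(A, v))

print([round(x, 4) for x in v])  # [0.7071, 0.7071]: top singular vector of C
```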
>> On Sat, Sep 11, 2010 at 6:18 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>> 
>>>  +1 Ok, this makes so much sense now. It had been bothering me that U D
>>> would not have the same number of rows as A in the non-square case - our
>>> typical case. It's interesting that the
>>> DistributedLanczosSolver.serializeOutput() method does not actually
>>> serialize the eigenvalues. I've fixed that locally using NamedVector but
>>> something like WeightedVectorWritable (not itself a Vector) would be better.
>>> There's also EigenVector objects which might be used here.
>>> 
>>> 
>>> 
>>> On 9/11/10 3:57 PM, Ted Dunning wrote:
>>> 
>>>> Well, firstly the code is doing a singular value decomposition.  This is
>>>> similar to an eigenvector decomposition except that there is no assumption
>>>> that A is square.  With the SVD, we have
>>>> 
>>>>      A = U D V'
>>>> 
>>>> This is in contrast to the eigenvector decomposition which has
>>>> 
>>>>     A = U D U'
>>>> 
>>>> The difference lies in having both left and right singular vectors, U and
>>>> V.
>>>> 
>>>> Practically speaking, with the SVD decomposition, we can consider A or any
>>>> other appropriate matrix to be either a pile of row vectors or a bunch of
>>>> column vectors and we can project either into the reduced dimensional
>>>> representation using U' on the left or V on the right.  In Mahout, the SVD
>>>> only gives us one of U or V (I forget which).  Most commonly, I consider A
>>>> to be row vectors.  This means that I can use the fact that V' V = I to do
>>>> this:
>>>> 
>>>>      A V = U D
>>>> 
>>>> Now, if I wanted to do this to some new matrix B, I would need to keep
>>>> V around in order to do the multiplication.  But since I just did the
>>>> SVD of A, all I need is U and D, which is what I think the Mahout
>>>> eigen-solver gives us.  As you say, multiplying by D is trivial because
>>>> it is diagonal.
>>>> 
>>>> Does that make sense?
>>>> 
>>>> This is similar to what you said, but differs in some details.
>>>> 
>>>> Note also that it is common to split D into two factors:
>>>> 
>>>>     A = U sqrt(D) sqrt(D) V'
>>>> 
>>>> and then use this for the decomposition of A
>>>> 
>>>>     A V sqrt(D)^-1
>>>> 
>>>> The multiplication by the inverse of sqrt(D) is trivial since D is
>>>> diagonal.
>>>> 
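[Editor's aside: Ted's identity A V = U D is easy to sanity-check numerically. A minimal plain-Java sketch (hypothetical demo code, not part of Mahout) uses a small symmetric matrix, for which U = V, so A V = V D; multiplying by the diagonal D is just a per-column scaling:]

```java
// Sketch: verify A V = V D for a small symmetric matrix (so U = V).
// A = [[2,1],[1,2]] has eigenvalues 3 and 1 with eigenvectors
// (1,1)/sqrt(2) and (1,-1)/sqrt(2). Plain arrays, no Mahout classes.
public class SvdIdentityDemo {

  // C = A * B for dense row-major matrices
  static double[][] times(double[][] a, double[][] b) {
    int n = a.length, m = b[0].length, k = b.length;
    double[][] c = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        for (int x = 0; x < k; x++)
          c[i][j] += a[i][x] * b[x][j];
    return c;
  }

  // Scale column j of m by d[j]: the trivial "times diagonal" product
  static double[][] timesDiagonal(double[][] m, double[] d) {
    double[][] c = new double[m.length][d.length];
    for (int i = 0; i < m.length; i++)
      for (int j = 0; j < d.length; j++)
        c[i][j] = m[i][j] * d[j];
    return c;
  }

  public static void main(String[] args) {
    double s = 1.0 / Math.sqrt(2.0);
    double[][] a = {{2, 1}, {1, 2}};
    double[][] v = {{s, s}, {s, -s}};   // eigenvector columns
    double[] d = {3, 1};                // matching eigenvalues
    double[][] av = times(a, v);
    double[][] vd = timesDiagonal(v, d);
    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 2; j++)
        if (Math.abs(av[i][j] - vd[i][j]) > 1e-12)
          throw new AssertionError("A V != V D at " + i + "," + j);
    System.out.println("A V = V D verified");
  }
}
```

The same column scaling applied with 1/sqrt(d[j]) gives Ted's A V sqrt(D)^-1 normalization.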
>>>> 
>>>> On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman<jdog@windwardsolutions.com
>>>>> wrote:
>>>>   Ted,
>>>>> Is this because, for any matrix A, its eigenvector matrix P and its
>>>>> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
>>>>> full matrix multiplication since it is diagonal? Could you please
>>>>> elaborate?
>>>>> 
>>>>> Jeff
>>>>> 
>>>>> 
>>>>> 
>>>>> On 9/11/10 2:50 PM, Ted Dunning wrote:
>>>>> 
>>>>>  Should be close.  The matrixMult step may be redundant if you want to
>>>>>> cluster the same data that you decomposed.  That would make the second
>>>>>> transpose unnecessary as well.
>>>>>> 
>>>>>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<gsingers@apache.org
>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> To put this in bin/mahout speak, this would look like, munging some
>>>>>>> names and taking liberties with the actual argument to be passed in:
>>>>>>> 
>>>>>>> bin/mahout svd (original ->    svdOut)
>>>>>>> bin/mahout cleansvd ...
>>>>>>> bin/mahout transpose svdOut ->    svdT
>>>>>>> bin/mahout transpose original ->    originalT
>>>>>>> bin/mahout matrixmult originalT svdT ->    newMatrix
>>>>>>> bin/mahout kmeans newMatrix
>>>>>>> 
>>>>>>> Is that about right?
>>>>>>> 
>>>>>>> 
>>>>>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>>>>> 
>>>>>>> Ok, the transposed computation seems to work and the cast exception
>>>>>>> was caused by my unit test writing LongWritable keys to the testdata
>>>>>>> file. The following test produces a comparable answer to the
>>>>>>> non-distributed case. I still want to rename the method to
>>>>>>> transposeTimes for clarity. And better, implement timesTranspose to
>>>>>>> make this particular computation more efficient:
>>>>>>> 
>>>>>>>>   public void testKmeansDSVD() throws Exception {
>>>>>>>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>>>>>     Path output = getTestTempDirPath("output");
>>>>>>>>     Path tmp = getTestTempDirPath("tmp");
>>>>>>>>     Path eigenvectors = new Path(output, "eigenvectors");
>>>>>>>>     int desiredRank = 13;
>>>>>>>>     DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>>>>>     Configuration config = new Configuration();
>>>>>>>>     solver.setConf(config);
>>>>>>>>     Path testData = getTestTempDirPath("testdata");
>>>>>>>>     int sampleDimension = sampleData.get(0).get().size();
>>>>>>>>     solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>>>>>         sampleDimension, false, desiredRank);
>>>>>>>>     // now multiply the testdata matrix and the eigenvector matrix
>>>>>>>>     DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>>>>>         tmp, desiredRank - 1, sampleDimension);
>>>>>>>>     JobConf conf = new JobConf(config);
>>>>>>>>     svdT.configure(conf);
>>>>>>>>     DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>>>>>         sampleData.size(), sampleDimension);
>>>>>>>>     a.configure(conf);
>>>>>>>>     DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>>>>>     sData.configure(conf);
>>>>>>>> 
>>>>>>>>     // now run the Canopy job to prime kMeans canopies
>>>>>>>>     CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4,
>>>>>>>>         false, false);
>>>>>>>>     // now run the KMeans job
>>>>>>>>     KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>>>>>         "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>>>>>>     // run ClusterDumper
>>>>>>>>     ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>>>>>         "clusters-2"), new Path(output, "clusteredPoints"));
>>>>>>>>     clusterDumper.printClusters(termDictionary);
>>>>>>>>   }
>>>>>>>> 
>>>>>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>>>>> 
>>>>>>>> Looking at the single unit test of DMR.times() it seems to be
>>>>>>>> implementing Matrix expected = inputA.transpose().times(inputB), and
>>>>>>>> not inputA.times(inputB.transpose()), so the bounds checking is
>>>>>>>> correct as implemented. But the method still has the wrong name and
>>>>>>>> AFAICT is not useful for performing this particular computation.
>>>>>>>> Should I use this instead?
>>>>>>>> 
>>>>>>>> DistributedRowMatrix sData =
>>>>>>>> a.transpose().t[ransposeT]imes(svdT.transpose())
>>>>>>>> 
>>>>>>>> ugh! And it still fails with:
>>>>>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable
>>>>>>>>> cannot be cast to org.apache.hadoop.io.IntWritable
>>>>>>>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>>>>>> 7-8
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8


Re: Using SVD with Canopy/KMeans

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Sep 12, 2010 at 2:19 PM, Jake Mannix <ja...@gmail.com> wrote:

> > +1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also
> > timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?
> >
>
> Like I replied to Ted, I think we can have a transparent api and just check
> whether the matrix we've been passed is able to be passed around
> via-side-channel serialization to the M/R nodes.


This sounds right to me.
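[Editor's aside: the proposed timesDiagonal(Vector) does not exist in Mahout at this point; as a sketch of its semantics (a hypothetical API), the map-side work reduces to scaling each row entrywise by the in-memory diagonal that was shipped to every node, so no shuffle or reduce phase is needed. Plain arrays stand in for Mahout's Vector type here:]

```java
// Sketch (hypothetical): the per-row work a timesDiagonal(Vector) job
// would do in each mapper. The diagonal d is small enough to broadcast
// via a side channel; each input row is scaled entrywise and re-emitted.
import java.util.Arrays;

public class TimesDiagonalSketch {

  // One mapper call: out[j] = row[j] * d[j]
  static double[] scaleRow(double[] row, double[] d) {
    if (row.length != d.length)
      throw new IllegalArgumentException("cardinality mismatch");
    double[] out = new double[row.length];
    for (int j = 0; j < row.length; j++)
      out[j] = row[j] * d[j];
    return out;
  }

  public static void main(String[] args) {
    double[] d = {3.0, 1.0, 0.5};             // diagonal of D
    double[][] rows = {{1, 2, 4}, {0, 1, 2}}; // rows of U, say
    for (double[] row : rows)
      System.out.println(Arrays.toString(scaleRow(row, d)));
    // prints [3.0, 2.0, 2.0] then [0.0, 1.0, 1.0]
  }
}
```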

Re: Using SVD with Canopy/KMeans

Posted by Grant Ingersoll <gs...@apache.org>.
Can you update the wiki?

-Grant

On Sep 14, 2010, at 1:44 PM, Jeff Eastman wrote:

> Here's the new set of mahout svd arguments. Entries --cleansvd, --maxError, --minEigenvalue and --inMemory have been added in r997007. See the new tests in TestDistributedLanczosSolverCLI for examples of both forms:
> 
>  --input (-i) input                      Path to job input directory.
>  --output (-o) output                    The directory pathname for output.
>  --numRows (-nr) numRows                 Number of rows of the input matrix
>  --numCols (-nc) numCols                 Number of columns of the input matrix
>  --rank (-r) rank                        Desired decomposition rank (note:
>                                          only roughly 1/4 to 1/3 of these will
>                                          have the top portion of the spectrum)
>  --symmetric (-sym) symmetric            Is the input matrix square and
>                                          symmetric?
>  --cleansvd (-cl) cleansvd               Run the EigenVerificationJob to clean
>                                          the eigenvectors after SVD
>  --maxError (-err) maxError              Maximum acceptable error
>  --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep the vector
>                                          for
>  --inMemory (-mem) inMemory              Buffer eigen matrix into memory (if
>                                          you have enough!)
>  --help (-h)                             Print out help
>  --tempDir tempDir                       Intermediate output directory
>  --startPhase startPhase                 First phase to run
>  --endPhase endPhase                     Last phase to run
> 
> On 9/14/10 6:55 AM, Jake Mannix wrote:
>> I guess the main thing I'd want to happen in combining EVJ and DLS is to
>> make sure that the final output (changing the semantics of the CLI param is
>> ok) is clear, with it either being the output of EVJ (if that is used), or
>> DLS (if EVJ is not used).  If that can be done, go for it!
>> 
>>   -jake
>> 
>> On Tue, Sep 14, 2010 at 6:30 AM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>> 
>>>  Jake, I see you are on line. I'm inclined to push forward on this despite
>>> the adjustments to DLS --output semantics. Agreed?
>>> 
>>> 
>>> On 9/13/10 10:34 AM, Jeff Eastman wrote:
>>> 
>>>>  r996599 completed the first part. Several additional arguments to EVJ.run
>>>> need to be added to DLS (maxError, minEigenValue, inMemory, also the
>>>> --cleansvn flag itself). Also DLS interprets --output as the
>>>> outputEigenVectorPath and not as the generic output directory so DLS.run()
>>>> will need another argument too. Still want to do this?
>>>> 
>>>> On 9/12/10 2:19 PM, Jake Mannix wrote:
>>>> 
>>>>>> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or,
>>>>>> at least implement a job() method on EVJ.
>>>>> 
>>>>> +1 for having the latter, with a boolean flag in DLS to optionally call
>>>>> EVJ after it's done.
>>>>> 
>>>> 
> 

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8


Re: Using SVD with Canopy/KMeans

Posted by Derek O'Callaghan <de...@ucd.ie>.
Just rereading what I wrote, I think the question at the end should be:

Is there an issue given that one eigenvector is consistently generated 
which doesn't satisfy "Math.abs(1 - entry.getValue().getCosAngle()) < 
maxError" in EigenVerificationJob.pruneEigens() during the clean step, 
leaving you with desiredRank-2 eigenvectors?

Thanks,

Derek
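[Editor's aside: the acceptance test Derek quotes compares each candidate vector v against its image A v: for a true eigenvector the two are parallel, so the cosine of the angle between them is 1. A standalone plain-Java sketch of that condition follows; it illustrates the check, it is not the EigenVerificationJob code itself:]

```java
// Sketch of the quoted acceptance test: a candidate v is kept only if
// A v is (nearly) parallel to v, i.e. Math.abs(1 - cosAngle) < maxError.
public class EigenCheckSketch {

  static double dot(double[] x, double[] y) {
    double s = 0;
    for (int i = 0; i < x.length; i++) s += x[i] * y[i];
    return s;
  }

  static double[] times(double[][] a, double[] v) {
    double[] r = new double[a.length];
    for (int i = 0; i < a.length; i++) r[i] = dot(a[i], v);
    return r;
  }

  // cosine of the angle between A v and v
  static double cosAngle(double[][] a, double[] v) {
    double[] av = times(a, v);
    return dot(av, v) / (Math.sqrt(dot(av, av)) * Math.sqrt(dot(v, v)));
  }

  static boolean passes(double[][] a, double[] v, double maxError) {
    return Math.abs(1 - cosAngle(a, v)) < maxError;
  }

  public static void main(String[] args) {
    double[][] a = {{2, 1}, {1, 2}};
    double[] eigen = {1, 1};      // true eigenvector: A v = 3 v
    double[] notEigen = {1, 0};   // not an eigenvector of a
    System.out.println(passes(a, eigen, 0.05));    // prints true
    System.out.println(passes(a, notEigen, 0.05)); // prints false
  }
}
```

A spurious last Lanczos vector fails this check for any tight maxError, which is consistent with one vector being dropped during the clean step.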

On 20/09/10 18:41, Derek O'Callaghan wrote:
> Hi Jeff, Jake,
>
> I think there may still be an issue with the clean step, even allowing 
> for desiredRank-1 eigenvectors being created. If we run Jeff's 
> TestClusterDumper.testKmeansSVD() as is:
>
> - desiredRank = 15
> - desiredRank -1 (14) raw eigenvectors are created
> - After the clean step, the last (14th) clean eigenvector always 
> contains 0s. This means that we're left with 13 eigenvectors 
> containing non-zero values.
>
> If I change the following:
>
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension, 
> false, desiredRank, 0.5, 0.0, true);
> Path cleanEigenvectors = new Path(output, 
> EigenVerificationJob.CLEAN_EIGENVECTORS);
>
> to:
>
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension, 
> false, desiredRank);
> Path cleanEigenvectors = new Path(output, 
> DistributedLanczosSolver.RAW_EIGENVECTORS);
>
> with desiredRank = 14, I get  14 eigenvectors with non-zero values. 
> This is what I expect, allowing for the fact that desiredRank - 1 
> eigenvectors are returned.
>
> It seems that after generating the raw, and then the clean 
> eigenvectors, I consistently get desiredRank-2 eigenvectors with 
> non-zero values, with the desiredRank-1 eigenvector having all zero 
> values after the clean step. E.g. if I change desiredRank to 10, I'll 
> get 8 eigenvectors with non-zero values after running the raw+clean, 
> with eigenvector 9 containing all zeros, whereas I get 9 non-zero 
> eigenvectors if I just run raw on its own. If you inspect 'p' at the 
> "Matrix sData = a.times(p);" line, you'll see this.
>
> I get this result both with Jeff's test data and my own. Stepping 
> through the code, it seems to always find one vector in the for loop 
> of EigenVerificationJob.pruneEigens() for which "Math.abs(1 - 
> entry.getValue().getCosAngle()) < maxError" is false, and so it isn't 
> added to the prunedEigenMeta list.
>
> Is this expected behaviour? Or, is there an issue given that one 
> eigenvector always contains 0s after the clean, leaving you with 
> desiredRank-2 eigenvectors? Apologies if there's no issue here, and 
> it's just a lack of understanding on my part.
>
> Thanks,
>
> Derek
>
> On 20/09/10 16:21, Jake Mannix wrote:
>> That last "eigenvector" is, for reasons not entirely clear even to 
>> me, *not*
>> an eigenvector, as the output of EigenVerificationJob will show you 
>> if you
>> remove that "-1".
>>
>> The most sensible patch is to take the user's "desiredRank" and add 
>> one to
>> it, and leave the code otherwise unchanged.
>>
>>    -jake
>>
>> On Mon, Sep 20, 2010 at 7:22 AM, Jeff 
>> Eastman<jd...@windwardsolutions.com>wrote:
>>
>>>   Hi Derek,
>>>
>>> I think this is caused by the fact that the SVD output seems to emit 
>>> only
>>> desiredRank-1 eigenvectors in the rawEigenvectors directory. When 
>>> that is
>>> transposed it would yield a p matrix with zero entries in the last 
>>> column
>>> that you have observed. The code that's doing this is in
>>> DistributedLanczosSolver.serializeOutput() and the line responsible is:
>>>
>>>     for (int i = 0; i<  eigenVectors.numRows() - 1; i++) {
>>>
>>> I thought that curious but don't understand Lanczos well enough yet 
>>> to be
>>> too critical. Perhaps you could try removing the -1 and see if it 
>>> improves
>>> your results.
>>>
>>>
>>>
>>> On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> I've been trying out the latest version of the svd code in
>>>> TestClusterDumper this week (actually I'm using my modified version 
>>>> of it as
>>>> I mentioned in my original post at the start of the thread, with 
>>>> your latest
>>>> changes). I suspect there's a problem with the EigenVerificationJob 
>>>> called
>>>> from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), 
>>>> using:
>>>>
>>>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>>> false, desiredRank, 0.5, 0.0, true);
>>>>
>>>> The generated 'p' matrix (read from the clean eigenvectors file) will
>>>> always have the value 0 for the (desiredRank - 1) column in each 
>>>> row. E.g.,
>>>> here's the first row:
>>>>
>>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>>> -0.0025483366872868546, 0.0]
>>>>
>>>> This then means that the sData matrix will have 0s in this column
>>>> following multiplication. However, when I change testKmeansSVD() to 
>>>> run the
>>>> solver without the clean step, and load the raw eigenvectors into 
>>>> 'p' i.e.
>>>> .
>>>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>>> false, desiredRank);
>>>>
>>>> 'p' now has values other than 0 in the last column, e.g. here's the 
>>>> first
>>>> row:
>>>>
>>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>>> -0.0025483366872868546, -0.04870849090364153]
>>>>
>>>> I'm guessing there's a problem with the clean step here, or is this 
>>>> normal
>>>> behaviour?
>>>>
>>>> FYI I noticed the problem when running the solver + clean on my own 
>>>> data,
>>>> and then running the Dirichlet clusterer on the reduced data. I 
>>>> found that
>>>> after a couple of iterations, things started to go wrong with 
>>>> Dirichlet as
>>>> the following code in UncommonDistribution.rMultinom() was being 
>>>> called:
>>>>
>>>>      // can't happen except for round-off error so we don't care 
>>>> what we
>>>> return here
>>>>      return 0;
>>>>
>>>> I suspect this might be associated with the fact that the last 
>>>> column in
>>>> my reduced data matrix is 0, although I haven't confirmed it yet.
>>>>
>>>> Thanks,
>>>>
>>>> Derek
>>>>
>>>
>>
>

Re: Using SVD with Canopy/KMeans

Posted by Derek O'Callaghan <de...@ucd.ie>.
Hi Jake,

That sounds good, I might get a bit more time tomorrow to look at it 
further myself, I'll let you know if I find anything.

Thanks,

Derek

On 20/09/10 18:48, Jake Mannix wrote:
> Hey Derek,
>
>    I'll have to look into it in more detail.  So far, I've gone with "if the
> eigenvectors pass the EigenVerifier's test, I'm down with them" style of
> black-box thinking (I mean, who am I to say what they should be like: if
> they satisfy A.v = a * v to some high degree of accuracy, they're an
> eigenvector!), but this does look pretty fishy.
>
>    -jake
>
> On Mon, Sep 20, 2010 at 10:41 AM, Derek O'Callaghan<derek.ocallaghan@ucd.ie
>    
>> wrote:
>> [...]

Re: Using SVD with Canopy/KMeans

Posted by Jake Mannix <ja...@gmail.com>.
Hey Derek,

  I'll have to look into it in more detail.  So far, I've gone with "if the
eigenvectors pass the EigenVerifier's test, I'm down with them" style of
black-box thinking (I mean, who am I to say what they should be like: if
they satisfy A.v = a * v to some high degree of accuracy, they're an
eigenvector!), but this does look pretty fishy.

  -jake

On Mon, Sep 20, 2010 at 10:41 AM, Derek O'Callaghan <derek.ocallaghan@ucd.ie
> wrote:

> [...]

Re: Using SVD with Canopy/KMeans

Posted by Derek O'Callaghan <de...@ucd.ie>.
Hi Jeff, Jake,

I think there may still be an issue with the clean step, even allowing 
for desiredRank-1 eigenvectors being created. If we run Jeff's 
TestClusterDumper.testKmeansSVD() as is:

- desiredRank = 15
- desiredRank -1 (14) raw eigenvectors are created
- After the clean step, the last (14th) clean eigenvector always 
contains 0s. This means that we're left with 13 eigenvectors containing 
non-zero values.

If I change the following:

solver.run(testData, output, tmp, sampleData.size(), sampleDimension, 
false, desiredRank, 0.5, 0.0, true);
Path cleanEigenvectors = new Path(output, 
EigenVerificationJob.CLEAN_EIGENVECTORS);

to:

solver.run(testData, output, tmp, sampleData.size(), sampleDimension, 
false, desiredRank);
Path cleanEigenvectors = new Path(output, 
DistributedLanczosSolver.RAW_EIGENVECTORS);

with desiredRank = 14, I get  14 eigenvectors with non-zero values. This 
is what I expect, allowing for the fact that desiredRank - 1 
eigenvectors are returned.

It seems that after generating the raw, and then the clean eigenvectors, 
I consistently get desiredRank-2 eigenvectors with non-zero values, with 
the desiredRank-1 eigenvector having all zero values after the clean 
step. E.g. if I change desiredRank to 10, I'll get 8 eigenvectors with 
non-zero values after running the raw+clean, with eigenvector 9 
containing all zeros, whereas I get 9 non-zero eigenvectors if I just 
run raw on its own. If you inspect 'p' at the "Matrix sData = 
a.times(p);" line, you'll see this.

I get this result both with Jeff's test data and my own. Stepping 
through the code, it seems to always find one vector in the for loop of 
EigenVerificationJob.pruneEigens() for which "Math.abs(1 - 
entry.getValue().getCosAngle()) < maxError" is false, and so it isn't 
added to the prunedEigenMeta list.

Is this expected behaviour? Or, is there an issue given that one 
eigenvector always contains 0s after the clean, leaving you with 
desiredRank-2 eigenvectors? Apologies if there's no issue here, and it's 
just a lack of understanding on my part.

Thanks,

Derek

On 20/09/10 16:21, Jake Mannix wrote:
> That last "eigenvector" is, for reasons not entirely clear even to me, *not*
> an eigenvector, as the output of EigenVerificationJob will show you if you
> remove that "-1".
>
> The most sensible patch is to take the user's "desiredRank" and add one to
> it, and leave the code otherwise unchanged.
>
>    -jake
>
> On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>    
>>   Hi Derek,
>>
>> I think this is caused by the fact that the SVD output seems to emit only
>> desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is
>> transposed it would yield a p matrix with zero entries in the last column
>> that you have observed. The code that's doing this is in
>> DistributedLanczosSolver.serializeOutput() and the line responsible is:
>>
>>     for (int i = 0; i < eigenVectors.numRows() - 1; i++) {
>>
>> I thought that curious but don't understand Lanczos well enough yet to be
>> too critical. Perhaps you could try removing the -1 and see if it improves
>> your results.
>>
>>
>>
>> On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
>>
>>      
>>> Hi Jeff,
>>>
>>> I've been trying out the latest version of the svd code in
>>> TestClusterDumper this week (actually I'm using my modified version of it as
>>> I mentioned in my original post at the start of the thread, with your latest
>>> changes). I suspect there's a problem with the EigenVerificationJob called
>>> from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:
>>>
>>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>> false, desiredRank, 0.5, 0.0, true);
>>>
>>> The generated 'p' matrix (read from the clean eigenvectors file) will
>>> always have the value 0 for the (desiredRank - 1) column in each row. E.g.,
>>> here's the first row:
>>>
>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>> -0.0025483366872868546, 0.0]
>>>
>>> This then means that the sData matrix will have 0s in this column
>>> following multiplication. However, when I change testKmeansSVD() to run the
>>> solver without the clean step, and load the raw eigenvectors into 'p', i.e.:
>>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>>> false, desiredRank);
>>>
>>> 'p' now has values other than 0 in the last column, e.g. here's the first
>>> row:
>>>
>>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>>> -0.0025483366872868546, -0.04870849090364153]
>>>
>>> I'm guessing there's a problem with the clean step here, or is this normal
>>> behaviour?
>>>
>>> FYI I noticed the problem when running the solver + clean on my own data,
>>> and then running the Dirichlet clusterer on the reduced data. I found that
>>> after a couple of iterations, things started to go wrong with Dirichlet as
>>> the following code in UncommonDistribution.rMultinom() was being called:
>>>
>>>      // can't happen except for round-off error so we don't care what we
>>> return here
>>>      return 0;
>>>
>>> I suspect this might be associated with the fact that the last column in
>>> my reduced data matrix is 0, although I haven't confirmed it yet.
>>>
>>> Thanks,
>>>
>>> Derek
>>>
>>>        
>>
>>      
>
>    

Re: Using SVD with Canopy/KMeans

Posted by Jake Mannix <ja...@gmail.com>.
That last "eigenvector" is, for reasons not entirely clear even to me, *not*
an eigenvector, as the output of EigenVerificationJob will show you if you
remove that "-1".

The most sensible patch is to take the user's "desiredRank" and add one to
it, and leave the code otherwise unchanged.

  -jake

On Mon, Sep 20, 2010 at 7:22 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

>  Hi Derek,
>
> I think this is caused by the fact that the SVD output seems to emit only
> desiredRank-1 eigenvectors in the rawEigenvectors directory. When that is
> transposed it would yield a p matrix with zero entries in the last column
> that you have observed. The code that's doing this is in
> DistributedLanczosSolver.serializeOutput() and the line responsible is:
>
>    for (int i = 0; i < eigenVectors.numRows() - 1; i++) {
>
> I thought that curious but don't understand Lanczos well enough yet to be
> too critical. Perhaps you could try removing the -1 and see if it improves
> your results.
>
>
>
> On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
>
>> Hi Jeff,
>>
>> I've been trying out the latest version of the svd code in
>> TestClusterDumper this week (actually I'm using my modified version of it as
>> I mentioned in my original post at the start of the thread, with your latest
>> changes). I suspect there's a problem with the EigenVerificationJob called
>> from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:
>>
>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>> false, desiredRank, 0.5, 0.0, true);
>>
>> The generated 'p' matrix (read from the clean eigenvectors file) will
>> always have the value 0 for the (desiredRank - 1) column in each row. E.g.,
>> here's the first row:
>>
>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>> -0.0025483366872868546, 0.0]
>>
>> This then means that the sData matrix will have 0s in this column
>> following multiplication. However, when I change testKmeansSVD() to run the
>> solver without the clean step, and load the raw eigenvectors into 'p', i.e.:
>> solver.run(testData, output, tmp, sampleData.size(), sampleDimension,
>> false, desiredRank);
>>
>> 'p' now has values other than 0 in the last column, e.g. here's the first
>> row:
>>
>> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932,
>> 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4,
>> -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5,
>> -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4,
>> -0.0025483366872868546, -0.04870849090364153]
>>
>> I'm guessing there's a problem with the clean step here, or is this normal
>> behaviour?
>>
>> FYI I noticed the problem when running the solver + clean on my own data,
>> and then running the Dirichlet clusterer on the reduced data. I found that
>> after a couple of iterations, things started to go wrong with Dirichlet as
>> the following code in UncommonDistribution.rMultinom() was being called:
>>
>>     // can't happen except for round-off error so we don't care what we
>> return here
>>     return 0;
>>
>> I suspect this might be associated with the fact that the last column in
>> my reduced data matrix is 0, although I haven't confirmed it yet.
>>
>> Thanks,
>>
>> Derek
>>
>
>


-- 
  -jake

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hi Derek,

I think this is caused by the fact that the SVD output seems to emit 
only desiredRank-1 eigenvectors in the rawEigenvectors directory. When 
that is transposed it would yield a p matrix with zero entries in the 
last column that you have observed. The code that's doing this is in 
DistributedLanczosSolver.serializeOutput() and the line responsible is:

     for (int i = 0; i < eigenVectors.numRows() - 1; i++) {

I thought that curious but don't understand Lanczos well enough yet to 
be too critical. Perhaps you could try removing the -1 and see if it 
improves your results.


On 9/18/10 9:58 AM, Derek O'Callaghan wrote:
> Hi Jeff,
>
> I've been trying out the latest version of the svd code in TestClusterDumper this week (actually I'm using my modified version of it as I mentioned in my original post at the start of the thread, with your latest changes). I suspect there's a problem with the EigenVerificationJob called from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:
>
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank, 0.5, 0.0, true);
>
> The generated 'p' matrix (read from the clean eigenvectors file) will always have the value 0 for the (desiredRank - 1) column in each row. E.g., here's the first row:
>
> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, 0.0]
>
> This then means that the sData matrix will have 0s in this column following multiplication. However, when I change testKmeansSVD() to run the solver without the clean step, and load the raw eigenvectors into 'p', i.e.:
> solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank);
>
> 'p' now has values other than 0 in the last column, e.g. here's the first row:
>
> [-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, -0.04870849090364153]
>
> I'm guessing there's a problem with the clean step here, or is this normal behaviour?
>
> FYI I noticed the problem when running the solver + clean on my own data, and then running the Dirichlet clusterer on the reduced data. I found that after a couple of iterations, things started to go wrong with Dirichlet as the following code in UncommonDistribution.rMultinom() was being called:
>
>      // can't happen except for round-off error so we don't care what we return here
>      return 0;
>
> I suspect this might be associated with the fact that the last column in my reduced data matrix is 0, although I haven't confirmed it yet.
>
> Thanks,
>
> Derek


Re: Using SVD with Canopy/KMeans

Posted by Derek O'Callaghan <de...@ucd.ie>.
Hi Jeff,

I've been trying out the latest version of the svd code in TestClusterDumper this week (actually I'm using my modified version of it as I mentioned in my original post at the start of the thread, with your latest changes). I suspect there's a problem with the EigenVerificationJob called from the svd solver. Looking at TestClusterDumper.testKmeansSVD(), using:

solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank, 0.5, 0.0, true);

The generated 'p' matrix (read from the clean eigenvectors file) will always have the value 0 for the (desiredRank - 1) column in each row. E.g., here's the first row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, 0.0]

This then means that the sData matrix will have 0s in this column following multiplication. However, when I change testKmeansSVD() to run the solver without the clean step, and load the raw eigenvectors into 'p', i.e.:
solver.run(testData, output, tmp, sampleData.size(), sampleDimension, false, desiredRank);

'p' now has values other than 0 in the last column, e.g. here's the first row:

[-0.02236546375417089, 0.0051677900486854144, -0.00498439866649932, 0.0018666209551644673, 0.4313115409222268, 7.672659010256923E-4, -2.295620562705387E-4, -0.0012505553313125165, 9.679192928269636E-5, -4.529759471821197E-4, 0.01162786445974299, 2.1573486863433563E-4, -0.0025483366872868546, -0.04870849090364153]

I'm guessing there's a problem with the clean step here, or is this normal behaviour?

FYI I noticed the problem when running the solver + clean on my own data, and then running the Dirichlet clusterer on the reduced data. I found that after a couple of iterations, things started to go wrong with Dirichlet as the following code in UncommonDistribution.rMultinom() was being called:

    // can't happen except for round-off error so we don't care what we return here
    return 0;

I suspect this might be associated with the fact that the last column in my reduced data matrix is 0, although I haven't confirmed it yet.

Thanks,

Derek

----- Original Message -----
From: Jeff Eastman <jd...@windwardsolutions.com>
Date: Tuesday, September 14, 2010 6:45 pm
Subject: Re: Using SVD with Canopy/KMeans
To: dev@mahout.apache.org

>  Here's the new set of mahout svd arguments. Entries --cleansvd,
> --maxError, --minEigenvalue and --inMemory have been added in r997007.
> See the new tests in TestDistributedLanczosSolverCLI for examples of
> both forms:
> 
>   --input (-i) input                      Path to job input directory.
>   --output (-o) output                    The directory pathname for output.
>   --numRows (-nr) numRows                 Number of rows of the input matrix
>   --numCols (-nc) numCols                 Number of columns of the input matrix
>   --rank (-r) rank                        Desired decomposition rank (note:
>                                           only roughly 1/4 to 1/3 of these will
>                                           have the top portion of the spectrum)
>   --symmetric (-sym) symmetric            Is the input matrix square and
>                                           symmetric?
>   --cleansvd (-cl) cleansvd               Run the EigenVerificationJob to clean
>                                           the eigenvectors after SVD
>   --maxError (-err) maxError              Maximum acceptable error
>   --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep the vector
>                                           for
>   --inMemory (-mem) inMemory              Buffer eigen matrix into memory (if
>                                           you have enough!)
>   --help (-h)                             Print out help
>   --tempDir tempDir                       Intermediate output directory
>   --startPhase startPhase                 First phase to run
>   --endPhase endPhase                     Last phase to run
> 
> On 9/14/10 6:55 AM, Jake Mannix wrote:
> > I guess the main thing I'd want to happen in combining EVJ and DLS is to
> > make sure that the final output (changing the semantics of the CLI param is
> > ok) is clear, with it either being the output of EVJ (if that is used), or
> > DLS (if EVJ is not used).  If that can be done, go for it!
> >
> >   -jake
> >
> > On Tue, Sep 14, 2010 at 6:30 AM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
> >
> >>  Jake, I see you are on line. I'm inclined to push forward on this despite
> >> the adjustments to DLS --output semantics. Agreed?
> >>
> >> On 9/13/10 10:34 AM, Jeff Eastman wrote:
> >>
> >>>  r996599 completed the first part. Several additional arguments to EVJ.run
> >>> need to be added to DLS (maxError, minEigenValue, inMemory, also the
> >>> --cleansvd flag itself). Also DLS interprets --output as the
> >>> outputEigenVectorPath and not as the generic output directory so DLS.run()
> >>> will need another argument too. Still want to do this?
> >>>
> >>> On 9/12/10 2:19 PM, Jake Mannix wrote:
> >>>
> >>>> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
> >>>>> least implement a job() method on EVJ.
> >>>>>
> >>>>> +1 for having the latter, with a boolean flag in DLS to optionally call
> >>>> EVJ
> >>>> after it's done.
> >>>
>

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Here's the new set of mahout svd arguments. Entries --cleansvd, 
--maxError, --minEigenvalue and --inMemory have been added in r997007. 
See the new tests in TestDistributedLanczosSolverCLI for examples of 
both forms:

   --input (-i) input                      Path to job input directory.
   --output (-o) output                    The directory pathname for 
output.
   --numRows (-nr) numRows                 Number of rows of the input 
matrix
   --numCols (-nc) numCols                 Number of columns of the 
input matrix
   --rank (-r) rank                        Desired decomposition rank 
(note:
                                           only roughly 1/4 to 1/3 of 
these will
                                           have the top portion of the 
spectrum)
   --symmetric (-sym) symmetric            Is the input matrix square and
                                           symmetric?
   --cleansvd (-cl) cleansvd               Run the EigenVerificationJob 
to clean
                                           the eigenvectors after SVD
   --maxError (-err) maxError              Maximum acceptable error
   --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep 
the vector
                                           for
   --inMemory (-mem) inMemory              Buffer eigen matrix into 
memory (if
                                           you have enough!)
   --help (-h)                             Print out help
   --tempDir tempDir                       Intermediate output directory
   --startPhase startPhase                 First phase to run
   --endPhase endPhase                     Last phase to run

On 9/14/10 6:55 AM, Jake Mannix wrote:
> I guess the main thing I'd want to happen in combining EVJ and DLS is to
> make sure that the final output (changing the semantics of the CLI param is
> ok) is clear, with it either being the output of EVJ (if that is used), or
> DLS (if EVJ is not used).  If that can be done, go for it!
>
>    -jake
>
> On Tue, Sep 14, 2010 at 6:30 AM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>>   Jake, I see you are on line. I'm inclined to push forward on this despite
>> the adjustments to DLS --output semantics. Agreed?
>>
>>
>> On 9/13/10 10:34 AM, Jeff Eastman wrote:
>>
>>>   r996599 completed the first part. Several additional arguments to EVJ.run
>>> need to be added to DLS (maxError, minEigenValue, inMemory, also the
>>> --cleansvd flag itself). Also DLS interprets --output as the
>>> outputEigenVectorPath and not as the generic output directory so DLS.run()
>>> will need another argument too. Still want to do this?
>>>
>>> On 9/12/10 2:19 PM, Jake Mannix wrote:
>>>
>>>> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
>>>>> least implement a job() method on EVJ.
>>>>>
>>>>>   +1 for having the latter, with a boolean flag in DLS to optionally call
>>>> EVJ
>>>> after it's done.
>>>>
>>>


Re: Using SVD with Canopy/KMeans

Posted by Jake Mannix <ja...@gmail.com>.
I guess the main thing I'd want to happen in combining EVJ and DLS is to
make sure that the final output (changing the semantics of the CLI param is
ok) is clear, with it either being the output of EVJ (if that is used), or
DLS (if EVJ is not used).  If that can be done, go for it!

  -jake

On Tue, Sep 14, 2010 at 6:30 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

>  Jake, I see you are on line. I'm inclined to push forward on this despite
> the adjustments to DLS --output semantics. Agreed?
>
>
> On 9/13/10 10:34 AM, Jeff Eastman wrote:
>
>>  r996599 completed the first part. Several additional arguments to EVJ.run
>> need to be added to DLS (maxError, minEigenValue, inMemory, also the
>> --cleansvd flag itself). Also DLS interprets --output as the
>> outputEigenVectorPath and not as the generic output directory so DLS.run()
>> will need another argument too. Still want to do this?
>>
>> On 9/12/10 2:19 PM, Jake Mannix wrote:
>>
>>> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
>>>> least implement a job() method on EVJ.
>>>>
>>>>  +1 for having the latter, with a boolean flag in DLS to optionally call
>>> EVJ
>>> after it's done.
>>>
>>
>>
>

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Jake, I see you are on line. I'm inclined to push forward on this 
despite the adjustments to DLS --output semantics. Agreed?

On 9/13/10 10:34 AM, Jeff Eastman wrote:
>  r996599 completed the first part. Several additional arguments to 
> EVJ.run need to be added to DLS (maxError, minEigenValue, inMemory, 
> also the --cleansvd flag itself). Also DLS interprets --output as the 
> outputEigenVectorPath and not as the generic output directory so 
> DLS.run() will need another argument too. Still want to do this?
>
> On 9/12/10 2:19 PM, Jake Mannix wrote:
>>> +1 on folding EigenVerificationJob into DistributedLanczosSolver. 
>>> Or, at
>>> least implement a job() method on EVJ.
>>>
>> +1 for having the latter, with a boolean flag in DLS to optionally 
>> call EVJ
>> after it's done.
>


Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  r996599 completed the first part. Several additional arguments to 
EVJ.run need to be added to DLS (maxError, minEigenValue, inMemory, also 
the --cleansvd flag itself). Also DLS interprets --output as the 
outputEigenVectorPath and not as the generic output directory so 
DLS.run() will need another argument too. Still want to do this?

On 9/12/10 2:19 PM, Jake Mannix wrote:
>> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
>> least implement a job() method on EVJ.
>>
> +1 for having the latter, with a boolean flag in DLS to optionally call EVJ
> after it's done.


Re: Using SVD with Canopy/KMeans

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 12, 2010 at 12:12 PM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

>  Wow, thanks Jake, that is really definitive. There's a lot of gears
> whirring silently under the hood here. So Grant's command line script would
> need to add a "mahout cleaneigens <...>" line to really get the valid
> eigenvectors (or use the second approach). And too, that step will yield the
> eigenvalues so that either decomposition approach may be used. The problem
> I'm seeing with doing this programmatically is there is no ready Java method
> (e.g. job() called by run() in the clustering stuff) which I could use.
>

Not a problem.  Sorry to be out of this loop so long.


> +1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at
> least implement a job() method on EVJ.
>

+1 for having the latter, with a boolean flag in DLS to optionally call EVJ
after it's done.


> +1 on renaming DistributedMatrix.times() to transposeTimes() to avoid
> confusion.
>

+1  I actually thought I'd even put it in the javadocs, but apparently I
didn't.


> +1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also
> timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?
>

Like I replied to Ted, I think we can have a transparent api and just check
whether the matrix we've been passed is able to be passed around
via side-channel serialization to the M/R nodes.

  -jake

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Wow, thanks Jake, that is really definitive. There's a lot of gears 
whirring silently under the hood here. So Grant's command line script 
would need to add a "mahout cleaneigens <...>" line to really get the 
valid eigenvectors (or use the second approach). And too, that step will 
yield the eigenvalues so that either decomposition approach may be used. 
The problem I'm seeing with doing this programmatically is there is no 
ready Java method (e.g. job() called by run() in the clustering stuff) 
which I could use.

+1 on folding EigenVerificationJob into DistributedLanczosSolver. Or, at 
least implement a job() method on EVJ.

+1 on renaming DistributedMatrix.times() to transposeTimes() to avoid 
confusion.

+1 on adding DistributedMatrix.timesDiagonal(Matrix) and perhaps also 
timesDiagonal(Vector)? Perhaps after the 20.2 retrofit?

On the multiplication correctness, you are right: the unit test compares 
the results of a.transpose().times(b).
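For reference, a dense plain-Java version of the semantics that test pins down (toy 2x2 data; the real test runs over DistributedRowMatrix):

```java
// Plain-Java sketch of the behaviour under discussion: the current
// DistributedRowMatrix.times(b) computes A' * B, i.e. what a dense
// a.transpose().times(b) would give -- hence the transposeTimes() rename.
public class TransposeTimesSketch {

  static double[][] transposeTimes(double[][] a, double[][] b) {
    int cols = a[0].length, n = b[0].length, rows = a.length;
    double[][] out = new double[cols][n];
    for (int i = 0; i < cols; i++)
      for (int j = 0; j < n; j++)
        for (int r = 0; r < rows; r++)
          out[i][j] += a[r][i] * b[r][j];   // (A'B)_ij = sum_r A_ri * B_rj
    return out;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    double[][] b = {{5, 6}, {7, 8}};
    double[][] c = transposeTimes(a, b);    // {{26, 30}, {38, 44}}
    System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
  }
}
```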

Thanks for the post,
Jeff
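A toy plain-Java illustration of the C' * C substitution Jake describes below (made-up 3x2 matrix, no Mahout classes): the top eigenvector of C' * C, found here by power iteration, is the top right-singular vector of C.

```java
// Sketch (not Mahout code): form A = C' * C and power-iterate on A.
// The resulting eigenvector of A is a right-singular vector of C, which is
// what the Lanczos SVD adaptation produces for a non-square row matrix.
public class SquaredCSketch {

  static double[] topRightSingularVector(double[][] c) {
    int n = c[0].length;
    // A = C' * C  (n x n, symmetric)
    double[][] a = new double[n][n];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        for (double[] row : c)
          a[i][j] += row[i] * row[j];
    // power iteration on A converges to its dominant eigenvector
    double[] v = new double[n];
    v[0] = 1;
    for (int it = 0; it < 200; it++) {
      double[] av = new double[n];
      for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
          av[i] += a[i][j] * v[j];
      double norm = 0;
      for (double x : av) norm += x * x;
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) v[i] = av[i] / norm;
    }
    return v;
  }

  public static void main(String[] args) {
    double[][] c = {{3, 1}, {1, 3}, {0, 2}};   // 3 "documents" x 2 "terms"
    double[] v = topRightSingularVector(c);
    System.out.printf("v = (%.4f, %.4f)%n", v[0], v[1]);
  }
}
```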


On 9/11/10 8:21 PM, Jake Mannix wrote:
> I really should have chimed in in this thread earlier... sorry folks.
>
>    Regarding the relation between eigenvector decomposition and SVD:  The
> Lanczos algorithm is specifically for eigenvector decomposition (and hence
> works on symmetric square matrices), and operates by looking at the set of
> vectors { v, Av, A(Av), A(A(Av))), ... } and doing some clever recombination
> on them.  To adapt this to do SVD, you just substitute A = C' * C, for any
> (non-square) matrix C.  You get a set of eigenvectors of C' * C, which in
> turn are the right-singular vectors of C (what is called V, if C = U D V').
>   In particular, if you are representing C as rows of vectors of
> dimensionality numColumns, then the singular vectors you get out are also of
> dimensionality numColumns (i.e. they live in the same vector space as your
> "documents", because they are mixtures of them).
>
>    Regarding DistributedMatrix.times(DistributedMatrix other), yes, it should
> be called transposeTimes(), but unfortunately, timesTranspose is *not* that
> efficient, in terms of Map-Reduce passes.  If you can come up with a way to
> do a single MR pass to compute timesTranspose(), or simply times(), by all
> means, add it!  But the only matrix multiplication I could do on two large
> sparse matrices in one pass over the data was what is currently in there.
>
>    Mahout DistributedRowMatrix does *not* currently have the very easy
> "multiply this matrix by a diagonal matrix", or more generally "multiply by
> this matrix which is so small as to be able to fit in memory in all of the
> mappers".  That would be a very helpful addition.
>
>    Regarding correctness of the matrix multiplication: isn't there, right
> now, a unit test comparing DistributedRowMatrix.times() to
> DenseMatrix.transpose().times(), and checks that the results are the same
> for small matrices?  It should be in TestDistributedRowMatrix or
> something...
>
>    As for outputting eigenvalues (or singular values),
> DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
> eigenvalues aren't valid, and also scales all of the eigenvalues down by the
> max eignenvalue (to avoid floating point overflow), and so the step which
> spits out the nice correctly scaled (and non-spurious) eigenvector/value
> pairs is that done by the "cleaneigens" shell script step (c.f.
> EigenVerificationJob).  This is when EigenVector instances are created.
>
>    It would make sense for EigenVerificationJob to be folded into
> DistributedLanczosSolver, to reduce confusion.  Maybe add a solver flag on
> whether you care about the "cleaning" step.
>
>    One more note:  if you really want to get the projection of your original
> matrix's rows onto the SVD basis (a common desire for dimensionality
> reduction), then what you want is not V, but U' : U is a set of k vectors,
> each of which having numRows (i.e. "numDocuments" / "numUsers")
> dimensionality, but what you want is numRows vectors, each of dimension k
> (these numRows rows have had their dimensionality reduced from sparse
> numColumns size down to dense k), which is U'.
>
>    To get U', instead of V, just do DistributedRowMatrix.transpose() on your
> *original* matrix before feeding it to the DistributedLanczosSolver.  You'll
> get U, and to get U', just take one more transpose of the
> DistributedRowMatrix represented by the k-eigenvectors (possibly you want to
> do the "multiply by D^-1 or D" step here too):
>
>    C = U D V'  -->  C' C = V D U' U D V' = V D^2 V', and
>    C' = V D U' -->  C C' = U D V' V D U' = U D^2 U'
>
>    Overall, however, this only saves you one map reduce pass, because the
> original way of getting U' was to: take SVD (k MR passes), then transpose
> both the output (1 MR pass) and input (1 MR pass), and then multiply them
> together (1 MR pass) = k+3 MR passes.  The new way is: take transpose (1 MR
> pass), then take SVD (k MR passes), then transpose the output (1 MR pass) =
> k + 2 MR passes.
>
>    Of course, it all depends on what you need.  I've often wanted to hang
> onto the V-vectors, because they are useful for "folding-in" new data into a
> larger approximate SVD, and they already have the form of "pseudo-documents"
> and can give you a feel for how you've clustered your input features.
>
>    -jake
>
> On Sat, Sep 11, 2010 at 6:18 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>>   +1 Ok, this makes so much sense now. It had been bothering me that U D
>> would not have the same number of rows as A in the non-square case - our
>> typical. It's interesting that the
>> DistributedLanczosSolver.serializeOutput() method does not actually
>> serialize the eigenvalues. I've fixed that locally using NamedVector but
>> something like WeightedVectorWritable (not itself a Vector) would be better.
>> There's also EigenVector objects which might be used here.
>>
>>
>>
>> On 9/11/10 3:57 PM, Ted Dunning wrote:
>>
>>> Well, firstly the code is doing a singular value decomposition.  This is
>>> similar to an eigenvector decomposition except that there is no assumption
>>> that A is square.  With the SVD, we have
>>>
>>>       A = U D V'
>>>
>>> This is in contrast to the eigenvector decomposition which has
>>>
>>>      A = U D U'
>>>
>>> The difference lies in having both left and right singular vectors, U and
>>> V.
>>>
>>> Practically speaking, with the SVD decomposition, we can consider A or any
>>> other appropriate matrix to be either a pile of row vectors or a bunch of
>>> column vectors and we can project either into the reduced dimensional
>>> representation using U' on the left or V on the right.  In Mahout, the SVD
>>> only gives us one of U or V (I forget which).  Most commonly, I consider A
>>> to be row vectors.  This means that I can use the fact that V' V = I to do
>>> this:
>>>
>>>       A V = U D
>>>
>>> Now, if I wanted to do this some new matrix B, I would need to keep V
>>> around
>>> in order to do the multiplication.  But since I just did the SVD of A, all
>>> I
>>> need is U and D which is what I think that the Mahout eigen-solver gives
>>> us.
>>>   As you say, multiplying by D is trivial because it is diagonal.
>>>
>>> Does that make sense?
>>>
>>> This is similar to what you said, but differs in some details.
>>>
>>> Note also that it is common to split D into two factors:
>>>
>>>      A = U sqrt(D) sqrt(D) V'
>>>
>>> and then use this for the decomposition of A
>>>
>>>      A V sqrt(D)^-1
>>>
>>> The multipication by the inverse of sqrt(D) is trivial since D is
>>> diagonal.
>>>
>>>
>>> On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman<jdog@windwardsolutions.com
>>>> wrote:
>>>    Ted,
>>>> Is this because, for any matrix A, its eigenvector matrix P and its
>>>> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
>>>> full matrix multiplication since it is diagonal? Could you please
>>>> elaborate?
>>>>
>>>> Jeff
>>>>
>>>>
>>>>
>>>> On 9/11/10 2:50 PM, Ted Dunning wrote:
>>>>
>>>>   Should be close.  The matrixMult step may be redundant if you want to
>>>>> cluster the same data that you decomposed.  That would make the second
>>>>> transpose unnecessary as well.
>>>>>
>>>>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<gsingers@apache.org
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>   To put this in bin/mahout speak, this would look like, munging some names
>>>>>> and taking liberties with the actual argument to be passed in:
>>>>>>
>>>>>> bin/mahout svd (original ->    svdOut)
>>>>>> bin/mahout cleansvd ...
>>>>>> bin/mahout transpose svdOut ->    svdT
>>>>>> bin/mahout transpose original ->    originalT
>>>>>> bin/mahout matrixmult originalT svdT ->    newMatrix
>>>>>> bin/mahout kmeans newMatrix
>>>>>>
>>>>>> Is that about right?
>>>>>>
>>>>>>
>>>>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>>>>
>>>>>>   Ok, the transposed computation seems to work and the cast exception
>>>>>> was
>>>>>> caused by my unit test writing LongWritable keys to the testdata file.
>>>>>> The
>>>>>> following test produces a comparable answer to the non-distributed
>>>>>> case.
>>>>>> I
>>>>>> still want to rename the method to transposeTimes for clarity. And
>>>>>> better,
>>>>>> implement timesTranspose to make this particular computation more
>>>>>> efficient:
>>>>>>
>>>>>>> public void testKmeansDSVD() throws Exception {
>>>>>>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>>>>   Path output = getTestTempDirPath("output");
>>>>>>>   Path tmp = getTestTempDirPath("tmp");
>>>>>>>   Path eigenvectors = new Path(output, "eigenvectors");
>>>>>>>   int desiredRank = 13;
>>>>>>>   DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>>>>   Configuration config = new Configuration();
>>>>>>>   solver.setConf(config);
>>>>>>>   Path testData = getTestTempDirPath("testdata");
>>>>>>>   int sampleDimension = sampleData.get(0).get().size();
>>>>>>>   solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>>>>       sampleDimension, false, desiredRank);
>>>>>>>
>>>>>>>   // now multiply the testdata matrix and the eigenvector matrix
>>>>>>>   DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>>>>       tmp, desiredRank - 1, sampleDimension);
>>>>>>>   JobConf conf = new JobConf(config);
>>>>>>>   svdT.configure(conf);
>>>>>>>   DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>>>>       sampleData.size(), sampleDimension);
>>>>>>>   a.configure(conf);
>>>>>>>   DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>>>>   sData.configure(conf);
>>>>>>>
>>>>>>>   // now run the Canopy job to prime kMeans canopies
>>>>>>>   CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>>>>>>>       false);
>>>>>>>   // now run the KMeans job
>>>>>>>   KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>>>>       "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>>>>>   // run ClusterDumper
>>>>>>>   ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>>>>       "clusters-2"), new Path(output, "clusteredPoints"));
>>>>>>>   clusterDumper.printClusters(termDictionary);
>>>>>>> }
>>>>>>>
>>>>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>>>>
>>>>>>>   Looking at the single unit test of DMR.times() it seems to be
>>>>>>>>   implementing Matrix expected = inputA.transpose().times(inputB), and
>>>>>>> not
>>>>>>>
>>>>>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>>>>>> implemented. But the method still has the wrong name and AFAICT is not
>>>>>> useful for performing this particular computation. Should I use this
>>>>>> instead?
>>>>>>
>>>>>>> DistributedRowMatrix sData =
>>>>>>>     a.transpose().t[ransposeT]imes(svdT.transpose())
>>>>>>> ugh! And it still fails with:
>>>>>>>
>>>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>>>>>>>     be cast to org.apache.hadoop.io.IntWritable
>>>>>>>   at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>> --------------------------
>>>>>> Grant Ingersoll
>>>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>>>>> 7-8
>>>>>>
>>>>>>
>>>>>>
>>>>>>


Re: Using SVD with Canopy/KMeans

Posted by Jake Mannix <ja...@gmail.com>.
I really should have chimed in on this thread earlier... sorry, folks.

  Regarding the relation between eigenvector decomposition and SVD:  The
Lanczos algorithm is specifically for eigenvector decomposition (and hence
works on symmetric square matrices), and operates by looking at the set of
vectors { v, Av, A(Av), A(A(Av))), ... } and doing some clever recombination
on them.  To adapt this to do SVD, you just substitute A = C' * C, for any
(non-square) matrix C.  You get a set of eigenvectors of C' * C, which in
turn are the right-singular vectors of C (what is called V, if C = U D V').
 In particular, if you are representing C as rows of vectors of
dimensionality numColumns, then the singular vectors you get out are also of
dimensionality numColumns (i.e. they live in the same vector space as your
"documents", because they are mixtures of them).
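A tiny plain-Java sketch (not Mahout code; all names here are illustrative) of the C' * C trick above: for a matrix C whose SVD is known by construction, the Gramian C' * C has the squared singular values of C as its eigenvalues, and its eigenvectors are the right singular vectors V.

```java
// Toy illustration: the eigenvalues of C' * C are the squared singular
// values of C, and its eigenvectors are the right singular vectors V.
public class SvdViaGramian {

  // Plain dense multiply: returns a * b.
  static double[][] times(double[][] a, double[][] b) {
    int n = a.length, m = b[0].length, k = b.length;
    double[][] c = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        for (int x = 0; x < k; x++)
          c[i][j] += a[i][x] * b[x][j];
    return c;
  }

  static double[][] transpose(double[][] a) {
    double[][] t = new double[a[0].length][a.length];
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < a[0].length; j++)
        t[j][i] = a[i][j];
    return t;
  }

  public static void main(String[] args) {
    // C is 3x2 with singular values 3 and 2 by construction.
    double[][] c = {{3, 0}, {0, 2}, {0, 0}};
    double[][] gram = times(transpose(c), c); // C' * C, a 2x2 symmetric matrix
    // Its diagonal holds the squared singular values; off-diagonals are 0,
    // so the standard basis vectors are the right singular vectors here.
    System.out.println(gram[0][0] + " " + gram[1][1]); // 9.0 4.0
  }
}
```

Running a symmetric eigensolver such as Lanczos on this small Gramian would therefore recover V and the squares of the singular values.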

  Regarding DistributedMatrix.times(DistributedMatrix other), yes, it should
be called transposeTimes(), but unfortunately, timesTranspose is *not* that
efficient, in terms of Map-Reduce passes.  If you can come up with a way to
do a single MR pass to compute timesTranspose(), or simply times(), by all
means, add it!  But the only matrix multiplication I could do on two large
sparse matrices in one pass over the data was what is currently in there.
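The reason transpose-times (but not times-transpose) fits in a single pass can be sketched as follows: A' * B decomposes into a sum of per-row outer products, so a mapper only ever needs row i of A together with row i of B, never a join across different rows. This is a plain-Java sketch of that arithmetic, not the actual Mahout mapper/reducer code.

```java
// Why A' * B fits in one pass over row-partitioned data: if a_i and b_i are
// the i-th rows of A and B, then A' * B = sum_i (a_i' b_i), one outer
// product per row pair. A mapper can emit each outer-product row keyed by
// its index, and the reducers just sum; no join across rows is needed.
public class TransposeTimesSketch {
  static double[][] transposeTimes(double[][] a, double[][] b) {
    int n = a[0].length, m = b[0].length;
    double[][] c = new double[n][m];
    for (int i = 0; i < a.length; i++)      // one "pass" over the rows
      for (int j = 0; j < n; j++)           // outer product a_i' * b_i
        for (int k = 0; k < m; k++)
          c[j][k] += a[i][j] * b[i][k];     // reducers sum these partials
    return c;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    double[][] b = {{5, 6}, {7, 8}};
    // A' * B = {{1*5+3*7, 1*6+3*8}, {2*5+4*7, 2*6+4*8}} = {{26, 30}, {34, 38}}
    System.out.println(java.util.Arrays.deepToString(transposeTimes(a, b)));
  }
}
```

In times-transpose, by contrast, entry (i, j) needs all of row i of A and all of row j of B, which is why it does not reduce to a single pass in the same way.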

  Mahout DistributedRowMatrix does *not* currently have the very easy
"multiply this matrix by a diagonal matrix", or more generally "multiply by
this matrix which is so small as to be able to fit in memory in all of the
mappers".  That would be a very helpful addition.
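A sketch of what such a map-side multiply could look like, in plain Java (the method names are hypothetical, not existing Mahout API): each mapper holds the small right-hand matrix in memory and transforms each row independently; for a diagonal matrix this degenerates to component-wise scaling, with no shuffle at all.

```java
// Sketch: map-side "times small matrix" for a row-partitioned matrix.
// Each mapper keeps the small right-hand matrix in memory and emits
// row.times(small) independently; a diagonal matrix reduces to
// component-wise scaling. These helper names are illustrative only.
public class MapSideTimes {

  // row (1 x n) times a small dense n x k matrix held in memory.
  static double[] timesSmall(double[] row, double[][] small) {
    double[] out = new double[small[0].length];
    for (int j = 0; j < small.length; j++)
      for (int k = 0; k < small[0].length; k++)
        out[k] += row[j] * small[j][k];
    return out;
  }

  // Special case: diagonal matrix, stored as just its diagonal.
  static double[] timesDiagonal(double[] row, double[] diag) {
    double[] out = new double[row.length];
    for (int j = 0; j < row.length; j++)
      out[j] = row[j] * diag[j]; // purely map-side, no shuffle needed
    return out;
  }

  public static void main(String[] args) {
    double[] scaled = timesDiagonal(new double[]{1, 2, 3}, new double[]{2, 0.5, 1});
    System.out.println(java.util.Arrays.toString(scaled)); // [2.0, 1.0, 3.0]
  }
}
```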

  Regarding correctness of the matrix multiplication: isn't there, right
now, a unit test comparing DistributedRowMatrix.times() to
DenseMatrix.transpose().times(), checking that the results are the same
for small matrices?  It should be in TestDistributedRowMatrix or
something...

  As for outputting eigenvalues (or singular values),
DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
eigenvalues aren't valid, and also scales all of the eigenvalues down by the
max eigenvalue (to avoid floating point overflow), and so the step which
spits out the nice correctly scaled (and non-spurious) eigenvector/value
pairs is that done by the "cleaneigens" shell script step (c.f.
EigenVerificationJob).  This is when EigenVector instances are created.
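A hypothetical sketch of that cleaning step (illustrative names only, not the actual EigenVerificationJob code): rescale each eigenvalue back up by the normalizing max eigenvalue, and drop any pair whose verification error is too large.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing analogous to what the "cleaneigens" step
// does: restore the true eigenvalue scale and discard spurious pairs.
public class CleanEigens {
  static class EigenPair {
    final double[] vector;
    final double scaledValue; // eigenvalue divided by maxValue during solving
    final double error;       // verification error, e.g. |A v - lambda v| / |v|
    EigenPair(double[] v, double s, double e) { vector = v; scaledValue = s; error = e; }
  }

  static List<EigenPair> clean(List<EigenPair> raw, double maxValue, double maxError) {
    List<EigenPair> kept = new ArrayList<>();
    for (EigenPair p : raw) {
      if (p.error <= maxError) { // discard spurious eigenvector/value pairs
        kept.add(new EigenPair(p.vector, p.scaledValue * maxValue, p.error));
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<EigenPair> raw = new ArrayList<>();
    raw.add(new EigenPair(new double[]{1, 0}, 0.5, 1e-8));
    raw.add(new EigenPair(new double[]{0, 1}, 0.1, 0.7)); // spurious
    List<EigenPair> kept = clean(raw, 10.0, 1e-3);
    System.out.println(kept.size() + " " + kept.get(0).scaledValue); // 1 5.0
  }
}
```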

  It would make sense for EigenVerificationJob to be folded into
DistributedLanczosSolver, to reduce confusion.  Maybe add a solver flag on
whether you care about the "cleaning" step.

  One more note:  if you really want to get the projection of your original
matrix's rows onto the SVD basis (a common desire for dimensionality
reduction), then what you want is not V, but U': U is a set of k vectors,
each having numRows (i.e. "numDocuments" / "numUsers")
dimensionality, but what you want is numRows vectors, each of dimension k
(these numRows rows have had their dimensionality reduced from sparse
numColumns size down to dense k), which is U'.
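A quick shape check of this projection, in plain Java with illustrative names: multiplying the numRows x numColumns matrix C by k right singular vectors V (as columns) yields C V = U D, i.e. numRows dense vectors of dimension k (modulo the optional D scaling discussed above).

```java
// Shape check for the projection: C is numRows x numColumns, V holds k
// right singular vectors as columns (numColumns x k), and C * V = U * D
// is numRows x k: one dense k-dimensional vector per original row.
public class ProjectRows {
  static double[][] times(double[][] a, double[][] b) {
    double[][] c = new double[a.length][b[0].length];
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < b[0].length; j++)
        for (int x = 0; x < b.length; x++)
          c[i][j] += a[i][x] * b[x][j];
    return c;
  }

  public static void main(String[] args) {
    double[][] c = new double[4][3]; // 4 "documents" in a 3-dim feature space
    double[][] v = new double[3][2]; // k = 2 right singular vectors as columns
    double[][] reduced = times(c, v);
    // numRows vectors of dimension k: the rows of C projected onto the
    // SVD basis (this product equals U * D).
    System.out.println(reduced.length + " x " + reduced[0].length); // 4 x 2
  }
}
```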

  To get U', instead of V, just do DistributedRowMatrix.transpose() on your
*original* matrix before feeding it to the DistributedLanczosSolver.  You'll
get U, and to get U', just take one more transpose of the
DistributedRowMatrix represented by the k-eigenvectors (possibly you want to
do the "multiply by D^-1 or D" step here too):

  C = U D V'  --> C' C = V D U' U D V' = V D^2 V', and
  C' = V D U' --> C C' = U D V' V D U' = U D^2 U'

  Overall, however, this only saves you one map reduce pass, because the
original way of getting U' was to: take SVD (k MR passes), then transpose
both the output (1 MR pass) and input (1 MR pass), and then multiply them
together (1 MR pass) = k+3 MR passes.  The new way is: take transpose (1 MR
pass), then take SVD (k MR passes), then transpose the output (1 MR pass) =
k + 2 MR passes.

  Of course, it all depends on what you need.  I've often wanted to hang
onto the V-vectors, because they are useful for "folding-in" new data into a
larger approximate SVD, and they already have the form of "pseudo-documents"
and can give you a feel for how you've clustered your input features.

  -jake

On Sat, Sep 11, 2010 at 6:18 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

>  +1 Ok, this makes so much sense now. It had been bothering me that U D
> would not have the same number of rows as A in the non-square case - our
> typical. It's interesting that the
> DistributedLanczosSolver.serializeOutput() method does not actually
> serialize the eigenvalues. I've fixed that locally using NamedVector but
> something like WeightedVectorWritable (not itself a Vector) would be better.
> There's also EigenVector objects which might be used here.
>
>
>
> On 9/11/10 3:57 PM, Ted Dunning wrote:
>
>> Well, firstly the code is doing a singular value decomposition.  This is
>> similar to an eigenvector decomposition except that there is no assumption
>> that A is square.  With the SVD, we have
>>
>>      A = U D V'
>>
>> This is in contrast to the eigenvector decomposition which has
>>
>>     A = U D U'
>>
>> The difference lies in having both left and right singular vectors, U and
>> V.
>>
>> Practically speaking, with the SVD decomposition, we can consider A or any
>> other appropriate matrix to be either a pile of row vectors or a bunch of
>> column vectors and we can project either into the reduced dimensional
>> representation using U' on the left or V on the right.  In Mahout, the SVD
>> only gives us one of U or V (I forget which).  Most commonly, I consider A
>> to be row vectors.  This means that I can use the fact that V' V = I to do
>> this:
>>
>>      A V = U D
>>
>> Now, if I wanted to do this to some new matrix B, I would need to keep V
>> around
>> in order to do the multiplication.  But since I just did the SVD of A, all
>> I
>> need is U and D which is what I think that the Mahout eigen-solver gives
>> us.
>>  As you say, multiplying by D is trivial because it is diagonal.
>>
>> Does that make sense?
>>
>> This is similar to what you said, but differs in some details.
>>
>> Note also that it is common to split D into two factors:
>>
>>     A = U sqrt(D) sqrt(D) V'
>>
>> and then use this for the decomposition of A
>>
>>     A V sqrt(D)^-1
>>
>> The multiplication by the inverse of sqrt(D) is trivial since D is
>> diagonal.
>>
>>
>> On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman<jdog@windwardsolutions.com
>> >wrote:
>>
>>   Ted,
>>>
>>> Is this because, for any matrix A, its eigenvector matrix P and its
>>> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
>>> full matrix multiplication since it is diagonal? Could you please
>>> elaborate?
>>>
>>> Jeff
>>>
>>>
>>>
>>> On 9/11/10 2:50 PM, Ted Dunning wrote:
>>>
>>>  Should be close.  The matrixMult step may be redundant if you want to
>>>> cluster the same data that you decomposed.  That would make the second
>>>> transpose unnecessary as well.
>>>>
>>>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<gsingers@apache.org
>>>>
>>>>> wrote:
>>>>>
>>>>  To put this in bin/mahout speak, this would look like, munging some
>>>> names
>>>>
>>>>> and taking liberties with the actual argument to be passed in:
>>>>>
>>>>> bin/mahout svd (original ->   svdOut)
>>>>> bin/mahout cleansvd ...
>>>>> bin/mahout transpose svdOut ->   svdT
>>>>> bin/mahout transpose original ->   originalT
>>>>> bin/mahout matrixmult originalT svdT ->   newMatrix
>>>>> bin/mahout kmeans newMatrix
>>>>>
>>>>> Is that about right?
>>>>>
>>>>>
>>>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>>>
>>>>>  Ok, the transposed computation seems to work and the cast exception
>>>>> was
>>>>> caused by my unit test writing LongWritable keys to the testdata file.
>>>>> The
>>>>> following test produces a comparable answer to the non-distributed
>>>>> case.
>>>>> I
>>>>> still want to rename the method to transposeTimes for clarity. And
>>>>> better,
>>>>> implement timesTranspose to make this particular computation more
>>>>> efficient:
>>>>>
>>>>>   public void testKmeansDSVD() throws Exception {
>>>>>>    DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>>>    Path output = getTestTempDirPath("output");
>>>>>>    Path tmp = getTestTempDirPath("tmp");
>>>>>>    Path eigenvectors = new Path(output, "eigenvectors");
>>>>>>    int desiredRank = 13;
>>>>>>    DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>>>    Configuration config = new Configuration();
>>>>>>    solver.setConf(config);
>>>>>>    Path testData = getTestTempDirPath("testdata");
>>>>>>    int sampleDimension = sampleData.get(0).get().size();
>>>>>>    solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>>>
>>>>>>  sampleDimension, false, desiredRank);
>>>>>
>>>>>     // now multiply the testdata matrix and the eigenvector matrix
>>>>>>    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>>>
>>>>>>  tmp, desiredRank - 1, sampleDimension);
>>>>>
>>>>>     JobConf conf = new JobConf(config);
>>>>>>    svdT.configure(conf);
>>>>>>    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>>>
>>>>>>  sampleData.size(), sampleDimension);
>>>>>
>>>>>     a.configure(conf);
>>>>>>    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>>>    sData.configure(conf);
>>>>>>
>>>>>>    // now run the Canopy job to prime kMeans canopies
>>>>>>    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4,
>>>>>> false,
>>>>>>
>>>>>>  false);
>>>>>
>>>>>     // now run the KMeans job
>>>>>>    KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>>>
>>>>>>  "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>>>
>>>>>     // run ClusterDumper
>>>>>>    ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>>>
>>>>>>  "clusters-2"), new Path(output, "clusteredPoints"));
>>>>>
>>>>>     clusterDumper.printClusters(termDictionary);
>>>>>>  }
>>>>>>
>>>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>>>
>>>>>>  Looking at the single unit test of DMR.times() it seems to be
>>>>>>>
>>>>>>>  implementing Matrix expected = inputA.transpose().times(inputB), and
>>>>>> not
>>>>>>
>>>>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>>>>> implemented. But the method still has the wrong name and AFAICT is not
>>>>> useful for performing this particular computation. Should I use this
>>>>> instead?
>>>>>
>>>>>  DistributedRowMatrix sData =
>>>>>> a.transpose().t[ransposeT]imes(svdT.transpose())
>>>>>> ugh! And it still fails with:
>>>>>>
>>>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable
>>>>>>> cannot
>>>>>>>
>>>>>>>  be cast to org.apache.hadoop.io.IntWritable
>>>>>>    at
>>>>>>
>>>>>
>>>>> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>>
>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>
>>>>>>>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>>>> 7-8
>>>>>
>>>>>
>>>>>
>>>>>
>

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  +1 Ok, this makes so much sense now. It had been bothering me that U D 
would not have the same number of rows as A in the non-square case - our 
typical. It's interesting that the 
DistributedLanczosSolver.serializeOutput() method does not actually 
serialize the eigenvalues. I've fixed that locally using NamedVector but 
something like WeightedVectorWritable (not itself a Vector) would be 
better. There's also EigenVector objects which might be used here.


On 9/11/10 3:57 PM, Ted Dunning wrote:
> Well, firstly the code is doing a singular value decomposition.  This is
> similar to an eigenvector decomposition except that there is no assumption
> that A is square.  With the SVD, we have
>
>       A = U D V'
>
> This is in contrast to the eigenvector decomposition which has
>
>      A = U D U'
>
> The difference lies in having both left and right singular vectors, U and V.
>
> Practically speaking, with the SVD decomposition, we can consider A or any
> other appropriate matrix to be either a pile of row vectors or a bunch of
> column vectors and we can project either into the reduced dimensional
> representation using U' on the left or V on the right.  In Mahout, the SVD
> only gives us one of U or V (I forget which).  Most commonly, I consider A
> to be row vectors.  This means that I can use the fact that V' V = I to do
> this:
>
>       A V = U D
>
> Now, if I wanted to do this some new matrix B, I would need to keep V around
> in order to do the multiplication.  But since I just did the SVD of A, all I
> need is U and D which is what I think that the Mahout eigen-solver gives us.
>   As you say, multiplying by D is trivial because it is diagonal.
>
> Does that make sense?
>
> This is similar to what you said, but differs in some details.
>
> Note also that it is common to split D into two factors:
>
>      A = U sqrt(D) sqrt(D) V'
>
> and then use this for the decomposition of A
>
>      A V sqrt(D)^-1
>
> The multipication by the inverse of sqrt(D) is trivial since D is diagonal.
>
>
> On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>>   Ted,
>>
>> Is this because, for any matrix A, its eigenvector matrix P and its
>> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
>> full matrix multiplication since it is diagonal? Could you please elaborate?
>>
>> Jeff
>>
>>
>>
>> On 9/11/10 2:50 PM, Ted Dunning wrote:
>>
>>> Should be close.  The matrixMult step may be redundant if you want to
>>> cluster the same data that you decomposed.  That would make the second
>>> transpose unnecessary as well.
>>>
>>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<gsingers@apache.org
>>>> wrote:
>>>   To put this in bin/mahout speak, this would look like, munging some names
>>>> and taking liberties with the actual argument to be passed in:
>>>>
>>>> bin/mahout svd (original ->   svdOut)
>>>> bin/mahout cleansvd ...
>>>> bin/mahout transpose svdOut ->   svdT
>>>> bin/mahout transpose original ->   originalT
>>>> bin/mahout matrixmult originalT svdT ->   newMatrix
>>>> bin/mahout kmeans newMatrix
>>>>
>>>> Is that about right?
>>>>
>>>>
>>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>>
>>>>   Ok, the transposed computation seems to work and the cast exception was
>>>> caused by my unit test writing LongWritable keys to the testdata file.
>>>> The
>>>> following test produces a comparable answer to the non-distributed case.
>>>> I
>>>> still want to rename the method to transposeTimes for clarity. And
>>>> better,
>>>> implement timesTranspose to make this particular computation more
>>>> efficient:
>>>>
>>>>>   public void testKmeansDSVD() throws Exception {
>>>>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>>     Path output = getTestTempDirPath("output");
>>>>>     Path tmp = getTestTempDirPath("tmp");
>>>>>     Path eigenvectors = new Path(output, "eigenvectors");
>>>>>     int desiredRank = 13;
>>>>>     DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>>     Configuration config = new Configuration();
>>>>>     solver.setConf(config);
>>>>>     Path testData = getTestTempDirPath("testdata");
>>>>>     int sampleDimension = sampleData.get(0).get().size();
>>>>>     solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>>
>>>> sampleDimension, false, desiredRank);
>>>>
>>>>>     // now multiply the testdata matrix and the eigenvector matrix
>>>>>     DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>>
>>>> tmp, desiredRank - 1, sampleDimension);
>>>>
>>>>>     JobConf conf = new JobConf(config);
>>>>>     svdT.configure(conf);
>>>>>     DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>>
>>>> sampleData.size(), sampleDimension);
>>>>
>>>>>     a.configure(conf);
>>>>>     DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>>     sData.configure(conf);
>>>>>
>>>>>     // now run the Canopy job to prime kMeans canopies
>>>>>     CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>>>>>
>>>> false);
>>>>
>>>>>     // now run the KMeans job
>>>>>     KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>>
>>>> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>>
>>>>>     // run ClusterDumper
>>>>>     ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>>
>>>> "clusters-2"), new Path(output, "clusteredPoints"));
>>>>
>>>>>     clusterDumper.printClusters(termDictionary);
>>>>>   }
>>>>>
>>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>>
>>>>>> Looking at the single unit test of DMR.times() it seems to be
>>>>>>
>>>>> implementing Matrix expected = inputA.transpose().times(inputB), and not
>>>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>>>> implemented. But the method still has the wrong name and AFAICT is not
>>>> useful for performing this particular computation. Should I use this
>>>> instead?
>>>>
>>>>> DistributedRowMatrix sData =
>>>>> a.transpose().t[ransposeT]imes(svdT.transpose())
>>>>> ugh! And it still fails with:
>>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>>>>>>
>>>>> be cast to org.apache.hadoop.io.IntWritable
>>>>>     at
>>>> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>
>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>     at
>>>>>>
>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>>> 7-8
>>>>
>>>>
>>>>


Re: Using SVD with Canopy/KMeans

Posted by Ted Dunning <te...@gmail.com>.
Well, firstly the code is doing a singular value decomposition.  This is
similar to an eigenvector decomposition except that there is no assumption
that A is square.  With the SVD, we have

     A = U D V'

This is in contrast to the eigenvector decomposition which has

    A = U D U'

The difference lies in having both left and right singular vectors, U and V.

Practically speaking, with the SVD decomposition, we can consider A or any
other appropriate matrix to be either a pile of row vectors or a bunch of
column vectors and we can project either into the reduced dimensional
representation using U' on the left or V on the right.  In Mahout, the SVD
only gives us one of U or V (I forget which).  Most commonly, I consider A
to be row vectors.  This means that I can use the fact that V' V = I to do
this:

     A V = U D

Now, if I wanted to do this some new matrix B, I would need to keep V around
in order to do the multiplication.  But since I just did the SVD of A, all I
need is U and D which is what I think that the Mahout eigen-solver gives us.
 As you say, multiplying by D is trivial because it is diagonal.

Does that make sense?

This is similar to what you said, but differs in some details.

Note also that it is common to split D into two factors:

    A = U sqrt(D) sqrt(D) V'

and then use this for the decomposition of A

    A V sqrt(D)^-1

The multipication by the inverse of sqrt(D) is trivial since D is diagonal.


On Sat, Sep 11, 2010 at 3:35 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

>  Ted,
>
> Is this because, for any matrix A, its eigenvector matrix P and its
> diagonal eigenvalue matrix D, that A P = P D? And P D does not require a
> full matrix multiplication since it is diagonal? Could you please elaborate?
>
> Jeff
>
>
>
> On 9/11/10 2:50 PM, Ted Dunning wrote:
>
>> Should be close.  The matrixMult step may be redundant if you want to
>> cluster the same data that you decomposed.  That would make the second
>> transpose unnecessary as well.
>>
>> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<gsingers@apache.org
>> >wrote:
>>
>>  To put this in bin/mahout speak, this would look like, munging some names
>>> and taking liberties with the actual argument to be passed in:
>>>
>>> bin/mahout svd (original ->  svdOut)
>>> bin/mahout cleansvd ...
>>> bin/mahout transpose svdOut ->  svdT
>>> bin/mahout transpose original ->  originalT
>>> bin/mahout matrixmult originalT svdT ->  newMatrix
>>> bin/mahout kmeans newMatrix
>>>
>>> Is that about right?
>>>
>>>
>>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>>
>>>  Ok, the transposed computation seems to work and the cast exception was
>>>>
>>> caused by my unit test writing LongWritable keys to the testdata file.
>>> The
>>> following test produces a comparable answer to the non-distributed case.
>>> I
>>> still want to rename the method to transposeTimes for clarity. And
>>> better,
>>> implement timesTranspose to make this particular computation more
>>> efficient:
>>>
>>>>  public void testKmeansDSVD() throws Exception {
>>>>    DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>>    Path output = getTestTempDirPath("output");
>>>>    Path tmp = getTestTempDirPath("tmp");
>>>>    Path eigenvectors = new Path(output, "eigenvectors");
>>>>    int desiredRank = 13;
>>>>    DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>>    Configuration config = new Configuration();
>>>>    solver.setConf(config);
>>>>    Path testData = getTestTempDirPath("testdata");
>>>>    int sampleDimension = sampleData.get(0).get().size();
>>>>    solver.run(testData, tmp, eigenvectors, sampleData.size(),
>>>>
>>> sampleDimension, false, desiredRank);
>>>
>>>>    // now multiply the testdata matrix and the eigenvector matrix
>>>>    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>>>>
>>> tmp, desiredRank - 1, sampleDimension);
>>>
>>>>    JobConf conf = new JobConf(config);
>>>>    svdT.configure(conf);
>>>>    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>>>>
>>> sampleData.size(), sampleDimension);
>>>
>>>>    a.configure(conf);
>>>>    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>>    sData.configure(conf);
>>>>
>>>>    // now run the Canopy job to prime kMeans canopies
>>>>    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>>>>
>>> false);
>>>
>>>>    // now run the KMeans job
>>>>    KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>>>>
>>> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>
>>>>    // run ClusterDumper
>>>>    ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>>>>
>>> "clusters-2"), new Path(output, "clusteredPoints"));
>>>
>>>>    clusterDumper.printClusters(termDictionary);
>>>>  }
>>>>
>>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>>
>>>>> Looking at the single unit test of DMR.times() it seems to be
>>>>>
>>>> implementing Matrix expected = inputA.transpose().times(inputB), and not
>>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>>> implemented. But the method still has the wrong name and AFAICT is not
>>> useful for performing this particular computation. Should I use this
>>> instead?
>>>
>>>> DistributedRowMatrix sData =
>>>>>
>>>> a.transpose().t[ransposeT]imes(svdT.transpose())
>>>
>>>> ugh! And it still fails with:
>>>>>
>>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>>>>>
>>>> be cast to org.apache.hadoop.io.IntWritable
>>>
>>>>    at
>>>>>
>>>>
>>> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>
>>>>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>    at
>>>>>
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct
>>> 7-8
>>>
>>>
>>>
>

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Ted,

Is this because, for any matrix A, its eigenvector matrix P and its 
diagonal eigenvalue matrix D, that A P = P D? And P D does not require a 
full matrix multiplication since it is diagonal? Could you please elaborate?

Jeff


On 9/11/10 2:50 PM, Ted Dunning wrote:
> Should be close.  The matrixMult step may be redundant if you want to
> cluster the same data that you decomposed.  That would make the second
> transpose unnecessary as well.
>
> On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll<gs...@apache.org>wrote:
>
>> To put this in bin/mahout speak, this would look like, munging some names
>> and taking liberties with the actual argument to be passed in:
>>
>> bin/mahout svd (original ->  svdOut)
>> bin/mahout cleansvd ...
>> bin/mahout transpose svdOut ->  svdT
>> bin/mahout transpose original ->  originalT
>> bin/mahout matrixmult originalT svdT ->  newMatrix
>> bin/mahout kmeans newMatrix
>>
>> Is that about right?
>>
>>
>> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>>
>>> Ok, the transposed computation seems to work and the cast exception was
>> caused by my unit test writing LongWritable keys to the testdata file. The
>> following test produces a comparable answer to the non-distributed case. I
>> still want to rename the method to transposeTimes for clarity. And better,
>> implement timesTranspose to make this particular computation more efficient:
>>>   public void testKmeansDSVD() throws Exception {
>>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>     Path output = getTestTempDirPath("output");
>>>     Path tmp = getTestTempDirPath("tmp");
>>>     Path eigenvectors = new Path(output, "eigenvectors");
>>>     int desiredRank = 13;
>>>     DistributedLanczosSolver solver = new DistributedLanczosSolver();
>>>     Configuration config = new Configuration();
>>>     solver.setConf(config);
>>>     Path testData = getTestTempDirPath("testdata");
>>>     int sampleDimension = sampleData.get(0).get().size();
>>>     solver.run(testData, tmp, eigenvectors, sampleData.size(),
>> sampleDimension, false, desiredRank);
>>>     // now multiply the testdata matrix and the eigenvector matrix
>>>     DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
>> tmp, desiredRank - 1, sampleDimension);
>>>     JobConf conf = new JobConf(config);
>>>     svdT.configure(conf);
>>>     DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
>> sampleData.size(), sampleDimension);
>>>     a.configure(conf);
>>>     DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>>>     sData.configure(conf);
>>>
>>>     // now run the Canopy job to prime kMeans canopies
>>>     CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
>> false);
>>>     // now run the KMeans job
>>>     KMeansDriver.runJob(sData.getRowPath(), new Path(output,
>> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>>>     // run ClusterDumper
>>>     ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
>> "clusters-2"), new Path(output, "clusteredPoints"));
>>>     clusterDumper.printClusters(termDictionary);
>>>   }
>>>
>>> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>>>> Looking at the single unit test of DMR.times() it seems to be
>> implementing Matrix expected = inputA.transpose().times(inputB), and not
>> inputA.times(inputB.transpose()), so the bounds checking is correct as
>> implemented. But the method still has the wrong name and AFAICT is not
>> useful for performing this particular computation. Should I use this
>> instead?
>>>> DistributedRowMatrix sData =
>> a.transpose().t[ransposeT]imes(svdT.transpose())
>>>> ugh! And it still fails with:
>>>>
>>>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
>> be cast to org.apache.hadoop.io.IntWritable
>>>>     at
>> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>
>> --------------------------
>> Grant Ingersoll
>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>
>>
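A plain-Java sketch (hypothetical names, no Mahout dependency) of what this exchange concludes DistributedRowMatrix.times() actually computes, namely A' * B. It shows why the bounds check compares the two operands' row counts rather than A's columns against B's rows:

```java
// Hypothetical illustration, not Mahout code: per this thread,
// DistributedRowMatrix.times(B) behaves like transposeTimes, computing
// A' * B. Both operands must therefore share the same ROW count, which
// is exactly the numRows != other.numRows() check discussed above.
public class TransposeTimesSketch {

  public static double[][] transposeTimes(double[][] a, double[][] b) {
    if (a.length != b.length) {            // numRows != other.numRows()
      throw new IllegalArgumentException(
          "row counts must match: " + a.length + " vs " + b.length);
    }
    int m = a.length;        // shared row count
    int n = a[0].length;     // columns of A -> rows of the result
    int p = b[0].length;     // columns of B -> columns of the result
    double[][] c = new double[n][p];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < p; j++) {
        double sum = 0.0;
        for (int k = 0; k < m; k++) {
          sum += a[k][i] * b[k][j];        // (A')[i][k] * B[k][j]
        }
        c[i][j] = sum;
      }
    }
    return c;
  }

  public static void main(String[] args) {
    double[][] a = new double[15][39];     // shaped like the 15 x 39 test data
    double[][] b = new double[15][7];      // shares A's 15 rows
    double[][] c = transposeTimes(a, b);   // result is 39 x 7
    System.out.println(c.length + " x " + c[0].length);
  }
}
```

With these semantics, a.transpose().times(svdT.transpose()) in the test above works out to (A')' * (P')' ... i.e. the projection A * P' in disguise, at the cost of two extra transpose jobs.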


Re: Using SVD with Canopy/KMeans

Posted by Ted Dunning <te...@gmail.com>.
Should be close.  The matrixMult step may be redundant if you want to
cluster the same data that you decomposed.  That would make the second
transpose unnecessary as well.

On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <gs...@apache.org> wrote:
On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <gs...@apache.org> wrote:

> To put this in bin/mahout speak, this would look like, munging some names
> and taking liberties with the actual argument to be passed in:
>
> bin/mahout svd (original -> svdOut)
> bin/mahout cleansvd ...
> bin/mahout transpose svdOut -> svdT
> bin/mahout transpose original -> originalT
> bin/mahout matrixmult originalT svdT -> newMatrix
> bin/mahout kmeans newMatrix
>
> Is that about right?
>
>
> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>
> > Ok, the transposed computation seems to work and the cast exception was
> caused by my unit test writing LongWritable keys to the testdata file. The
> following test produces a comparable answer to the non-distributed case. I
> still want to rename the method to transposeTimes for clarity. And better,
> implement timesTranspose to make this particular computation more efficient:
> >
> >  public void testKmeansDSVD() throws Exception {
> >    DistanceMeasure measure = new EuclideanDistanceMeasure();
> >    Path output = getTestTempDirPath("output");
> >    Path tmp = getTestTempDirPath("tmp");
> >    Path eigenvectors = new Path(output, "eigenvectors");
> >    int desiredRank = 13;
> >    DistributedLanczosSolver solver = new DistributedLanczosSolver();
> >    Configuration config = new Configuration();
> >    solver.setConf(config);
> >    Path testData = getTestTempDirPath("testdata");
> >    int sampleDimension = sampleData.get(0).get().size();
> >    solver.run(testData, tmp, eigenvectors, sampleData.size(),
> sampleDimension, false, desiredRank);
> >
> >    // now multiply the testdata matrix and the eigenvector matrix
> >    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors,
> tmp, desiredRank - 1, sampleDimension);
> >    JobConf conf = new JobConf(config);
> >    svdT.configure(conf);
> >    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp,
> sampleData.size(), sampleDimension);
> >    a.configure(conf);
> >    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >    sData.configure(conf);
> >
> >    // now run the Canopy job to prime kMeans canopies
> >    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false,
> false);
> >    // now run the KMeans job
> >    KMeansDriver.runJob(sData.getRowPath(), new Path(output,
> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
> >    // run ClusterDumper
> >    ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
> "clusters-2"), new Path(output, "clusteredPoints"));
> >    clusterDumper.printClusters(termDictionary);
> >  }
> >
> > On 9/3/10 7:54 AM, Jeff Eastman wrote:
> >> Looking at the single unit test of DMR.times() it seems to be
> implementing Matrix expected = inputA.transpose().times(inputB), and not
> inputA.times(inputB.transpose()), so the bounds checking is correct as
> implemented. But the method still has the wrong name and AFAICT is not
> useful for performing this particular computation. Should I use this
> instead?
> >>
> >> DistributedRowMatrix sData =
> a.transpose().t[ransposeT]imes(svdT.transpose())
> >>
> >> ugh! And it still fails with:
> >>
> >> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> be cast to org.apache.hadoop.io.IntWritable
> >>    at
> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
>

Re: Using SVD with Canopy/KMeans

Posted by Grant Ingersoll <gs...@apache.org>.
To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:

bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix

Is that about right?


On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:

> Ok, the transposed computation seems to work and the cast exception was caused by my unit test writing LongWritable keys to the testdata file. The following test produces a comparable answer to the non-distributed case. I still want to rename the method to transposeTimes for clarity. And better, implement timesTranspose to make this particular computation more efficient:
> 
>  public void testKmeansDSVD() throws Exception {
>    DistanceMeasure measure = new EuclideanDistanceMeasure();
>    Path output = getTestTempDirPath("output");
>    Path tmp = getTestTempDirPath("tmp");
>    Path eigenvectors = new Path(output, "eigenvectors");
>    int desiredRank = 13;
>    DistributedLanczosSolver solver = new DistributedLanczosSolver();
>    Configuration config = new Configuration();
>    solver.setConf(config);
>    Path testData = getTestTempDirPath("testdata");
>    int sampleDimension = sampleData.get(0).get().size();
>    solver.run(testData, tmp, eigenvectors, sampleData.size(), sampleDimension, false, desiredRank);
> 
>    // now multiply the testdata matrix and the eigenvector matrix
>    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, tmp, desiredRank - 1, sampleDimension);
>    JobConf conf = new JobConf(config);
>    svdT.configure(conf);
>    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, sampleData.size(), sampleDimension);
>    a.configure(conf);
>    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
>    sData.configure(conf);
> 
>    // now run the Canopy job to prime kMeans canopies
>    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
>    // now run the KMeans job
>    KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>    // run ClusterDumper
>    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"), new Path(output, "clusteredPoints"));
>    clusterDumper.printClusters(termDictionary);
>  }
> 
> On 9/3/10 7:54 AM, Jeff Eastman wrote:
>> Looking at the single unit test of DMR.times() it seems to be implementing Matrix expected = inputA.transpose().times(inputB), and not inputA.times(inputB.transpose()), so the bounds checking is correct as implemented. But the method still has the wrong name and AFAICT is not useful for performing this particular computation. Should I use this instead?
>> 
>> DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
>> 
>> ugh! And it still fails with:
>> 
>> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
>>    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
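A dimension-level sketch (plain Java, hypothetical names; the real steps run Hadoop jobs over DistributedRowMatrix sequence files) of the matrix algebra behind Grant's pipeline, assuming matrixmult has the transposeTimes (A' * B) semantics discussed in this thread:

```java
// Sketch of the bin/mahout pipeline with plain arrays, dimensions only.
// With original 15 x 39 and svdOut holding 7 eigenvectors of length 39,
// the two transposes followed by matrixmult yield the 15 x 7 projection.
public class PipelineSketch {

  public static double[][] transpose(double[][] m) {
    double[][] t = new double[m[0].length][m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m[0].length; j++) {
        t[j][i] = m[i][j];
      }
    }
    return t;
  }

  // matrixmult per this thread: computes A' * B
  public static double[][] transposeTimes(double[][] a, double[][] b) {
    int m = a.length, n = a[0].length, p = b[0].length;
    double[][] c = new double[n][p];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < p; j++) {
        for (int k = 0; k < m; k++) {
          c[i][j] += a[k][i] * b[k][j];
        }
      }
    }
    return c;
  }

  public static void main(String[] args) {
    double[][] original = new double[15][39];   // 15 docs x 39 terms
    double[][] svdOut = new double[7][39];      // 7 eigenvectors as rows
    double[][] svdT = transpose(svdOut);        // 39 x 7
    double[][] originalT = transpose(original); // 39 x 15
    // matrixmult originalT svdT  =>  originalT' * svdT  =  original * svdT
    double[][] newMatrix = transposeTimes(originalT, svdT);
    System.out.println(newMatrix.length + " x " + newMatrix[0].length); // 15 x 7
  }
}
```

This also makes Ted's point below concrete: if the data to be clustered is the same data that was decomposed, the projection can be had more cheaply, and the extra transposes drop out.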


Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Ok, the transposed computation seems to work and the cast exception 
was caused by my unit test writing LongWritable keys to the testdata 
file. The following test produces a comparable answer to the 
non-distributed case. I still want to rename the method to 
transposeTimes for clarity. And better, implement timesTranspose to make 
this particular computation more efficient:

   public void testKmeansDSVD() throws Exception {
     DistanceMeasure measure = new EuclideanDistanceMeasure();
     Path output = getTestTempDirPath("output");
     Path tmp = getTestTempDirPath("tmp");
     Path eigenvectors = new Path(output, "eigenvectors");
     int desiredRank = 13;
     DistributedLanczosSolver solver = new DistributedLanczosSolver();
     Configuration config = new Configuration();
     solver.setConf(config);
     Path testData = getTestTempDirPath("testdata");
     int sampleDimension = sampleData.get(0).get().size();
     solver.run(testData, tmp, eigenvectors, sampleData.size(), 
sampleDimension, false, desiredRank);

     // now multiply the testdata matrix and the eigenvector matrix
     DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, 
tmp, desiredRank - 1, sampleDimension);
     JobConf conf = new JobConf(config);
     svdT.configure(conf);
     DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, 
sampleData.size(), sampleDimension);
     a.configure(conf);
     DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
     sData.configure(conf);

     // now run the Canopy job to prime kMeans canopies
     CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, 
false, false);
     // now run the KMeans job
     KMeansDriver.runJob(sData.getRowPath(), new Path(output, 
"clusters-0"), output, measure, 0.001, 10, 1, true, false);
     // run ClusterDumper
     ClusterDumper clusterDumper = new ClusterDumper(new Path(output, 
"clusters-2"), new Path(output, "clusteredPoints"));
     clusterDumper.printClusters(termDictionary);
   }

On 9/3/10 7:54 AM, Jeff Eastman wrote:
>  Looking at the single unit test of DMR.times() it seems to be 
> implementing Matrix expected = inputA.transpose().times(inputB), and 
> not inputA.times(inputB.transpose()), so the bounds checking is 
> correct as implemented. But the method still has the wrong name and 
> AFAICT is not useful for performing this particular computation. 
> Should I use this instead?
>
> DistributedRowMatrix sData = 
> a.transpose().t[ransposeT]imes(svdT.transpose())
>
> ugh! And it still fails with:
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot 
> be cast to org.apache.hadoop.io.IntWritable
>     at 
> org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Looking at the single unit test of DMR.times() it seems to be 
implementing Matrix expected = inputA.transpose().times(inputB), and not 
inputA.times(inputB.transpose()), so the bounds checking is correct as 
implemented. But the method still has the wrong name and AFAICT is not 
useful for performing this particular computation. Should I use this 
instead?

DistributedRowMatrix sData = 
a.transpose().t[ransposeT]imes(svdT.transpose())

ugh! And it still fails with:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot 
be cast to org.apache.hadoop.io.IntWritable
     at 
org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

On 9/3/10 7:15 AM, Jeff Eastman wrote:
>  Ah, ok, that would explain some of the cardinality checking 
> differences. It's kind of confusing to have the times() method doing 
> something different from Matrix in the distributed case. Maybe 
> renaming to transposeTimes would be clearer? At least adding some 
> comments for the unwary?
>
> Here's my revised test using DistributedRowMatrix for the 
> computations. It fails in DMR.times() with a CardinalityException. 
> Looking at the source of the exception (again), it is comparing 
> numRows != other.numRows(). Since matrix A is [15x39] and svdT is 
> [12x39] shouldn't A.t[ransposeT]imes(svdT) be comparing the numCols 
> instead? What exactly is DMR.times() doing?
>
>   @Test
>   public void testKmeansDSVD() throws Exception {
>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>     Path output = getTestTempDirPath("output");
>     Path tmp = getTestTempDirPath("tmp");
>     Path eigenvectors = new Path(output, "eigenvectors");
>     int desiredRank = 13;
>     DistributedLanczosSolver solver = new DistributedLanczosSolver();
>     Configuration config = new Configuration();
>     solver.setConf(config);
>     Path testData = getTestTempDirPath("testdata");
>     int sampleDimension = sampleData.get(0).get().size();
>     solver.run(testData, tmp, eigenvectors, sampleData.size(), 
> sampleDimension, false, desiredRank);
>
>     // now multiply the testdata matrix and the eigenvector matrix
>     DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, 
> tmp, desiredRank - 1, sampleDimension);
>     JobConf conf = new JobConf(config);
>     svdT.configure(conf);
>     DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, 
> sampleData.size(), sampleDimension);
>     a.configure(conf);
>     // DMR.times() is really transposeTimes()? Then this should work.
>     DistributedRowMatrix sData = a.times(svdT);
>     sData.configure(conf);
>
>     // now run the Canopy job to prime kMeans canopies
>     CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, 
> false, false);
>     // now run the KMeans job
>     KMeansDriver.runJob(sData.getRowPath(), new Path(output, 
> "clusters-0"), output, measure, 0.001, 10, 1, true, false);
>     // run ClusterDumper
>     ClusterDumper clusterDumper = new ClusterDumper(new Path(output, 
> "clusters-2"), new Path(output, "clusteredPoints"));
>     clusterDumper.printClusters(termDictionary);
>   }
>
>
>
> On 9/2/10 10:10 PM, Ted Dunning wrote:
>> I think that the solver actually does an SVD, but most of what you say
>> follows.
>>
>> There is one strangeness, I think, in that the 
>> DistributedRowMatrix.times is
>> doing a transposeTimes operation, not the normal times.
>>
>> Jake should comment.
>>
>> On Thu, Sep 2, 2010 at 8:28 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>>
>>>   On 9/2/10 7:41 PM, Jeff Eastman wrote:
>>>
>>>>   Hopefully answering my own question here but ending up with 
>>>> another. The
>>>> svd matrix I'd built from the eigenvectors is the wrong shape as I 
>>>> built it.
>>>> Taking Jake's "column space" literally and building a matrix where 
>>>> each of
>>>> the columns is one of the eigenvectors does give a matrix of the 
>>>> correct
>>>> shape. The math works with DenseMatrix, producing a new data matrix 
>>>> which is
>>>> 15x7; a significant dimensionality reduction from 15x39.
>>>>
>>>> In this example, with 15 samples having 39 terms and 7 eigenvectors:
>>>>     A = [15x39]
>>>>     P = [39x7]
>>>>     A P = [15x7]
>>>> <snip>
>>>>
>>> Representing the eigen decomposition math in the above notation, A P 
>>> is the
>>> projection of the data set onto the eigenvector basis:
>>>
>>> If:
>>> A = original data matrix
>>> P = eigenvector column matrix
>>> D = eigenvalue diagonal matrix
>>>
>>> Then:
>>> A P = P D =>  A = P D P'
>>>
>>> Since we have A, and P is already calculated by 
>>> DistributedLanczosSolver, it
>>> is easy to compute A P, and we don't need the eigenvalues at all. 
>>> This is
>>> good because the DLS does not output them. Is this why it doesn't 
>>> bother?
>>>
>>>
>
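A tiny numeric check (plain Java, a hypothetical 2x2 example, not Mahout code) of the identity A P = P D quoted above: for a symmetric A with eigenvector column matrix P and eigenvalue diagonal D, the two products agree entry by entry, which is why the projection A P can be computed without ever materializing D.

```java
// Verify A P == P D on a hand-worked symmetric example:
// A = [[2,1],[1,2]] has eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2)
// with eigenvalues 3 and 1 respectively.
public class EigenIdentitySketch {
  public static void main(String[] args) {
    double s = 1.0 / Math.sqrt(2.0);
    double[][] a = {{2, 1}, {1, 2}};
    double[][] p = {{s, s}, {s, -s}};   // eigenvectors as columns
    double[]   d = {3, 1};              // diagonal of D
    for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 2; j++) {
        double ap = a[i][0] * p[0][j] + a[i][1] * p[1][j]; // (A P)[i][j]
        double pd = p[i][j] * d[j];                        // (P D)[i][j]
        if (Math.abs(ap - pd) > 1e-12) {
          throw new AssertionError("A P != P D at " + i + "," + j);
        }
      }
    }
    System.out.println("A P == P D");
  }
}
```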


Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Ah, ok, that would explain some of the cardinality checking 
differences. It's kind of confusing to have the times() method doing 
something different from Matrix in the distributed case. Maybe renaming 
to transposeTimes would be clearer? At least adding some comments for 
the unwary?

Here's my revised test using DistributedRowMatrix for the computations. 
It fails in DMR.times() with a CardinalityException. Looking at the 
source of the exception (again), it is comparing numRows != 
other.numRows(). Since matrix A is [15x39] and svdT is [12x39] shouldn't 
A.t[ransposeT]imes(svdT) be comparing the numCols instead? What exactly 
is DMR.times() doing?

   @Test
   public void testKmeansDSVD() throws Exception {
     DistanceMeasure measure = new EuclideanDistanceMeasure();
     Path output = getTestTempDirPath("output");
     Path tmp = getTestTempDirPath("tmp");
     Path eigenvectors = new Path(output, "eigenvectors");
     int desiredRank = 13;
     DistributedLanczosSolver solver = new DistributedLanczosSolver();
     Configuration config = new Configuration();
     solver.setConf(config);
     Path testData = getTestTempDirPath("testdata");
     int sampleDimension = sampleData.get(0).get().size();
     solver.run(testData, tmp, eigenvectors, sampleData.size(), 
sampleDimension, false, desiredRank);

     // now multiply the testdata matrix and the eigenvector matrix
     DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, 
tmp, desiredRank - 1, sampleDimension);
     JobConf conf = new JobConf(config);
     svdT.configure(conf);
     DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, 
sampleData.size(), sampleDimension);
     a.configure(conf);
     // DMR.times() is really transposeTimes()? Then this should work.
     DistributedRowMatrix sData = a.times(svdT);
     sData.configure(conf);

     // now run the Canopy job to prime kMeans canopies
     CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, 
false, false);
     // now run the KMeans job
     KMeansDriver.runJob(sData.getRowPath(), new Path(output, 
"clusters-0"), output, measure, 0.001, 10, 1, true, false);
     // run ClusterDumper
     ClusterDumper clusterDumper = new ClusterDumper(new Path(output, 
"clusters-2"), new Path(output, "clusteredPoints"));
     clusterDumper.printClusters(termDictionary);
   }



On 9/2/10 10:10 PM, Ted Dunning wrote:
> I think that the solver actually does an SVD, but most of what you say
> follows.
>
> There is one strangeness, I think, in that the DistributedRowMatrix.times is
> doing a transposeTimes operation, not the normal times.
>
> Jake should comment.
>
On Thu, Sep 2, 2010 at 8:28 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>>   On 9/2/10 7:41 PM, Jeff Eastman wrote:
>>
>>>   Hopefully answering my own question here but ending up with another. The
>>> svd matrix I'd built from the eigenvectors is the wrong shape as I built it.
>>> Taking Jake's "column space" literally and building a matrix where each of
>>> the columns is one of the eigenvectors does give a matrix of the correct
>>> shape. The math works with DenseMatrix, producing a new data matrix which is
>>> 15x7; a significant dimensionality reduction from 15x39.
>>>
>>> In this example, with 15 samples having 39 terms and 7 eigenvectors:
>>>     A = [15x39]
>>>     P = [39x7]
>>>     A P = [15x7]
>>> <snip>
>>>
>> Representing the eigen decomposition math in the above notation, A P is the
>> projection of the data set onto the eigenvector basis:
>>
>> If:
>> A = original data matrix
>> P = eigenvector column matrix
>> D = eigenvalue diagonal matrix
>>
>> Then:
>> A P = P D =>  A = P D P'
>>
>> Since we have A, and P is already calculated by DistributedLanczosSolver, it
>> is easy to compute A P, and we don't need the eigenvalues at all. This is
>> good because the DLS does not output them. Is this why it doesn't bother?
>>
>>


Re: Using SVD with Canopy/KMeans

Posted by Ted Dunning <te...@gmail.com>.
I think that the solver actually does an SVD, but most of what you say
follows.

There is one strangeness, I think, in that DistributedRowMatrix.times is
doing a transposeTimes operation, not the normal times.

Jake should comment.

On Thu, Sep 2, 2010 at 8:28 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

>  On 9/2/10 7:41 PM, Jeff Eastman wrote:
>
>>  Hopefully answering my own question here but ending up with another. The
>> svd matrix I'd built from the eigenvectors is the wrong shape as I built it.
>> Taking Jake's "column space" literally and building a matrix where each of
>> the columns is one of the eigenvectors does give a matrix of the correct
>> shape. The math works with DenseMatrix, producing a new data matrix which is
>> 15x7; a significant dimensionality reduction from 15x39.
>>
>> In this example, with 15 samples having 39 terms and 7 eigenvectors:
>>    A = [15x39]
>>    P = [39x7]
>>    A P = [15x7]
>> <snip>
>>
> Representing the eigen decomposition math in the above notation, A P is the
> projection of the data set onto the eigenvector basis:
>
> If:
> A = original data matrix
> P = eigenvector column matrix
> D = eigenvalue diagonal matrix
>
> Then:
> A P = P D => A = P D P'
>
> Since we have A, and P is already calculated by DistributedLanczosSolver, it
> is easy to compute A P and we don't need the eigenvalues at all. This is
> good because the DLS does not output them. Is this why it doesn't bother?
>
>

Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  On 9/2/10 7:41 PM, Jeff Eastman wrote:
>  Hopefully answering my own question here but ending up with another. 
> The svd matrix I'd built from the eigenvectors is the wrong shape as I 
> built it. Taking Jake's "column space" literally and building a matrix 
> where each of the columns is one of the eigenvectors does give a 
> matrix of the correct shape. The math works with DenseMatrix, 
> producing a new data matrix which is 15x7; a significant 
> dimensionality reduction from 15x39.
>
> In this example, with 15 samples having 39 terms and 7 eigenvectors:
>     A = [15x39]
>     P = [39x7]
>     A P = [15x7]
> <snip>
Representing the eigen decomposition math in the above notation, A P is 
the projection of the data set onto the eigenvector basis:

If:
A = original data matrix
P = eigenvector column matrix
D = eigenvalue diagonal matrix

Then:
A P = P D => A = P D P'

Since we have A, and P is already calculated by DistributedLanczosSolver, 
it is easy to compute A P and we don't need the eigenvalues at all. This 
is good because the DLS does not output them. Is this why it doesn't bother?
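As a sanity check on the A P = P D relation above: it holds exactly when A is square symmetric and the columns of P are its eigenvectors. A toy numeric check with a 2x2 matrix whose eigenpairs are known in closed form (illustrative only; this is not the Lanczos code):

```java
public class EigenCheck {
    // Plain matrix-vector product: A [n x n] times v [n] -> [n].
    static double[] matVec(double[][] a, double[] v) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < v.length; j++) {
                r[i] += a[i][j] * v[j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // A = [[2,1],[1,2]] has eigenvalues 3 and 1, with (normalized)
        // eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2).
        double[][] a = {{2.0, 1.0}, {1.0, 2.0}};
        double s = Math.sqrt(0.5);
        double[] p1 = {s, s};   // eigenvector for lambda = 3
        double[] p2 = {s, -s};  // eigenvector for lambda = 1
        double[] ap1 = matVec(a, p1);
        double[] ap2 = matVec(a, p2);
        // A p = lambda p, i.e. the column relation A P = P D, one column at a time
        System.out.println(Math.abs(ap1[0] - 3 * p1[0]) < 1e-12
            && Math.abs(ap1[1] - 3 * p1[1]) < 1e-12
            && Math.abs(ap2[0] - 1 * p2[0]) < 1e-12
            && Math.abs(ap2[1] - 1 * p2[1]) < 1e-12);
    }
}
```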


Re: Using SVD with Canopy/KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  Hopefully answering my own question here but ending up with another. 
The svd matrix I'd built from the eigenvectors is the wrong shape as I 
built it. Taking Jake's "column space" literally and building a matrix 
where each of the columns is one of the eigenvectors does give a matrix 
of the correct shape. The math works with DenseMatrix, producing a new 
data matrix which is 15x7; a significant dimensionality reduction from 
15x39.

In this example, with 15 samples having 39 terms and 7 eigenvectors:
     A = [15x39]
     P = [39x7]
     A P = [15x7]
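A minimal shape check of that projection with plain arrays (illustrative only, not Mahout code):

```java
public class ProjectionShapes {
    // Plain O(m*n*k) matrix multiply: A [m x n] times P [n x k] -> [m x k].
    static double[][] times(double[][] a, double[][] p) {
        int m = a.length, n = a[0].length, k = p[0].length;
        if (n != p.length) {
            throw new IllegalArgumentException("cols(A)=" + n + " != rows(P)=" + p.length);
        }
        double[][] c = new double[m][k];
        for (int i = 0; i < m; i++) {
            for (int x = 0; x < n; x++) {
                for (int j = 0; j < k; j++) {
                    c[i][j] += a[i][x] * p[x][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = new double[15][39]; // 15 samples, 39 terms
        double[][] p = new double[39][7];  // 7 eigenvectors as columns
        double[][] ap = times(a, p);
        System.out.println(ap.length + "x" + ap[0].length); // prints 15x7
    }
}
```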

Running Canopy and then KMeans against AP produces 5 clusters, the 
goodness of which is a bit hard for me to ascertain right now, but they 
do have the reduced number of terms.

To do this with DistributedRowMatrix, I think I need to use 
svd.transpose() instead to get my original svd matrix into the correct 
shape for multiplication.

For the same example:
     data = [15x39]
     svd = [7x39] as I've constructed it
     svd.transpose = [39x7] which is the correct shape

When I run this, however, it fails with a cardinality exception and I'm 
perplexed by this line from DistributedRowMatrix.times():

     if (numRows != other.numRows()) {
       throw new CardinalityException(numRows, other.numRows());
     }

... which is inconsistent with AbstractMatrix.times() [and, I think, 
also incorrect]:

     int[] c = size();
     int[] o = other.size();
     if (c[COL] != o[ROW]) {
       throw new CardinalityException(c[COL], o[ROW]);
     }

If I change the numRows to numCols in DRM.times() it gets past the 
cardinality test but blows up with an org.apache.hadoop.io.LongWritable 
cannot be cast to org.apache.hadoop.io.IntWritable error somewhere in 
the bowels of times(). I need to keep debugging to localize that.
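One reading that makes the numRows check consistent: if times() is really computing this' x other (a transposeTimes), then it is the two ROW counts that must agree. A toy sketch of that semantics with plain arrays (an assumption about the behavior, not the actual Mahout code):

```java
public class TransposeTimes {
    // Computes A' * B for A [r x n], B [r x k] -> [n x k].
    // Note the guard: the ROW counts must agree, which mirrors the
    // numRows test in DistributedRowMatrix.times() quoted above.
    static double[][] transposeTimes(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(a.length + " != " + b.length);
        }
        int n = a[0].length, k = b[0].length;
        double[][] c = new double[n][k];
        for (int r = 0; r < a.length; r++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < k; j++) {
                    c[i][j] += a[r][i] * b[r][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // A [15x39] and B [15x7] share 15 rows, so A' B is well defined:
        double[][] c = transposeTimes(new double[15][39], new double[15][7]);
        System.out.println(c.length + "x" + c[0].length); // prints 39x7
        // whereas a [15x...] times a [39x...] argument fails the row check,
        // which is the cardinality exception seen above.
    }
}
```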

So the upshot is I can make it work using DenseMatrix but not with 
DistributedRowMatrix.



On 9/2/10 1:45 PM, Jeff Eastman wrote:
>  (cross-posting to dev)
>
> Hi Jake,
>
> I'm on thin ice here, but just a few more words on the math details 
> here would help me sort this out. I've run the 
> DistributedLanczosSolver on the small testdata set in TestClusterDumper:
>
>     Path output = getTestTempDirPath("output");
>     Path tmp = getTestTempDirPath("tmp");
>     Configuration config = new Configuration();
>     Path eigenvectors = new Path(output, "eigenvectors");
>     config.set("mapred.output.dir", eigenvectors.toString());
>     DistributedLanczosSolver solver = new DistributedLanczosSolver();
>     solver.setConf(config);
>     Path testData = getTestTempDirPath("testdata");
>     solver.run(testData, tmp, sampleData.size(), 
> sampleData.get(0).get().size(), false, 8);
>
> This produces 7 (not 8?) vectors in the eigenvectors file. If I then 
> build DistributedRowMatrices out of these I get matrices that are 
> ill-shaped to multiply directly. Clearly a literal translation of your 
> text is incorrect:
>
>     // now multiply the testdata matrix and the eigenvector matrix
>     DistributedRowMatrix svd = new DistributedRowMatrix(eigenvectors, 
> tmp, 8, 38);
>     DistributedRowMatrix data= new DistributedRowMatrix(testData, tmp, 
> 15, 38);
>     DistributedRowMatrix sData = data.times(svd);
>
>     // now run the Canopy job to prime kMeans canopies
>     CanopyDriver.runJob(svd.getRowPath(), output, measure, 8, 4, 
> false, false);
>
> Reading up on eigendecomposition, it looks like (DATA ~= SVD D SVD') 
> would be more like it. But the solver only outputs the eigenvectors 
> and it ignores the eigenvalues. So, I cannot construct D. Can you 
> point me back towards the right path? It has been so long since my 
> grad school advanced matrices course.
>
> Isn't this related to spectral clustering?
>
>
> On 9/2/10 10:50 AM, Jake Mannix wrote:
>> Derek,
>>
>>    The step Jeff's referring to is that the SVD output is a set of 
>> vectors in
>> the "column space" of your original set of rows (your input matrix).  
>> If you
>> want to cluster your original data, projected onto this new SVD 
>> basis, you
>> need to matrix multiply your SVD matrix by your original data.  
>> Depending on
>> how big your data is (number of rows and columns and rank of the 
>> reduction),
>> you can do this in either one or two map-reduce passes.
>>
>>    If you need more detail, I can spell that out a little more 
>> directly.  It
>> should actually be not just explained in words, but coded into the 
>> examples,
>> now that I think of it... need. more. hours. in. day....
>>
>>    -jake
>