Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/07/06 15:09:40 UTC

SVD Memory Reqs

Anyone have guidelines on needed heap size when running SVD?  I've done a couple of fairly long runs on my single machine and keep running out of mem. fairly deep into the run.  Before I increase the heap size for the 4th time, I figured I'd see if it is even going to fit into memory at all.

My matrix is ~ 130,000 x 62,000 and I have 4GB total on my machine.  I'm running this locally for now as a first step in scaling it out.

Here's my command:  ./mahout svd -Dmapred.input.dir=/tmp/solr-clust-n2/part-out.vec --numCols 61892 --tempDir /tmp/solr-clust-n2-svd --rank 1000 --numRows 129444

Thanks,
Grant

Re: SVD Memory Reqs

Posted by Ted Dunning <te...@gmail.com>.
Yes.  It is.  (the number of non-zero singular values, that is)

Also, rank is the dimension of the space spanned by a matrix.  The rank of
the outer product of two vectors is 1 (except when one of them is zero).
 The rank of the sum of two independent matrices with ranks k_1 and k_2 is k_1 + k_2.
 Independent means that the spaces that they span share only the origin.
 You can view SVD as finding the most important dimensions of the span of a
matrix.  You can also view the SVD as a sum of rank one matrices formed by
the outer products of the corresponding left and right singular vectors.

Since the singular vectors are orthonormal, their outer products are
independent and thus the sum of k such outer products has rank k.

If you view SVD as compressing the information in a matrix, then the rank of
the result is a reasonable measure of how much information remains.
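
Written out (this is just the standard notation for the two views above), a matrix A with singular value decomposition

    A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}

is a sum of r rank-one outer products u_i v_i^{\top}, where r is the number of non-zero singular values \sigma_i; keeping only the k largest terms of this sum gives the usual rank-k approximation.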

On Tue, Jul 6, 2010 at 12:15 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Question: What exactly is the rank, anyway?  It's the number of singular
> values, right?
>

RE: DistributedRowMatrix.transpose().transpose() = Exception

Posted by Laszlo Dosa <la...@fredhopper.com>.
Hi Peter,

The problem was that the cardinality was incorrect.
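
For anyone who hits the same error, here is a minimal sketch of what a correct cardinality looks like, assuming the usual DistributedRowMatrix input of a SequenceFile mapping IntWritable row ids to VectorWritable rows (the class name below is made up for illustration):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class RowCardinalityExample {
  public static void main(String[] args) {
    // One row of a 14 x 31 matrix: the cardinality must equal --numCols (31),
    // so the largest legal column index is 30.
    Vector row = new RandomAccessSparseVector(31);
    row.set(0, 1.0);
    row.set(30, 2.5);   // fine: last legal column
    // row.set(31, 1.0) would throw org.apache.mahout.math.IndexException
  }
}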

Thanks,
Laszlo

-----Original Message-----
From: Peter M. Goldstein [mailto:peter_m_goldstein@yahoo.com] 
Sent: Wednesday, July 7, 2010 18:54
To: user@mahout.apache.org
Subject: RE: DistributedRowMatrix.transpose().transpose() = Exception

Hi Laszlo,

The exception message:

org.apache.mahout.math.IndexException: Index 31 is outside allowable range of [0,31]

is a little misleading, as it should actually read "allowable range of [0,30]" for this case, as the index is required to be strictly less than the size of the vector.  So somehow the column index is being populated at 31 (and possibly higher values) during the second transpose.

To me this suggests that there may be an off-by-one error introduced into the index during the transpose process.  This might not cause an error in the original transpose if the input data has zeroes in the right place.

So a few quick questions:

i) What's the structure of the input matrix?  
ii) Have you confirmed that the output of a single transpose is correct?

--Peter

-----Original Message-----
From: Laszlo Dosa [mailto:laszlo.dosa@fredhopper.com] 
Sent: Wednesday, July 07, 2010 6:39 AM
To: user@mahout.apache.org
Subject: DistributedRowMatrix.transpose().transpose() = Exception

Hi,

As far as I know, if I transpose a matrix twice I should get back the original matrix.

I tried to do this with DistributedRowMatrix (trunk version).  My sample matrix has 14 rows and 31 columns.

I got the following exception:
org.apache.mahout.math.IndexException: Index 31 is outside allowable range of [0,31]
                at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:324)
                at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
                at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:144)
                at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:1)
                at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
                at org.apache.hadoop.mapred.Child.main(Child.java:170)

Exception in thread "main" java.io.IOException: Job failed!
                at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1293)
                at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:153)
                at com.fredhopper.MatrixTransposeJob.run(MatrixTransposeJob.java:46)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                at com.fredhopper.MatrixTransposeJob.main(MatrixTransposeJob.java:52)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Has anyone had the same issue, or does anyone know how to solve it?

Thanks,
Laszlo

I ran:
hadoop jar matrix-transpose.jar \
com.fredhopper.MatrixTransposeJob \
-i input/ \
-o output/ \
--numRows 14 \
--numCols 31

My code is:
package com.fredhopper;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

public class MatrixTransposeJob extends AbstractJob {

  @SuppressWarnings("deprecation")
  @Override
  public int run(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    addInputOption();
    addOutputOption();
    addOption("numRows", "nr", "Number of rows of the input matrix");
    addOption("numCols", "nc", "Number of columns of the input matrix");

    Configuration originalConfig = getConf();

    Map<String, String> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    Path inputPath = getInputPath();
    Path outputPath = getOutputPath();
    int numRows = Integer.parseInt(parsedArgs.get("--numRows"));
    int numCols = Integer.parseInt(parsedArgs.get("--numCols"));

    // Wrap the input rows (SequenceFile of IntWritable -> VectorWritable) as a distributed matrix.
    DistributedRowMatrix matrix = new DistributedRowMatrix(inputPath, outputPath, numRows, numCols);
    JobConf conf = new JobConf(originalConfig);
    matrix.configure(conf);

    // Transpose twice; the second result should equal the original matrix.
    DistributedRowMatrix t1 = matrix.transpose();
    DistributedRowMatrix t2 = t1.transpose();

    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new MatrixTransposeJob(), args);
  }
}






RE: DistributedRowMatrix.transpose().transpose() = Exception

Posted by "Peter M. Goldstein" <pe...@yahoo.com>.
Hi Laszlo,

The exception message:

org.apache.mahout.math.IndexException: Index 31 is outside allowable range of [0,31]

is a little misleading, as it should actually read "allowable range of [0,30]" for this case, as the index is required to be strictly less than the size of the vector.  So somehow the column index is being populated at 31 (and possibly higher values) during the second transpose.
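
A quick way to see that bound outside the transpose job (illustrative only; this just exercises the same vector API, and the class name is made up):

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class IndexBoundDemo {
  public static void main(String[] args) {
    Vector v = new SequentialAccessSparseVector(31); // cardinality 31, valid indices 0..30
    v.set(30, 1.0);  // ok: last legal index
    v.set(31, 1.0);  // throws org.apache.mahout.math.IndexException
  }
}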

To me this suggests that there may be an off-by-one error introduced into the index during the transpose process.  This might not cause an error in the original transpose if the input data has zeroes in the right place.

So a few quick questions:

i) What's the structure of the input matrix?  
ii) Have you confirmed that the output of a single transpose is correct?

--Peter

-----Original Message-----
From: Laszlo Dosa [mailto:laszlo.dosa@fredhopper.com] 
Sent: Wednesday, July 07, 2010 6:39 AM
To: user@mahout.apache.org
Subject: DistributedRowMatrix.transpose().transpose() = Exception

Hi,

As far as I know, if I transpose a matrix twice I should get back the original matrix.

I tried to do this with DistributedRowMatrix (trunk version).  My sample matrix has 14 rows and 31 columns.

I got the following exception:
org.apache.mahout.math.IndexException: Index 31 is outside allowable range of [0,31]
                at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:324)
                at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
                at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:144)
                at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:1)
                at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
                at org.apache.hadoop.mapred.Child.main(Child.java:170)

Exception in thread "main" java.io.IOException: Job failed!
                at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1293)
                at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:153)
                at com.fredhopper.MatrixTransposeJob.run(MatrixTransposeJob.java:46)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                at com.fredhopper.MatrixTransposeJob.main(MatrixTransposeJob.java:52)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Has anyone had the same issue, or does anyone know how to solve it?

Thanks,
Laszlo

I ran:
hadoop jar matrix-transpose.jar \
com.fredhopper.MatrixTransposeJob \
-i input/ \
-o output/ \
--numRows 14 \
--numCols 31

My code is:
package com.fredhopper;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

public class MatrixTransposeJob extends AbstractJob {

  @SuppressWarnings("deprecation")
  @Override
  public int run(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    addInputOption();
    addOutputOption();
    addOption("numRows", "nr", "Number of rows of the input matrix");
    addOption("numCols", "nc", "Number of columns of the input matrix");

    Configuration originalConfig = getConf();

    Map<String, String> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    Path inputPath = getInputPath();
    Path outputPath = getOutputPath();
    int numRows = Integer.parseInt(parsedArgs.get("--numRows"));
    int numCols = Integer.parseInt(parsedArgs.get("--numCols"));

    // Wrap the input rows (SequenceFile of IntWritable -> VectorWritable) as a distributed matrix.
    DistributedRowMatrix matrix = new DistributedRowMatrix(inputPath, outputPath, numRows, numCols);
    JobConf conf = new JobConf(originalConfig);
    matrix.configure(conf);

    // Transpose twice; the second result should equal the original matrix.
    DistributedRowMatrix t1 = matrix.transpose();
    DistributedRowMatrix t2 = t1.transpose();

    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new MatrixTransposeJob(), args);
  }
}





DistributedRowMatrix.transpose().transpose() = Exception

Posted by Laszlo Dosa <la...@fredhopper.com>.
Hi,

As far as I know, if I transpose a matrix twice I should get back the original matrix.
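
(For reference, the identity being tested is simply (A^{\top})^{\top} = A for any m x n matrix A, so the double transpose should reproduce the input exactly.)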

I tried to do this with DistributedRowMatrix (trunk version).  My sample matrix has 14 rows and 31 columns.

I got the following exception:
org.apache.mahout.math.IndexException: Index 31 is outside allowable range of [0,31]
                at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:324)
                at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
                at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:144)
                at org.apache.mahout.math.hadoop.TransposeJob$TransposeReducer.reduce(TransposeJob.java:1)
                at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
                at org.apache.hadoop.mapred.Child.main(Child.java:170)

Exception in thread "main" java.io.IOException: Job failed!
                at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1293)
                at org.apache.mahout.math.hadoop.DistributedRowMatrix.transpose(DistributedRowMatrix.java:153)
                at com.fredhopper.MatrixTransposeJob.run(MatrixTransposeJob.java:46)
                at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
                at com.fredhopper.MatrixTransposeJob.main(MatrixTransposeJob.java:52)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                at java.lang.reflect.Method.invoke(Method.java:597)
                at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Has anyone had the same issue, or does anyone know how to solve it?

Thanks,
Laszlo

I ran:
hadoop jar matrix-transpose.jar \
com.fredhopper.MatrixTransposeJob \
-i input/ \
-o output/ \
--numRows 14 \
--numCols 31

My code is:
package com.fredhopper;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

public class MatrixTransposeJob extends AbstractJob {

  @SuppressWarnings("deprecation")
  @Override
  public int run(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    addInputOption();
    addOutputOption();
    addOption("numRows", "nr", "Number of rows of the input matrix");
    addOption("numCols", "nc", "Number of columns of the input matrix");

    Configuration originalConfig = getConf();

    Map<String, String> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    Path inputPath = getInputPath();
    Path outputPath = getOutputPath();
    int numRows = Integer.parseInt(parsedArgs.get("--numRows"));
    int numCols = Integer.parseInt(parsedArgs.get("--numCols"));

    // Wrap the input rows (SequenceFile of IntWritable -> VectorWritable) as a distributed matrix.
    DistributedRowMatrix matrix = new DistributedRowMatrix(inputPath, outputPath, numRows, numCols);
    JobConf conf = new JobConf(originalConfig);
    matrix.configure(conf);

    // Transpose twice; the second result should equal the original matrix.
    DistributedRowMatrix t1 = matrix.transpose();
    DistributedRowMatrix t2 = t1.transpose();

    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new MatrixTransposeJob(), args);
  }
}




Re: SVD Memory Reqs

Posted by Ted Dunning <te...@gmail.com>.
My unsubstantiated guess is that most of these could actually be replaced
with random vectors with no impact.  All of the studies I have seen that
measure how many singular vectors are necessary change the dimensionality as
they test different numbers.  I think it would be better to keep the
dimensionality constant and just change how many vectors are actually
singular vectors and how many are random.
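
A rough sketch of that experiment (my construction, not anything in Mahout; the helper below is hypothetical): fix the reduced dimension d, take the top k true singular vectors, and pad the basis out to d columns with random unit vectors, then vary k while d stays constant.

import java.util.Random;

public class MixedBasisSketch {
  /**
   * Returns an n x d basis whose first k columns are the supplied singular
   * vectors and whose remaining d - k columns are random unit vectors.
   */
  public static double[][] mixedBasis(double[][] singularVectors, int d, long seed) {
    int n = singularVectors.length;
    int k = singularVectors[0].length;
    Random rnd = new Random(seed);
    double[][] basis = new double[n][d];
    for (int i = 0; i < n; i++) {
      // copy the k true singular vectors into the first k columns
      System.arraycopy(singularVectors[i], 0, basis[i], 0, k);
    }
    for (int j = k; j < d; j++) {
      double[] col = new double[n];
      double norm = 0.0;
      for (int i = 0; i < n; i++) {
        col[i] = rnd.nextGaussian();
        norm += col[i] * col[i];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) {
        basis[i][j] = col[i] / norm;   // normalize so each random column has unit length
      }
    }
    return basis;
  }
}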



On Tue, Jul 6, 2010 at 2:27 PM, Jake Mannix <ja...@gmail.com> wrote:

> My rule of thumb has been that for text type stuff (i.e. LSI/LSA),
> something
> around 200-400 is the most you'll ever need.  For smaller corpora and/or
> vocabularies, even below the bottom end of this range is fine
>

Re: SVD Memory Reqs

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jul 6, 2010 at 9:15 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jul 6, 2010, at 12:46 PM, Ted Dunning wrote:
>
> > Computing 1000 singular vectors is generally neither necessary nor
> helpful.
>
> OK, good to know.  This is my first time ever running SVD, so I have no
> clue what a useful number is for the rank value.  Advice welcome here.


My rule of thumb has been that for text type stuff (i.e. LSI/LSA), something
around 200-400 is the most you'll ever need.  For smaller corpora and/or
vocabularies, even below the bottom end of this range is fine.

  -jake

Re: SVD Memory Reqs

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 6, 2010, at 12:46 PM, Ted Dunning wrote:

> Computing 1000 singular vectors is generally neither necessary nor helpful.

OK, good to know.  This is my first time ever running SVD, so I have no clue what a useful number is for the rank value.  Advice welcome here.  

Question: What exactly is the rank, anyway?  It's the number of singular values, right?

> After just a few dozen, the noise in the system dominates and you are
> essentially just generating very fancy random numbers.  Also, the total
> memory required in the last steps of the SVD is proportional to either
> number of columns or number of rows in your original matrix times the number
> of singular vectors you are producing.
> 
> Try scaling up the rank option from a small number first before blowing out
> your memory requirements.

OK, will do.

> 
> On Tue, Jul 6, 2010 at 6:09 AM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> Anyone have guidelines on needed heap size when running SVD?  I've done a
>> couple of fairly long runs on my single machine and keep running out of mem.
>> fairly deep into the run.  Before I increase the heap size for the 4th time,
>> I figured I'd see if it is even going to fit into memory at all.
>> 
>> My matrix is ~ 130,000 x 62,000 and I have 4GB total on my machine.  I'm
>> running this locally for now as a first step in scaling it out.
>> 
>> Here's my command:  ./mahout svd
>> -Dmapred.input.dir=/tmp/solr-clust-n2/part-out.vec --numCols 61892 --tempDir
>> /tmp/solr-clust-n2-svd --rank 1000 --numRows 129444
>> 
>> Thanks,
>> Grant



Re: SVD Memory Reqs

Posted by Ted Dunning <te...@gmail.com>.
Computing 1000 singular vectors is generally neither necessary nor helpful.
 After just a few dozen, the noise in the system dominates and you are
essentially just generating very fancy random numbers.  Also, the total
memory required in the last steps of the SVD is proportional to either
number of columns or number of rows in your original matrix times the number
of singular vectors you are producing.

Try scaling up the rank option from a small number first before blowing out
your memory requirements.
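
Concretely, that would mean re-running the command from the first mail with a much smaller rank and only raising it if the results look too coarse (the value 100 below is just an illustration, not a recommendation):

./mahout svd -Dmapred.input.dir=/tmp/solr-clust-n2/part-out.vec --numCols 61892 --tempDir /tmp/solr-clust-n2-svd --rank 100 --numRows 129444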

On Tue, Jul 6, 2010 at 6:09 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Anyone have guidelines on needed heap size when running SVD?  I've done a
> couple of fairly long runs on my single machine and keep running out of mem.
> fairly deep into the run.  Before I increase the heap size for the 4th time,
> I figured I'd see if it is even going to fit into memory at all.
>
> My matrix is ~ 130,000 x 62,000 and I have 4GB total on my machine.  I'm
> running this locally for now as a first step in scaling it out.
>
> Here's my command:  ./mahout svd
> -Dmapred.input.dir=/tmp/solr-clust-n2/part-out.vec --numCols 61892 --tempDir
> /tmp/solr-clust-n2-svd --rank 1000 --numRows 129444
>
> Thanks,
> Grant

Re: SVD Memory Reqs

Posted by Ted Dunning <te...@gmail.com>.
On a 4GB machine, that could be a bunch.

On Tue, Jul 6, 2010 at 2:23 PM, Jake Mannix <ja...@gmail.com> wrote:

> In your case, this
> would still be a fairly modest value, like 62k * 16k = 1GB.
>

Re: SVD Memory Reqs

Posted by Jake Mannix <ja...@gmail.com>.
It should also be noted, that MAHOUT-308 should lower this requirement by
quite a bit.

  -jake

On Tue, Jul 6, 2010 at 11:23 PM, Jake Mannix <ja...@gmail.com> wrote:

> In general, the current SVD impl requires, on the driving machine (i.e. not
> on the HDFS cluster), at least 2 * rank * numCols * 8 bytes.  In your case,
> this would still be a fairly modest value, like 62k * 16k = 1GB.
>
>   -jake
>
> On Tue, Jul 6, 2010 at 3:09 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> Anyone have guidelines on needed heap size when running SVD?  I've done a
>> couple of fairly long runs on my single machine and keep running out of mem.
>> fairly deep into the run.  Before I increase the heap size for the 4th time,
>> I figured I'd see if it is even going to fit into memory at all.
>>
>> My matrix is ~ 130,000 x 62,000 and I have 4GB total on my machine.  I'm
>> running this locally for now as a first step in scaling it out.
>>
>> Here's my command:  ./mahout svd
>> -Dmapred.input.dir=/tmp/solr-clust-n2/part-out.vec --numCols 61892 --tempDir
>> /tmp/solr-clust-n2-svd --rank 1000 --numRows 129444
>>
>> Thanks,
>> Grant
>
>
>

Re: SVD Memory Reqs

Posted by Jake Mannix <ja...@gmail.com>.
In general, the current SVD impl requires, on the driving machine (i.e. not on
the HDFS cluster), at least 2 * rank * numCols * 8 bytes.  In your case, this
would still be a fairly modest value, like 62k * 16k = 1GB.
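
Plugging Grant's numbers into that bound as a back-of-the-envelope check (just arithmetic on the formula above; the class name is made up):

public class SvdDriverHeapEstimate {
  public static void main(String[] args) {
    long rank = 1000;      // --rank from Grant's command (quoted below)
    long numCols = 61892;  // --numCols from Grant's command
    long bytes = 2L * rank * numCols * 8L;   // 2 * rank * numCols * 8 bytes
    System.out.printf("~%.2f GB of driver-side heap%n", bytes / (1024.0 * 1024 * 1024));
    // prints roughly 0.92 GB, i.e. the ~1GB figure above
  }
}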

  -jake

On Tue, Jul 6, 2010 at 3:09 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Anyone have guidelines on needed heap size when running SVD?  I've done a
> couple of fairly long runs on my single machine and keep running out of mem.
> fairly deep into the run.  Before I increase the heap size for the 4th time,
> I figured I'd see if it is even going to fit into memory at all.
>
> My matrix is ~ 130,000 x 62,000 and I have 4GB total on my machine.  I'm
> running this locally for now as a first step in scaling it out.
>
> Here's my command:  ./mahout svd
> -Dmapred.input.dir=/tmp/solr-clust-n2/part-out.vec --numCols 61892 --tempDir
> /tmp/solr-clust-n2-svd --rank 1000 --numRows 129444
>
> Thanks,
> Grant