Posted to user@mahout.apache.org by Vincent Xue <xu...@gmail.com> on 2011/05/06 15:01:42 UTC

Transposing a matrix is limited by how large a node is.

Dear Mahout Users,

I am using Mahout-0.5-SNAPSHOT to transpose a dense matrix of 55000 x 31000.
My matrix is stored on HDFS as a
SequenceFile<IntWritable,VectorWritable>, consuming just about 13 GB. When I
run the transpose function on my matrix, the function falls over during the
reduce phase. On closer inspection, I noticed that I was receiving the
following error:

FSError: java.io.IOException: No space left on device

I thought this was not possible considering that I was only using 15% of the
2.5 TB in the cluster, but when I closely monitored the disk space, it was
true that the 40 GB hard drive on the node was running out of space.
Unfortunately, all of my nodes are limited to 40 GB and I have not been
successful in transposing my matrix.
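
For reference, the transpose step I am running looks roughly like the sketch
below. (The DistributedRowMatrix usage is written from memory, so treat the
exact constructor and method names as assumptions; the paths are made up.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

public class TransposeSketch {
  public static void main(String[] args) throws Exception {
    // 55000 rows x 31000 columns, stored as SequenceFile<IntWritable,VectorWritable>
    DistributedRowMatrix a = new DistributedRowMatrix(
        new Path("/data/matrix"),      // hypothetical input path
        new Path("/tmp/matrix-work"),  // scratch dir for intermediate output
        55000, 31000);
    a.setConf(new Configuration());

    // Launches the map/reduce transpose job; the reduce phase is where the
    // "No space left on device" error shows up on my 40 GB nodes.
    DistributedRowMatrix aTransposed = a.transpose();
    System.out.println("transpose written under " + aTransposed.getRowPath());
  }
}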

From this observation, I would like to know if there is any alternative
method to transpose my matrix or if there is something I am missing?

Thanks,
Vincent

Re: Transposing a matrix is limited by how large a node is.

Posted by Vincent Xue <xu...@gmail.com>.
The Lanczos implementation of SVD worked very well with my dense
matrix. I ran several iterations to confirm that I had the top 3
eigenvectors of my matrix and used these vectors to visualize the top
principal components of my data.
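
(For completeness, and assuming the data have been mean-centered, the
principal-component scores I plot are just projections of the rows onto those
eigenvectors: if v_1, v_2, v_3 are the top eigenvectors of A' A, i.e. the top
right singular vectors of A, then the i-th score vector is A v_i.)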

As for the transpose code, I believe that the last part of the code
could benefit from some feedback. In my implementation I am spawning
multiple jobs, for as many splits as needed, so that a single node
will not run out of disk space. The last step calls for a sequential
combination of the pieces into one sequence file, which is probably a
bad approach. I am sequentially combining the pieces because I want to
use the output in other Mahout jobs.
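
Concretely, the combine step is a single-threaded copy of each piece into one
SequenceFile, along the lines of the sketch below (the paths and the piece
glob are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public class MergePieces {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path merged = new Path("/data/matrix-transpose/merged");  // hypothetical output
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, merged, IntWritable.class, VectorWritable.class);

    IntWritable row = new IntWritable();
    VectorWritable vec = new VectorWritable();
    // Each piece is itself a SequenceFile<IntWritable,VectorWritable>.
    for (FileStatus piece : fs.globStatus(new Path("/data/matrix-transpose/piece-*"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, piece.getPath(), conf);
      while (reader.next(row, vec)) {
        writer.append(row, vec);  // sequential copy: correct but slow
      }
      reader.close();
    }
    writer.close();
  }
}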

Instead of running this slow process, I was thinking that it would be
better to keep the output in separate large chunks and perform
further jobs with Hadoop's MultiFileInputFormat. The problem with this,
however, is that once a matrix is split, I do not know of any way to use
the split sequence files in other Mahout jobs, other than writing
dedicated Java code that specifies the multiple input files for the job.
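
By "dedicated Java code" I mean something along these lines (a sketch using
the old mapred API; the helper class is illustrative, not something that
exists in Mahout):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class MultiChunkJobSetup {
  // Registers every chunk of the split matrix as an input path,
  // instead of first merging the chunks into one SequenceFile.
  public static JobConf configure(JobConf conf, String... chunkDirs) {
    conf.setInputFormat(SequenceFileInputFormat.class);
    for (String dir : chunkDirs) {
      FileInputFormat.addInputPath(conf, new Path(dir));
    }
    return conf;
  }
}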

My questions are:
What would be the preferred way of storing large matrices, or even
files on the HDFS?
Is it efficient to perform many small mapred jobs on the same matrix?
(considering that jobs are moving and the data isn't)

-Vincent

On Fri, May 6, 2011 at 4:18 PM, Ted Dunning <te...@gmail.com> wrote:
>
> If you have the code and would like to contribute it, file a JIRA and attach
> a patch.
>
> It will be interesting to hear how the SVD proceeds.  Such a large dense
> matrix is an unusual target for SVD.
>
> Also, it is possible to adapt the R version of random projection to never
> keep all of the large matrix in memory.  Instead, only slices of the matrix
> are kept and the multiplications involved are done progressively.  The
> results are kept in memory, but not the large matrix.  This would probably
> make your sequential version fast enough to use.  R may not be usable unless
> it can read the portions of your large matrix quickly using binary I/O.
>
> Also, I suspect that you are trying to get the transpose in order to
> decompose A' A.  This is not necessary as far as I can tell since you can
> simply decompose A and use that to compute the decomposition of A' A even
> faster than you can compute the decomposition of A itself.
>
> On Fri, May 6, 2011 at 7:36 AM, Vincent Xue <xu...@gmail.com> wrote:
>
> > Because I am limited by my resources, I  coded up a slower but effective
> > implementation of the transpose job that I could share. It avoids loading
> > all the data on to one node by transposing the matrix in pieces. The
> > slowest
> > part of this is combining the pieces back to one matrix. :(
> >

Re: Transposing a matrix is limited by how large a node is.

Posted by Ted Dunning <te...@gmail.com>.
If you have the code and would like to contribute it, file a JIRA and attach
a patch.

It will be interesting to hear how the SVD proceeds.  Such a large dense
matrix is an unusual target for SVD.

Also, it is possible to adapt the R version of random projection to never
keep all of the large matrix in memory.  Instead, only slices of the matrix
are kept and the multiplications involved are done progressively.  The
results are kept in memory, but not the large matrix.  This would probably
make your sequential version fast enough to use.  R may not be usable unless
it can read the portions of your large matrix quickly using binary I/O.
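
Roughly, the point is that you never need A in memory, only the small sketch
Y = A * Omega.  A toy Java analogue that streams your SequenceFile row by row
(class name, path handling and the naive loops are illustrative only):

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class StreamingProjection {

  // Computes Y = A * Omega without holding A in memory: A is streamed one
  // row at a time, and only Omega (numCols x k) and Y (numRows x k) are kept.
  public static double[][] project(Path matrix, int numRows, int numCols, int k)
      throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Random rnd = new Random(42);
    double[][] omega = new double[numCols][k];
    for (int j = 0; j < numCols; j++) {
      for (int c = 0; c < k; c++) {
        omega[j][c] = rnd.nextGaussian();
      }
    }

    double[][] y = new double[numRows][k];
    IntWritable row = new IntWritable();
    VectorWritable vec = new VectorWritable();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, matrix, conf);
    while (reader.next(row, vec)) {
      Vector a = vec.get();
      for (int c = 0; c < k; c++) {
        double sum = 0.0;
        for (int j = 0; j < numCols; j++) {
          sum += a.getQuick(j) * omega[j][c];
        }
        y[row.get()][c] = sum;
      }
    }
    reader.close();
    return y;  // feed Y into a small in-memory orthogonalization/SVD afterwards
  }
}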

Also, I suspect that you are trying to get the transpose in order to
decompose A' A.  This is not necessary as far as I can tell since you can
simply decompose A and use that to compute the decomposition of A' A even
faster than you can compute the decomposition of A itself.
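
In symbols (just the standard identity, nothing Mahout-specific): if
A = U S V' is the SVD of A, then

  A' A = V S U' U S V' = V S^2 V'

so the eigenvectors of A' A are the right singular vectors V of A, and its
eigenvalues are the squared singular values.  Decomposing A therefore gives
you the decomposition of A' A for free.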

On Fri, May 6, 2011 at 7:36 AM, Vincent Xue <xu...@gmail.com> wrote:

> Because I am limited by my resources, I  coded up a slower but effective
> implementation of the transpose job that I could share. It avoids loading
> all the data on to one node by transposing the matrix in pieces. The
> slowest
> part of this is combining the pieces back to one matrix. :(
>

Re: Transposing a matrix is limited by how large a node is.

Posted by Vincent Xue <xu...@gmail.com>.
Hi Jake,
As requested the stats from the job are listed below:

Counter                                       Map              Reduce  Total
Job Counters
  Launched reduce tasks                       0                0       2
  Rack-local map tasks                        0                0       69
  Launched map tasks                          0                0       194
  Data-local map tasks                        0                0       125
FileSystemCounters
  FILE_BYTES_READ                             66,655,795,630   0       66,655,795,630
  HDFS_BYTES_READ                             12,871,657,393   0       12,871,657,393
  FILE_BYTES_WRITTEN                          103,841,910,638  0       103,841,910,638
Map-Reduce Framework
  Combine output records                      0                0       0
  Map input records                           54,675           0       54,675
  Spilled Records                             4,720,084,588    0       4,720,084,588
  Map output bytes                            33,805,552,500   0       33,805,552,500
  Map input bytes                             12,804,666,825   0       12,804,666,825
  Map output records                          1,690,277,625    0       1,690,277,625
  Combine input records                       0                0       0

In response to your suggestion, I do have a server with lots of RAM, but I
would like to stick to having the files on HDFS.  As I am running a PCA
analysis, I would have to re-import the data into HDFS to run the SVD. (We
tried to run similar computations on a machine with >64 GB of RAM and the
previous R implementation crashed after a few days...)

Because I am limited by my resources, I coded up a slower but effective
implementation of the transpose job that I could share. It avoids loading
all the data onto one node by transposing the matrix in pieces. The slowest
part of this is combining the pieces back into one matrix. :(

-Vincent


On Fri, May 6, 2011 at 2:54 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> On Fri, May 6, 2011 at 6:01 AM, Vincent Xue <xu...@gmail.com> wrote:
>
> > Dear Mahout Users,
> >
> > I am using Mahout-0.5-SNAPSHOT to transpose a dense matrix of 55000 x
> > 31000.
> > My matrix is stored on HDFS as a
> > SequenceFile<IntWritable,VectorWritable>, consuming just about 13 GB.
> > When I run the transpose function on my matrix, the function falls over
> > during the reduce phase. On closer inspection, I noticed that I was
> > receiving the following error:
> >
> > FSError: java.io.IOException: No space left on device
> >
> > I thought this was not possible considering that I was only using 15% of
> > the 2.5 TB in the cluster but when I closely monitored the disk space,
> > it was true that the 40 GB hard drive on the node was running out of
> > space. Unfortunately, all of my nodes are limited to 40 GB and I have
> > not been successful in transposing my matrix.
> >
>
> Running HDFS with nodes with only 40GB of hard disk each is a recipe
> for disaster, IMO.  There are lots of temporary files created by
> map/reduce jobs, and working on an input file of size 13GB you're bound
> to run into this.
>
> Can you show us what your job tracker reports for HDFS_BYTES_WRITTEN
> (and other similar counters) during your job?
>
>
> > From this observation, I would like to know if there is any alternative
> > method to transpose my matrix or if there is something I am missing?
>
>
> Do you have a server with 26GB of RAM lying around somewhere?
> You could do it on one machine without hitting disk. :)
>
>  -jake

Re: Transposing a matrix is limited by how large a node is.

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, May 6, 2011 at 6:01 AM, Vincent Xue <xu...@gmail.com> wrote:

> Dear Mahout Users,
>
> I am using Mahout-0.5-SNAPSHOT to transpose a dense matrix of 55000 x
> 31000.
> My matrix is stored on HDFS as a
> SequenceFile<IntWritable,VectorWritable>, consuming just about 13 GB. When
> I run the transpose function on my matrix, the function falls over during
> the reduce phase. On closer inspection, I noticed that I was receiving the
> following error:
>
> FSError: java.io.IOException: No space left on device
>
> I thought this was not possible considering that I was only using 15% of
> the
> 2.5 TB in the cluster but when I closely monitored the disk space, it was
> true that the 40 GB hard drive on the node was running out of space.
> Unfortunately, all of my nodes are limited to 40 GB and I have not been
> successful in transposing my matrix.
>

Running HDFS with nodes with only 40GB of hard disk each is a recipe
for disaster, IMO.  There are lots of temporary files created by map/reduce
jobs, and working on an input file of size 13GB you're bound to run into
this.

Can you show us what your job tracker reports for HDFS_BYTES_WRITTEN
(and other similar counters) during your job?


> From this observation, I would like to know if there is any alternative
> method to transpose my matrix or if there is something I am missing?


Do you have a server with 26GB of RAM lying around somewhere?
You could do it on one machine without hitting disk. :)
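
Something like the sketch below, assuming the ~14 GB of doubles plus object
overhead fits in your heap.  (Paths are made up, and it pretends the whole
matrix sits in a single part file just to keep the sketch short.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class InMemoryTranspose {
  public static void main(String[] args) throws Exception {
    int numRows = 55000, numCols = 31000;
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // transposed[j][i] = original[i][j]; about 13.6 GB of doubles, all in RAM.
    double[][] transposed = new double[numCols][numRows];

    IntWritable key = new IntWritable();
    VectorWritable value = new VectorWritable();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("/data/matrix/part-00000"), conf);
    while (reader.next(key, value)) {
      Vector rowVec = value.get();
      int i = key.get();
      for (int j = 0; j < numCols; j++) {
        transposed[j][i] = rowVec.getQuick(j);
      }
    }
    reader.close();

    // Write the transpose back out as SequenceFile<IntWritable,VectorWritable>.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/matrix-transpose/part-00000"),
        IntWritable.class, VectorWritable.class);
    for (int j = 0; j < numCols; j++) {
      writer.append(new IntWritable(j), new VectorWritable(new DenseVector(transposed[j])));
    }
    writer.close();
  }
}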

  -jake