Posted to user@mahout.apache.org by Vincent Xue <xu...@gmail.com> on 2011/06/20 21:25:22 UTC

Running Iterative Recursive Least Squares

Hello everyone,

I have been using the Mahout library to implement 'Iterative Recursive
Least Squares' (IRLS) for some analysis on biological microarrays. In my
setup, I have to perform IRLS about 54,000 times on different
probesets, and each job involves multiplying large sparse matrices and
solving linear systems via conjugate gradient. I am using Mahout
because my matrices are sparse but large (500,000 x 30,000).
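
To make the structure concrete, here is a toy sketch of the loop I
mean. This is not my actual Mahout code, just the shape of IRLS on a
trivial 2-parameter fit with an L1-style reweighting rule; in the real
job the inner solve is a conjugate gradient on a 30,000 x 30,000
system.

  // Toy sketch of the IRLS structure only (dense arrays, tiny problem).
  public class IrlsSketch {
      public static void main(String[] args) {
          double[][] x = {{1, 0.5}, {1, 1.5}, {1, 2.0}, {1, 3.5}}; // design matrix
          double[] y = {1.0, 2.1, 2.9, 10.0};                      // last point is an outlier
          double[] w = {1, 1, 1, 1};                               // diagonal of W
          double[] beta = new double[2];
          for (int iter = 0; iter < 10; iter++) {
              // Solve the weighted normal equations (X^T W X) beta = X^T W y.
              beta = weightedLeastSquares(x, y, w);
              // Re-estimate weights from the residuals (L1-style: w = 1/|r|).
              for (int i = 0; i < y.length; i++) {
                  double r = Math.abs(y[i] - x[i][0] * beta[0] - x[i][1] * beta[1]);
                  w[i] = 1.0 / Math.max(r, 1e-6);
              }
          }
          System.out.println(beta[0] + " + " + beta[1] + " * t");
      }

      // Direct 2x2 solve of the weighted normal equations; in my real job
      // this step is the conjugate gradient solve that dominates the runtime.
      static double[] weightedLeastSquares(double[][] x, double[] y, double[] w) {
          double a00 = 0, a01 = 0, a11 = 0, b0 = 0, b1 = 0;
          for (int i = 0; i < y.length; i++) {
              a00 += w[i] * x[i][0] * x[i][0];
              a01 += w[i] * x[i][0] * x[i][1];
              a11 += w[i] * x[i][1] * x[i][1];
              b0  += w[i] * x[i][0] * y[i];
              b1  += w[i] * x[i][1] * y[i];
          }
          double det = a00 * a11 - a01 * a01;
          return new double[] {(a11 * b0 - a01 * b1) / det,
                               (a00 * b1 - a01 * b0) / det};
      }
  }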

The problem I have now is that each IRLS job takes about 48 hours to
complete, running on a small cluster of 75 nodes with 16 cores each. I
can run about 200 jobs at the same time, but this is not very helpful
considering I have 54,000 jobs to process.

Using the Mahout library, the matrix multiplication is greatly sped up
(less than 5 minutes for each product). However, solving a linear
system with conjugate gradient is time-consuming and makes up the bulk
of the computation. Since IRLS calls for several iterations, this cost
is multiplied by however many iterations I run.

Given these issues, I hope someone can help me find a solution. My
main concerns are:

1) Is using Mahout for this computation the correct approach? (I have
tried running this in R, and the matrix multiplication step alone
would take days, if the data could fit in memory at all.)
2) When running IRLS, even with 200 jobs (or maps) running, the CPU
usage on each node barely goes above 5%. How can I use more CPU?

Thank you for your help
Vincent

Re: Running Iterative Recursive Least Squares

Posted by Ted Dunning <te...@gmail.com>.
With LSMR, you can pass in an abstract multiply function.  This avoids
the fill-in that comes from forming X^T * W * X explicitly.

Thus, you would only need to store the 1.4 + 0.5 million non-zeros of
X and W.  Each one takes about 12 bytes, so this is about 24MB (i.e.
tiny).  This trades a bit of extra CPU time for a much smaller memory
footprint.

Your other option is to form the product explicitly: 20% x 30,000^2 =
20% x 900M elements = about 180M non-zero elements = about 2.2GB per
matrix.

Option 1 would let you have lots of slots on each machine.  Option 2
would limit you to 10 or fewer live slots, thus limiting your ability
to saturate the CPU.
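
Concretely, Option 1 means keeping X and the diagonal of W in sparse
form and answering each request for (X^T W X) v with three cheap
passes over X.  Here is a bare-bones sketch of that operator in plain
Java with a toy CSR layout; it is just to show the idea, not the
Mahout or LSMR API:

  public class ImplicitNormalOp {
      final int rows, cols;
      final int[] rowPtr, colIdx;   // CSR structure of X
      final double[] vals;          // non-zero values of X
      final double[] w;             // diagonal of W, length = rows

      ImplicitNormalOp(int rows, int cols, int[] rowPtr, int[] colIdx,
                       double[] vals, double[] w) {
          this.rows = rows; this.cols = cols;
          this.rowPtr = rowPtr; this.colIdx = colIdx;
          this.vals = vals; this.w = w;
      }

      // Returns X^T * diag(w) * X * v without forming the product; work and
      // memory are both O(number of non-zeros in X).
      double[] apply(double[] v) {
          double[] xv = new double[rows];                     // xv = X v
          for (int i = 0; i < rows; i++)
              for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                  xv[i] += vals[k] * v[colIdx[k]];
          double[] out = new double[cols];                    // out = X^T (W xv)
          for (int i = 0; i < rows; i++) {
              double wi = w[i] * xv[i];
              for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                  out[colIdx[k]] += vals[k] * wi;
          }
          return out;
      }

      public static void main(String[] args) {
          // Tiny example: X = [[1,0],[2,3],[0,4]], W = diag(1, 0.5, 2)
          ImplicitNormalOp op = new ImplicitNormalOp(3, 2,
              new int[]{0, 1, 3, 4}, new int[]{0, 0, 1, 1},
              new double[]{1, 2, 3, 4}, new double[]{1, 0.5, 2});
          double[] r = op.apply(new double[]{1, 1});
          System.out.println(r[0] + ", " + r[1]);             // prints 6.0, 39.5
      }
  }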

Re: Running Iterative Recursive Least Squares

Posted by Vincent Xue <xu...@gmail.com>.
Hi Danny, thanks for your suggestion. I will take a look.

To answer your questions, Ted:

Each slave node has 32GB of memory. I will look into the LSMR
implementation; hopefully it will prove to be faster than the
conjugate gradient solver.

For the matrix multiplication part of the algorithm, each matrix
multiplied is either the design matrix "X" (1.4M non-zero elements) or
a diagonal weight matrix "W" (0.5M non-zero elements). The input to
the conjugate gradient solver is the product X^T * W * X, a symmetric
matrix with about 20% of its entries non-zero.

Vincent

Re: Running Iterative Recursive Least Squares

Posted by Ted Dunning <te...@gmail.com>.
Sounds like you should invert your loops.

These sparse matrices are probably very reasonable to solve one at a
time on a single machine in memory.  As such, take a look at the LSMR
implementation, which is a good implementation of a conjugate
gradient-like algorithm that plays nicely with sparse data.  It could
probably even be used in a re-weighting scheme where the matrix being
solved varies in each iteration (at a serious cost in how easy the
algorithm is to understand).

What I would suggest is that you define your mapper so that each
mapper solves one of your LS problems.  You have roughly 50,000 of
them, so that will provide plenty of parallelism.  LSMR works out of
core if you have to, but it really sounds like you should be able to
handle several in memory at the same time.
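
Something like the skeleton below is what I have in mind: one input
record per probeset, and the whole IRLS fit happens inside a single
map() call on one node.  The Hadoop Mapper API here is real; the class
name and the Problem / loadProblem() / solveIrls() pieces are
placeholders for your own code, not anything in Mahout or Hadoop.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // One probeset per map(): read an id, pull that probeset's data, run the
  // entire IRLS loop locally, and emit the fitted coefficients.
  public class OneProbesetPerMapMapper
          extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
          String probesetId = line.toString().trim();

          // No cluster-wide matrix job here; the solve is node-local, so
          // 54,000 of these can run side by side across the cluster.
          Problem p = loadProblem(probesetId);
          double[] beta = solveIrls(p);

          context.write(new Text(probesetId),
                        new Text(java.util.Arrays.toString(beta)));
      }

      // ---- placeholders so the skeleton compiles; swap in your real code ----
      static class Problem { }
      private Problem loadProblem(String id) { return new Problem(); }
      private double[] solveIrls(Problem p) { return new double[0]; }
  }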

Can you say more about your memory on each node and how many non-zero
elements are in your matrices?
