Posted to user@mahout.apache.org by Vimal Mathew <vm...@gmail.com> on 2010/04/11 22:27:26 UTC

Current state of (dense) matrix multiplication?

Hi,
 What's the current state of matrix-matrix multiplication in Mahout?
Are there any performance results available for large matrices?

 I have been working on a Hadoop-compatible distributed storage for
matrices. I can currently multiply two 40K x 40K dense double
precision matrices in around 1 hour using 34 systems (16GB RAM, two
Core2Quads per node). I was wondering how this compares with Mahout.

Regards,
 Vimal

Re: Current state of (dense) matrix multiplication?

Posted by Vimal Mathew <vm...@gmail.com>.
In my opinion the most important take-away from MapReduce is that you
don't move the data, but move the "computation towards the data". So I
implemented a persistent storage layer and moved most of the critical
computations directly into the storage (I did play around with Hama
before this).

Right now I don't have any problem keeping all 8 cores occupied. I
think MapReduce helps reduce the network IOPS because you have data
stored locally at the node. Even in matrix multiplication, where I need
to duplicate one table across all the nodes, I get this done in
around 10 to 15 minutes and then follow it with 45 minutes of
computation. With 34 systems, I get a combined network transfer rate
of around 5.8 gigabits per second on a gigabit LAN. After all,
individual links in a switched network can transmit independently. I
noticed that network speeds scale faster than linearly if I am lucky
enough to schedule transfers properly (I use a simple random
permutation on each node to decide the sequence of network transfers).
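The per-node random-permutation scheduling mentioned in the parenthetical above can be sketched as follows; the node numbering and per-node seeding are my assumptions, not details from the thread.

```python
import random

def transfer_order(node_id, num_nodes):
    """Order in which this node contacts its peers for transfers.

    Each node shuffles the peer list independently, so at any given
    moment the in-flight transfers tend to hit distinct links of the
    switched network rather than all converging on one node.
    """
    peers = [n for n in range(num_nodes) if n != node_id]
    random.Random(node_id).shuffle(peers)  # deterministic per node
    return peers
```

With 34 systems, each node would work through its own shuffled list of 33 peers.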

Scheduling the IOPS carefully makes a big difference in a distributed
system, but I still have disk access as a bottleneck.

I currently have all matrices stored on SAS disks (15k rpm). I have
noticed that reading a single file from multiple threads makes a big
difference on these disks (unlike normal 7,200 rpm disks). I/O
devices have become a lot faster, but it does make programming a lot
harder.
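A sketch of that multi-threaded single-file read, assuming positional reads (os.pread) and an even byte split; the thread count and chunking are illustrative, not the author's actual code.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_chunks(path, num_threads=4):
    """Read one file with several threads, each pread()ing its own
    byte range, so the OS and drive can overlap and reorder requests."""
    size = os.path.getsize(path)
    step = (size + num_threads - 1) // num_threads  # ceiling division
    fd = os.open(path, os.O_RDONLY)
    try:
        def read_range(i):
            offset = i * step
            # clamp the count so the last chunk stops at end-of-file
            return os.pread(fd, max(0, min(step, size - offset)), offset)
        with ThreadPoolExecutor(num_threads) as pool:
            parts = list(pool.map(read_range, range(num_threads)))
    finally:
        os.close(fd)
    return b"".join(parts)
```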

I am using a fairly naive matrix multiplication algorithm: A = B * C,
where each row of A is a scaled sum of rows of C. I found this
reasonably easy to parallelize, and the access patterns have good
cache locality.
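In single-node form, that row-oriented scheme looks roughly like this (a plain-Python sketch of the access pattern, not the distributed implementation):

```python
def matmul_row_sums(B, C):
    """A = B * C, built row by row: row i of A is the sum over k of
    B[i][k] times row k of C.  Both A and C are walked along whole
    rows, which keeps the inner loop cache-friendly."""
    n, inner, p = len(B), len(C), len(C[0])
    A = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_a = A[i]
        for k in range(inner):
            b_ik = B[i][k]
            row_c = C[k]
            for j in range(p):
                row_a[j] += b_ik * row_c[j]  # scale row k of C into row i of A
    return A
```

Row i of A and row k of C are both traversed left to right in the inner loop, which is what gives the locality noted above.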


On Mon, Apr 12, 2010 at 2:23 AM, Ted Dunning <te...@gmail.com> wrote:
> Without outrageous magic FFT algorithms, nxn times nxn matrix multiply
> requires k_1 n^3 flops and at least k_2 n^2 I/O operations.  If you can
> arrange to re-use every operand a LOT, then you only have to do 2 n^2 IOPS,
> but if you fail in that, then I/O tends toward n^3 as well.  If you do
> manage to get sufficient re-use, then the fact that FLOPS are faster than
> IOPS lets us be CPU-bound.  In a single machine, IOPS are pretty fast, so
> you can saturate the CPU more easily than in other settings.
>
> For map-reduce on multiple machines, the IOPS/FLOPS ratio becomes evil and
> keeping the CPU loaded becomes much harder.   In general, it is hard to
> transfer even 50MBytes/s into a single mapper which is only 5M floating
> point numbers per second.  At a perfect 40K FLOPS per number, this is only
> 200MFlops per CPU (with several cores!!).  Well tuned dense multiply
> routines should crank out about 2-3GFlops per core.  Thus, for an 8 core
> machine, we have a big deficiency to overcome before Hadoop style map-reduce
> is practical for this.
>
> For lots of algorithms, this gets even worse.  The convergence rate of
> on-line gradient suffers badly with any kind of delay between computation of
> the gradient and its application to the model.  Batching is a kind of
> delay.  Moreover, a single CPU is often entirely sufficient to saturate I/O
> devices.  This means that parallelizing the algorithm gives almost no wall
> clock convergence speedup at all even though all CPU's might be very, very
> busy.
>
> On Sun, Apr 11, 2010 at 9:27 PM, Steven Buss <st...@gmail.com> wrote:
>
>> If you're just doing matrix multiplication, I would advise that Mahout
>> (or any MapReduce approach) isn't well suited to your problem. I did
>> the same computation with MATLAB (multiplying two 40k x 40k random
>> double precision dense matrices) using 14 cores and about 36GB of RAM
>> on a single machine* and it finished in about 55 minutes. If I'm
>> reading your email correctly, you were working with 34*2*4=272 cores!
>> I'm not sure if dense matrix multiplication can actually be
>> efficiently mapreduced, but I am still a rookie so don't take my word
>> for it.
>>
>> *The machine I am working on has 8 dual-core AMD Opteron 875s @ 2.2GHz
>> per core, with 64GB total system memory.
>>
>> Steven Buss
>> steven.buss@gmail.com
>> http://www.stevenbuss.com/
>>
>>
>>
>> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Vimal,
>> >
>> > We don't have any distributed dense multiplication operations because we
>> > have not yet found much application demand for distributed dense matrix
>> > multiplication.  Distributed sparse matrix operations are a big deal,
>> > however.
>> >
>> > If you are interested in working on the problem in the context of Mahout,
>> we
>> > would love to help.  This is especially true if you have an application
>> that
>> > needs dense operations and could benefit from some of the other
>> capabilities
>> > in Mahout.
>> >
>> > On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vm...@gmail.com>
>> wrote:
>> >
>> >> Hi,
>> >>  What's the current state of matrix-matrix multiplication in Mahout?
>> >> Are there any performance results available for large matrices?
>> >>
>> >>  I have been working on a Hadoop-compatible distributed storage for
>> >> matrices. I can currently multiply two 40K x 40K dense double
>> >> precision matrices in around 1 hour using 34 systems (16GB RAM, two
>> >> Core2Quads per node). I was wondering how this compares with Mahout.
>> >>
>> >> Regards,
>> >>  Vimal
>> >>
>> >
>>
>

Re: Current state of (dense) matrix multiplication?

Posted by Ted Dunning <te...@gmail.com>.
Without outrageous magic FFT algorithms, nxn times nxn matrix multiply
requires k_1 n^3 flops and at least k_2 n^2 I/O operations.  If you can
arrange to re-use every operand a LOT, then you only have to do 2 n^2 IOPS,
but if you fail in that, then I/O tends toward n^3 as well.  If you do
manage to get sufficient re-use, then the fact that FLOPS are faster than
IOPS lets us be CPU-bound.  In a single machine, IOPS are pretty fast, so
you can saturate the CPU more easily than in other settings.

For map-reduce on multiple machines, the IOPS/FLOPS ratio becomes evil and
keeping the CPU loaded becomes much harder.   In general, it is hard to
transfer even 50MBytes/s into a single mapper which is only 5M floating
point numbers per second.  At a perfect 40K FLOPS per number, this is only
200MFlops per CPU (with several cores!!).  Well tuned dense multiply
routines should crank out about 2-3GFlops per core.  Thus, for an 8 core
machine, we have a big deficiency to overcome before Hadoop style map-reduce
is practical for this.

For lots of algorithms, this gets even worse.  The convergence rate of
on-line gradient suffers badly with any kind of delay between computation of
the gradient and its application to the model.  Batching is a kind of
delay.  Moreover, a single CPU is often entirely sufficient to saturate I/O
devices.  This means that parallelizing the algorithm gives almost no wall
clock convergence speedup at all even though all CPU's might be very, very
busy.
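The n^2-versus-n^3 tension above can be made concrete with a back-of-envelope calculation; the 50 MBytes/s and 2-3 GFlops/core figures come from the text, while the 2n^3 flop count and the 3n^2 traffic floor are standard assumptions for A = B * C, not numbers from the thread.

```python
n = 40_000                      # two 40K x 40K dense operands
flops = 2 * n ** 3              # multiply-adds for one dense product
min_moves = 3 * n ** 2          # read B and C, write A: the n^2 floor
best_reuse = flops / min_moves  # ~26,667 flops per number moved, ideal case
doubles_per_sec = 50e6 / 8      # 50 MBytes/s into one mapper ~ 6.25M doubles/s

# Near-ideal reuse leaves the mapper compute-bound, but if blocking fails
# and traffic tends toward n^3 numbers, ~6M doubles/s of input cannot feed
# a dense kernel running at 2-3 GFlops per core.
print(best_reuse, doubles_per_sec)
```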

On Sun, Apr 11, 2010 at 9:27 PM, Steven Buss <st...@gmail.com> wrote:

> If you're just doing matrix multiplication, I would advise that Mahout
> (or any MapReduce approach) isn't well suited to your problem. I did
> the same computation with MATLAB (multiplying two 40k x 40k random
> double precision dense matrices) using 14 cores and about 36GB of RAM
> on a single machine* and it finished in about 55 minutes. If I'm
> reading your email correctly, you were working with 34*2*4=272 cores!
> I'm not sure if dense matrix multiplication can actually be
> efficiently mapreduced, but I am still a rookie so don't take my word
> for it.
>
> *The machine I am working on has 8 dual-core AMD Opteron 875s @ 2.2GHz
> per core, with 64GB total system memory.
>
> Steven Buss
> steven.buss@gmail.com
> http://www.stevenbuss.com/
>
>
>
> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Vimal,
> >
> > We don't have any distributed dense multiplication operations because we
> > have not yet found much application demand for distributed dense matrix
> > multiplication.  Distributed sparse matrix operations are a big deal,
> > however.
> >
> > If you are interested in working on the problem in the context of Mahout,
> we
> > would love to help.  This is especially true if you have an application
> that
> > needs dense operations and could benefit from some of the other
> capabilities
> > in Mahout.
> >
> > On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vm...@gmail.com>
> wrote:
> >
> >> Hi,
> >>  What's the current state of matrix-matrix multiplication in Mahout?
> >> Are there any performance results available for large matrices?
> >>
> >>  I have been working on a Hadoop-compatible distributed storage for
> >> matrices. I can currently multiply two 40K x 40K dense double
> >> precision matrices in around 1 hour using 34 systems (16GB RAM, two
> >> Core2Quads per node). I was wondering how this compares with Mahout.
> >>
> >> Regards,
> >>  Vimal
> >>
> >
>

Re: Current state of (dense) matrix multiplication?

Posted by Vimal Mathew <vm...@gmail.com>.
The naive matrix-multiplication algorithm is highly parallelizable if
you have the data available locally at all the nodes. The persistent
storage issue was one of the first problems that I tried solving (HDFS
is just wrong for the access patterns in matrix algorithms).

I can't compete with MATLAB yet! But I am planning to add support for
SSE2 instructions, so I might get close. Also, I don't have systems with
64GB RAM or 14 cores in one place :(
I hope to get much better results in a month or two.


On Mon, Apr 12, 2010 at 12:27 AM, Steven Buss <st...@gmail.com> wrote:
> If you're just doing matrix multiplication, I would advise that Mahout
> (or any MapReduce approach) isn't well suited to your problem. I did
> the same computation with MATLAB (multiplying two 40k x 40k random
> double precision dense matrices) using 14 cores and about 36GB of RAM
> on a single machine* and it finished in about 55 minutes. If I'm
> reading your email correctly, you were working with 34*2*4=272 cores!
> I'm not sure if dense matrix multiplication can actually be
> efficiently mapreduced, but I am still a rookie so don't take my word
> for it.
>
> *The machine I am working on has 8 dual-core AMD Opteron 875s @ 2.2GHz
> per core, with 64GB total system memory.
>
> Steven Buss
> steven.buss@gmail.com
> http://www.stevenbuss.com/
>
>
>
> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <te...@gmail.com> wrote:
>> Vimal,
>>
>> We don't have any distributed dense multiplication operations because we
>> have not yet found much application demand for distributed dense matrix
>> multiplication.  Distributed sparse matrix operations are a big deal,
>> however.
>>
>> If you are interested in working on the problem in the context of Mahout, we
>> would love to help.  This is especially true if you have an application that
>> needs dense operations and could benefit from some of the other capabilities
>> in Mahout.
>>
>> On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vm...@gmail.com> wrote:
>>
>>> Hi,
>>>  What's the current state of matrix-matrix multiplication in Mahout?
>>> Are there any performance results available for large matrices?
>>>
>>>  I have been working on a Hadoop-compatible distributed storage for
>>> matrices. I can currently multiply two 40K x 40K dense double
>>> precision matrices in around 1 hour using 34 systems (16GB RAM, two
>>> Core2Quads per node). I was wondering how this compares with Mahout.
>>>
>>> Regards,
>>>  Vimal
>>>
>>
>

Re: Current state of (dense) matrix multiplication?

Posted by Steven Buss <st...@gmail.com>.
If you're just doing matrix multiplication, I would advise that Mahout
(or any MapReduce approach) isn't well suited to your problem. I did
the same computation with MATLAB (multiplying two 40k x 40k random
double precision dense matrices) using 14 cores and about 36GB of RAM
on a single machine* and it finished in about 55 minutes. If I'm
reading your email correctly, you were working with 34*2*4=272 cores!
I'm not sure if dense matrix multiplication can actually be
efficiently mapreduced, but I am still a rookie so don't take my word
for it.

*The machine I am working on has 8 dual-core AMD Opteron 875s @ 2.2GHz
per core, with 64GB total system memory.

Steven Buss
steven.buss@gmail.com
http://www.stevenbuss.com/



On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <te...@gmail.com> wrote:
> Vimal,
>
> We don't have any distributed dense multiplication operations because we
> have not yet found much application demand for distributed dense matrix
> multiplication.  Distributed sparse matrix operations are a big deal,
> however.
>
> If you are interested in working on the problem in the context of Mahout, we
> would love to help.  This is especially true if you have an application that
> needs dense operations and could benefit from some of the other capabilities
> in Mahout.
>
> On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vm...@gmail.com> wrote:
>
>> Hi,
>>  What's the current state of matrix-matrix multiplication in Mahout?
>> Are there any performance results available for large matrices?
>>
>>  I have been working on a Hadoop-compatible distributed storage for
>> matrices. I can currently multiply two 40K x 40K dense double
>> precision matrices in around 1 hour using 34 systems (16GB RAM, two
>> Core2Quads' per node). I was wondering how this compares with Mahout.
>>
>> Regards,
>>  Vimal
>>
>

Re: Current state of (dense) matrix multiplication?

Posted by Ted Dunning <te...@gmail.com>.
Vimal,

We don't have any distributed dense multiplication operations because we
have not yet found much application demand for distributed dense matrix
multiplication.  Distributed sparse matrix operations are a big deal,
however.

If you are interested in working on the problem in the context of Mahout, we
would love to help.  This is especially true if you have an application that
needs dense operations and could benefit from some of the other capabilities
in Mahout.

On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vm...@gmail.com> wrote:

> Hi,
>  What's the current state of matrix-matrix multiplication in Mahout?
> Are there any performance results available for large matrices?
>
>  I have been working on a Hadoop-compatible distributed storage for
> matrices. I can currently multiply two 40K x 40K dense double
> precision matrices in around 1 hour using 34 systems (16GB RAM, two
> Core2Quads per node). I was wondering how this compares with Mahout.
>
> Regards,
>  Vimal
>