
Posted to common-user@hadoop.apache.org by Mike Spreitzer <ms...@us.ibm.com> on 2011/11/18 17:59:17 UTC

Matrix multiplication in Hadoop

Who is doing multiplication of large dense matrices using Hadoop?  What is 
a good way to do that computation using Hadoop?

Thanks,
Mike

Re: Matrix multiplication in Hadoop

Posted by Tom Peters <th...@gmail.com>.

I wrote up a basic algorithm for this here:
http://math.columbia.edu/~tpeters/teh-codez/hadoop/hadoop-matrix-mult.html
It's almost certainly not optimal, but it might give you some ideas.

Here is another approach: http://www.norstad.org/matrix-multiply/index.html

Cheers,

Tom

On Fri, Nov 18, 2011 at 11:59 AM, Mike Spreitzer <ms...@us.ibm.com> wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is
> a good way to do that computation using Hadoop?
>
> Thanks,
> Mike

Re: Matrix multiplication in Hadoop

Posted by "in.abdul" <in...@gmail.com>.

Team,
  I am not able to read the sequence file that the cluster produced. Please
help.
Problem: reading the sequence file returns null.

            Thanks and Regards,
        S SYED ABDUL KATHER
                9731841519


On Wed, Nov 23, 2011 at 10:52 AM, Mike Spreitzer [via Lucene] <
ml-node+s472066n3529843h59@n3.nabble.com> wrote:

> I am looking at large dense matrix multiplication as an example problem
> for a class of middleware.  I am also interested in sparse matrices, but
> am taking things one step at a time.
>
> There is a paper in IEEE CloudCom '10 about Hama, including a matrix
> multiplication technique.  It is essentially the same as what is called
> "technique 4" in the 2009 monograph by John Norstad cited early in this
> thread.  Which means that, despite the fact that Hama touts the virtues of
> BSP (a position with which I am very sympathetic), this technique doesn't
> really take advantage of the extra features that BSP has over MapReduce.
> Note also that this technique creates intermediate data of much greater
> volume than the input.  For example, if each matrix is stored as an NxN
> grid of blocks, the intermediate data (the blocks paired up, awaiting
> multiplication) is a factor of N larger than the input.  I have heard
> people saying that N may be rather larger than sqrt(number of machines)
> because in some circumstances N has to be chosen before the number of
> available machines is known and you want to be able to divide the NxN load
> among your machines rather evenly.  Even if N is like sqrt(number of
> machines) this is still an unwelcome amount of bloat.  In comparison, the
> SUMMA technique does matrix multiplication but its intermediate data
> volume is no greater than the input.
>
> Thanks,
> Mike
>



Re: Matrix multiplication in Hadoop

Posted by Mike Spreitzer <ms...@us.ibm.com>.

I am looking at large dense matrix multiplication as an example problem 
for a class of middleware.  I am also interested in sparse matrices, but 
am taking things one step at a time.

There is a paper in IEEE CloudCom '10 about Hama, including a matrix 
multiplication technique.  It is essentially the same as what is called
"technique 4" in the 2009 monograph by John Norstad cited earlier in this
thread.  This means that, despite the fact that Hama touts the virtues of
BSP (a position with which I am very sympathetic), this technique doesn't 
really take advantage of the extra features that BSP has over MapReduce. 
Note also that this technique creates intermediate data of much greater 
volume than the input.  For example, if each matrix is stored as an NxN 
grid of blocks, the intermediate data (the blocks paired up, awaiting 
multiplication) is a factor of N larger than the input.  I have heard 
people saying that N may be rather larger than sqrt(number of machines) 
because in some circumstances N has to be chosen before the number of 
available machines is known and you want to be able to divide the NxN load 
among your machines rather evenly.  Even if N is about sqrt(number of
machines) this is still an unwelcome amount of bloat.  In comparison, the 
SUMMA technique does matrix multiplication but its intermediate data 
volume is no greater than the input.
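
The factor-of-N claim above can be checked with a small count. This is only a sketch: the helper name block_counts is hypothetical, and it assumes the pairing scheme ships one (A[i][k], B[k][j]) pair of blocks for every (i, j, k), which is how the paragraph above describes the intermediate data. It does not model SUMMA.

```python
def block_counts(n):
    """Count blocks for two matrices, each stored as an n x n grid of blocks.

    Returns (input_blocks, paired_blocks). The pairing scheme is assumed to
    emit one (A[i][k], B[k][j]) pair, i.e. two blocks, per (i, j, k).
    """
    input_blocks = 2 * n * n
    paired_blocks = sum(
        2 for i in range(n) for j in range(n) for k in range(n)
    )
    return input_blocks, paired_blocks


for n in (2, 4, 8):
    inp, paired = block_counts(n)
    # The ratio paired / inp comes out equal to n.
    print(n, inp, paired, paired // inp)
```

For n = 4 this gives 32 input blocks and 128 paired blocks, so the intermediate data is 4 times the input, matching the factor of N stated above.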

Thanks,
Mike

Re: Matrix multiplication in Hadoop

Posted by Lance Norskog <go...@gmail.com>.

Look for uses of the DistributedRowMatrix in the Mahout code. The existing
Mahout jobs are generally end-to-end algorithm implementations which do
things like matrix multiplication in the middle. Also, the Mahout
algorithms generally prefer to use sparse data for distributed work.

What is a "large" matrix? You may find that you really don't need to go to
the effort of using Hadoop.

Lance

On Sat, Nov 19, 2011 at 3:07 PM, Stephen Boesch <ja...@gmail.com> wrote:

> Hi,
>   there are two solutions suggested that take advantage of either (a) a
> vector x matrix (your CF / Mahout example )  or (b) a small matrix x large
> matrix (an earlier suggestion of putting the  small matrix into the
> Distributed Cache).  Not clear yet on good approaches of (c)  large matrix
> x large matrix.
>
>
> 2011/11/19 <be...@gmail.com>
>
> > Hey Mike
> >          In mahout one place where   matrix multiplication is used is in
> >  Collaborative Filtering distributed implementation. The recommendations
> > here are generated by the multiplication of a cooccurence matrix with a
> > user vector. This user vector is treated as a single column matrix and
> > then the matrix multiplication takes place in there.
> >
> > Regards
> > Bejoy K S
> >
> > -----Original Message-----
> > From: Mike Spreitzer <ms...@us.ibm.com>
> > Date: Fri, 18 Nov 2011 14:52:05
> > To: <co...@hadoop.apache.org>
> > Reply-To: common-user@hadoop.apache.org
> > Subject: RE: Matrix multiplication in Hadoop
> >
> > Well, this mismatch may tell me something interesting about Hadoop.
> > Matrix multiplication has a lot of inherent parallelism, so from very crude
> > considerations it is not obvious that there should be a mismatch.  Why is
> > matrix multiplication ill-suited for Hadoop?
> >
> > BTW, I looked into the Mahout documentation some, and did not find matrix
> > multiplication there.  It might be hidden inside one of the advertised
> > algorithms; I looked at the documentation for a few, but did not notice
> > mention of MM.
> >
> > Thanks,
> > Mike
> >
> >
> >
> > From:   Michael Segel <mi...@hotmail.com>
> > To:     <co...@hadoop.apache.org>
> > Date:   11/18/2011 01:49 PM
> > Subject:        RE: Matrix multiplication in Hadoop
> >
> >
> >
> >
> > Ok Mike,
> >
> > First I admire that you are studying Hadoop.
> >
> > To answer your question... not well.
> >
> > Might I suggest that if you want to learn Hadoop, you try and find a
> > problem which can easily be broken in to a series of parallel tasks where
> > there is minimal communication requirements between each task?
> >
> > No offense, but if I could make a parallel... what you're asking is akin
> > to taking a normalized relational model and trying to run it as is in
> > HBase.
> > Yes it can be done. But not the best use of resources.
> >
> > > To: common-user@hadoop.apache.org
> > > CC: common-user@hadoop.apache.org
> > > Subject: Re: Matrix multiplication in Hadoop
> > > From: mspreitz@us.ibm.com
> > > Date: Fri, 18 Nov 2011 12:39:00 -0500
> > >
> > > That's also an interesting question, but right now I am studying Hadoop
> > > and want to know how well dense MM can be done in Hadoop.
> > >
> > > Thanks,
> > > Mike
> > >
> > >
> > >
> > > From:   Michel Segel <mi...@hotmail.com>
> > > To:     "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> > > Date:   11/18/2011 12:34 PM
> > > Subject:        Re: Matrix multiplication in Hadoop
> > >
> > >
> > >
> > > Is Hadoop the best tool for doing large matrix math.
> > > Sure you can do it, but, aren't there better tools for these types of
> > > problems?
> > >
> > >
> > > Sent from a remote device. Please excuse any typos...
> > >
> > > Mike Segel
> > >
> >
> >
> >
>



-- 
Lance Norskog
goksron@gmail.com

Re: Matrix multiplication in Hadoop

Posted by Stephen Boesch <ja...@gmail.com>.

Hi,
   There are two solutions suggested that take advantage of either (a) a
vector x matrix product (your CF / Mahout example) or (b) a small matrix x
large matrix product (an earlier suggestion of putting the small matrix into
the Distributed Cache).  I am not yet clear on good approaches for (c) large
matrix x large matrix.


2011/11/19 <be...@gmail.com>

> Hey Mike
>          In mahout one place where   matrix multiplication is used is in
>  Collaborative Filtering distributed implementation. The recommendations
> here are generated by the multiplication of a cooccurence matrix with a
> user vector. This user vector is treated as a single column matrix and then
> the matrix multiplication takes place in there.
>
> Regards
> Bejoy K S
>
> -----Original Message-----
> From: Mike Spreitzer <ms...@us.ibm.com>
> Date: Fri, 18 Nov 2011 14:52:05
> To: <co...@hadoop.apache.org>
> Reply-To: common-user@hadoop.apache.org
> Subject: RE: Matrix multiplication in Hadoop
>
> Well, this mismatch may tell me something interesting about Hadoop. Matrix
> multiplication has a lot of inherent parallelism, so from very crude
> considerations it is not obvious that there should be a mismatch.  Why is
> matrix multiplication ill-suited for Hadoop?
>
> BTW, I looked into the Mahout documentation some, and did not find matrix
> multiplication there.  It might be hidden inside one of the advertised
> algorithms; I looked at the documentation for a few, but did not notice
> mention of MM.
>
> Thanks,
> Mike
>
>
>
> From:   Michael Segel <mi...@hotmail.com>
> To:     <co...@hadoop.apache.org>
> Date:   11/18/2011 01:49 PM
> Subject:        RE: Matrix multiplication in Hadoop
>
>
>
>
> Ok Mike,
>
> First I admire that you are studying Hadoop.
>
> To answer your question... not well.
>
> Might I suggest that if you want to learn Hadoop, you try and find a
> problem which can easily be broken in to a series of parallel tasks where
> there is minimal communication requirements between each task?
>
> No offense, but if I could make a parallel... what you're asking is akin
> to taking a normalized relational model and trying to run it as is in
> HBase.
> Yes it can be done. But not the best use of resources.
>
> > To: common-user@hadoop.apache.org
> > CC: common-user@hadoop.apache.org
> > Subject: Re: Matrix multiplication in Hadoop
> > From: mspreitz@us.ibm.com
> > Date: Fri, 18 Nov 2011 12:39:00 -0500
> >
> > That's also an interesting question, but right now I am studying Hadoop
> > and want to know how well dense MM can be done in Hadoop.
> >
> > Thanks,
> > Mike
> >
> >
> >
> > From:   Michel Segel <mi...@hotmail.com>
> > To:     "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> > Date:   11/18/2011 12:34 PM
> > Subject:        Re: Matrix multiplication in Hadoop
> >
> >
> >
> > Is Hadoop the best tool for doing large matrix math.
> > Sure you can do it, but, aren't there better tools for these types of
> > problems?
> >
> >
> > Sent from a remote device. Please excuse any typos...
> >
> > Mike Segel
> >
>
>
>

Re: Matrix multiplication in Hadoop

Posted by be...@gmail.com.

Hey Mike
          In Mahout, one place where matrix multiplication is used is the distributed implementation of Collaborative Filtering. The recommendations there are generated by multiplying a co-occurrence matrix with a user vector. The user vector is treated as a single-column matrix, and the matrix multiplication is then carried out.
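
As a rough illustration of that matrix x vector step, here is a minimal single-process sketch. It is not Mahout code: the function name and the tiny 3-item data set are made up for the example, and the distributed version would split the rows across tasks.

```python
def cooccurrence_times_user(cooccurrence, user_vector):
    """Multiply a square co-occurrence matrix by a user column vector.

    Each output entry is the dot product of one matrix row with the vector,
    which serves as a recommendation score for that item.
    """
    return [
        sum(c * u for c, u in zip(row, user_vector))
        for row in cooccurrence
    ]


# Hypothetical data: entry [i][j] counts how often items i and j co-occur.
cooccurrence = [
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
]
user_vector = [1, 0, 1]  # the user has interacted with items 0 and 2

scores = cooccurrence_times_user(cooccurrence, user_vector)
print(scores)  # [1, 5, 1]
```

Item 1 gets the highest score because it co-occurs often with both items the user already has.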

Regards
Bejoy K S

-----Original Message-----
From: Mike Spreitzer <ms...@us.ibm.com>
Date: Fri, 18 Nov 2011 14:52:05 
To: <co...@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: RE: Matrix multiplication in Hadoop

Well, this mismatch may tell me something interesting about Hadoop. Matrix 
multiplication has a lot of inherent parallelism, so from very crude 
considerations it is not obvious that there should be a mismatch.  Why is 
matrix multiplication ill-suited for Hadoop?

BTW, I looked into the Mahout documentation some, and did not find matrix 
multiplication there.  It might be hidden inside one of the advertised 
algorithms; I looked at the documentation for a few, but did not notice 
mention of MM.

Thanks,
Mike

From:   Michael Segel <mi...@hotmail.com>
To:     <co...@hadoop.apache.org>
Date:   11/18/2011 01:49 PM
Subject:        RE: Matrix multiplication in Hadoop

Ok Mike,

First I admire that you are studying Hadoop. 

To answer your question... not well.

Might I suggest that if you want to learn Hadoop, you try and find a 
problem which can easily be broken in to a series of parallel tasks where 
there is minimal communication requirements between each task?

No offense, but if I could make a parallel... what you're asking is akin 
to taking a normalized relational model and trying to run it as is in 
HBase.
Yes it can be done. But not the best use of resources.

> To: common-user@hadoop.apache.org
> CC: common-user@hadoop.apache.org
> Subject: Re: Matrix multiplication in Hadoop
> From: mspreitz@us.ibm.com
> Date: Fri, 18 Nov 2011 12:39:00 -0500
> 
> That's also an interesting question, but right now I am studying Hadoop 
> and want to know how well dense MM can be done in Hadoop.
> 
> Thanks,
> Mike
> 
> 
> 
> From:   Michel Segel <mi...@hotmail.com>
> To:     "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> Date:   11/18/2011 12:34 PM
> Subject:        Re: Matrix multiplication in Hadoop
> 
> 
> 
> Is Hadoop the best tool for doing large matrix math. 
> Sure you can do it, but, aren't there better tools for these types of 
> problems?
> 
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
>

Re: Matrix multiplication in Hadoop

Posted by He Chen <ai...@gmail.com>.

Right, I agree with Edward Capriolo, Hadoop + GPGPU is a better choice.



On Sat, Nov 19, 2011 at 10:53 AM, Edward Capriolo <ed...@gmail.com> wrote:

> Sounds like a job for next gen map reduce native libraries and gpu's. A
> modern day Dr frankenstein for sure.
>
> On Saturday, November 19, 2011, Tim Broberg <Ti...@exar.com> wrote:
> > Perhaps this is a good candidate for a native library, then?
> >
> > ________________________________________
> > From: Mike Davis [xmikedavis@gmail.com]
> > Sent: Friday, November 18, 2011 7:39 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: Matrix multiplication in Hadoop
> >
> > On Friday, November 18, 2011, Mike Spreitzer <ms...@us.ibm.com> wrote:
> >>  Why is matrix multiplication ill-suited for Hadoop?
> >
> > IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
> > specific SIMD instructions and, by extension, optimized BLAS routines.
> > Running a large MM task using intel's MKL rather than relying on generic
> > compiler optimization is orders of magnitude faster on a single multicore
> > processor. I see almost no way that Hadoop could win such a CPU intensive
> > task against an mpi cluster with even a tenth of the nodes running with a
> > decently tuned BLAS library. Racing even against a single CPU might be
> > difficult, given the i/o overhead.
> >
> > Still, it's a reasonably common problem and we shouldn't murder the good in
> > favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> > mediocre performance, wrt C, would get used.
> >
> > --
> > Mike Davis
> >
> > The information and any attached documents contained in this message
> > may be confidential and/or legally privileged.  The message is
> > intended solely for the addressee(s).  If you are not the intended
> > recipient, you are hereby notified that any use, dissemination, or
> > reproduction is strictly prohibited and may be unlawful.  If you are
> > not the intended recipient, please contact the sender immediately by
> > return e-mail and destroy all copies of the original message.
> >
>

Re: Matrix multiplication in Hadoop

Posted by Michel Segel <mi...@hotmail.com>.

You really don't need to wait...

If you're going to go down this path, you can use a JNI wrapper to call the C/C++ code for the GPU...
You can do that now...

If you want to go beyond the 1D you can do it but you have to get a bit creative... but it's doable...


Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 19, 2011, at 10:53 AM, Edward Capriolo <ed...@gmail.com> wrote:

> Sounds like a job for next gen map reduce native libraries and gpu's. A
> modern day Dr frankenstein for sure.
> 
> On Saturday, November 19, 2011, Tim Broberg <Ti...@exar.com> wrote:
>> Perhaps this is a good candidate for a native library, then?
>> 
>> ________________________________________
>> From: Mike Davis [xmikedavis@gmail.com]
>> Sent: Friday, November 18, 2011 7:39 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Matrix multiplication in Hadoop
>> 
>> On Friday, November 18, 2011, Mike Spreitzer <ms...@us.ibm.com> wrote:
>>> Why is matrix multiplication ill-suited for Hadoop?
>> 
>> IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
>> specific SIMD instructions and, by extension, optimized BLAS routines.
>> Running a large MM task using intel's MKL rather than relying on generic
>> compiler optimization is orders of magnitude faster on a single multicore
>> processor. I see almost no way that Hadoop could win such a CPU intensive
>> task against an mpi cluster with even a tenth of the nodes running with a
>> decently tuned BLAS library. Racing even against a single CPU might be
>> difficult, given the i/o overhead.
>> 
>> Still, it's a reasonably common problem and we shouldn't murder the good in
>> favor of the best. I'm certain a MM/LinAlg Hadoop library with even
>> mediocre performance, wrt C, would get used.
>> 
>> --
>> Mike Davis
>> 
>>

Re: Matrix multiplication in Hadoop

Posted by Edward Capriolo <ed...@gmail.com>.

Sounds like a job for next-gen MapReduce native libraries and GPUs. A
modern-day Dr. Frankenstein for sure.

On Saturday, November 19, 2011, Tim Broberg <Ti...@exar.com> wrote:
> Perhaps this is a good candidate for a native library, then?
>
> ________________________________________
> From: Mike Davis [xmikedavis@gmail.com]
> Sent: Friday, November 18, 2011 7:39 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Matrix multiplication in Hadoop
>
> On Friday, November 18, 2011, Mike Spreitzer <ms...@us.ibm.com> wrote:
>>  Why is matrix multiplication ill-suited for Hadoop?
>
> IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
> specific SIMD instructions and, by extension, optimized BLAS routines.
> Running a large MM task using intel's MKL rather than relying on generic
> compiler optimization is orders of magnitude faster on a single multicore
> processor. I see almost no way that Hadoop could win such a CPU intensive
> task against an mpi cluster with even a tenth of the nodes running with a
> decently tuned BLAS library. Racing even against a single CPU might be
> difficult, given the i/o overhead.
>
> Still, it's a reasonably common problem and we shouldn't murder the good in
> favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> mediocre performance, wrt C, would get used.
>
> --
> Mike Davis
>
>

Re: Matrix multiplication in Hadoop

Posted by Tommaso Teofili <to...@gmail.com>.

I agree Hama (and BSP model) could be a good option, plus Hama also
supports MR nextgen now [1].
I know MM has been implemented with Hama in the past so it may be worth
asking on the mailing list.

My 2 cents,
Tommaso

[1] : http://svn.apache.org/repos/asf/incubator/hama/trunk/yarn/


2011/11/19 He Chen <ai...@gmail.com>

> Did you try Hama?
>
> There are may methods.
>
> 1) use Hadoop MPI which allows you use MPI MM code based on Hadoop;
>
> 2) Hama is designed for MM
>
> 3) Use pure Hadoop Java MapReduce;
>
> I did this before but may not be optimal algorithm. Put your first matrix
> in DistributedCache and take second matrix line as inputsplit. For each
> line, use a mapper to let a array multply the first matrix in
> DistributedCache. Use reducer to collect the result matrix. This algorithm
> is limited by your DistributedCache size. It is suitable for a small matrix
> to multiply a huge matrix.
>
> Chen
> On Sat, Nov 19, 2011 at 10:34 AM, Tim Broberg <Ti...@exar.com> wrote:
>
> > Perhaps this is a good candidate for a native library, then?
> >
> > ________________________________________
> > From: Mike Davis [xmikedavis@gmail.com]
> > Sent: Friday, November 18, 2011 7:39 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: Matrix multiplication in Hadoop
> >
> >  On Friday, November 18, 2011, Mike Spreitzer <ms...@us.ibm.com> wrote:
> > >  Why is matrix multiplication ill-suited for Hadoop?
> >
> > IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
> > specific SIMD instructions and, by extension, optimized BLAS routines.
> > Running a large MM task using intel's MKL rather than relying on generic
> > compiler optimization is orders of magnitude faster on a single multicore
> > processor. I see almost no way that Hadoop could win such a CPU intensive
> > task against an mpi cluster with even a tenth of the nodes running with a
> > decently tuned BLAS library. Racing even against a single CPU might be
> > difficult, given the i/o overhead.
> >
> > Still, it's a reasonably common problem and we shouldn't murder the good in
> > favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> > mediocre performance, wrt C, would get used.
> >
> > --
> > Mike Davis
> >
> >
>

Re: Matrix multiplication in Hadoop

Posted by He Chen <ai...@gmail.com>.

Did you try Hama?

There are many methods.

1) Use Hadoop MPI, which allows you to run MPI MM code on top of Hadoop;

2) Use Hama, which is designed for MM;

3) Use pure Hadoop Java MapReduce.

I did this before, but it may not be the optimal algorithm. Put your first
matrix in the DistributedCache and take the lines of the second matrix as the
input split. For each line, use a mapper to multiply that array by the first
matrix in the DistributedCache. Use a reducer to collect the result matrix.
This algorithm is limited by your DistributedCache size. It is suitable for
multiplying a small matrix by a huge matrix.
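
The mapper/reducer split described above can be sketched in plain Python, without any Hadoop API. Several things here are assumptions for illustration: the names mapper and reducer are ordinary functions rather than Hadoop classes, each input line is taken to be one row of the large matrix, and the small matrix passed as cached_small stands in for the copy each task would read from the DistributedCache. Under those assumptions every mapper call yields one row of (large x small).

```python
def mapper(row_index, row, cached_small):
    """Multiply one row of the large matrix by the cached small matrix.

    Returns (row_index, product_row) so the reducer can restore row order.
    """
    n_cols = len(cached_small[0])
    product = [
        sum(row[k] * cached_small[k][j] for k in range(len(row)))
        for j in range(n_cols)
    ]
    return row_index, product


def reducer(mapped):
    """Collect mapper outputs into the result matrix, ordered by row index."""
    return [product for _, product in sorted(mapped)]


small = [[1, 2], [3, 4]]           # stands in for the cached matrix
large = [[1, 0], [0, 1], [2, 2]]   # one row per input line

result = reducer(mapper(i, row, small) for i, row in enumerate(large))
print(result)  # [[1, 2], [3, 4], [8, 12]]
```

Each mapper call needs only its own row plus the cached matrix, so no data moves between mappers. That is also why the approach stops working once the cached matrix no longer fits in memory on each node.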

Chen
On Sat, Nov 19, 2011 at 10:34 AM, Tim Broberg <Ti...@exar.com> wrote:

> Perhaps this is a good candidate for a native library, then?
>
> ________________________________________
> From: Mike Davis [xmikedavis@gmail.com]
> Sent: Friday, November 18, 2011 7:39 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Matrix multiplication in Hadoop
>
>  On Friday, November 18, 2011, Mike Spreitzer <ms...@us.ibm.com> wrote:
> >  Why is matrix multiplication ill-suited for Hadoop?
>
> IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
> specific SIMD instructions and, by extension, optimized BLAS routines.
> Running a large MM task using intel's MKL rather than relying on generic
> compiler optimization is orders of magnitude faster on a single multicore
> processor. I see almost no way that Hadoop could win such a CPU intensive
> task against an mpi cluster with even a tenth of the nodes running with a
> decently tuned BLAS library. Racing even against a single CPU might be
> difficult, given the i/o overhead.
>
> Still, it's a reasonably common problem and we shouldn't murder the good in
> favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> mediocre performance, wrt C, would get used.
>
> --
> Mike Davis
>
>

RE: Matrix multiplication in Hadoop

Posted by Tim Broberg <Ti...@exar.com>.

Perhaps this is a good candidate for a native library, then?

________________________________________
From: Mike Davis [xmikedavis@gmail.com]
Sent: Friday, November 18, 2011 7:39 PM
To: common-user@hadoop.apache.org
Subject: Re: Matrix multiplication in Hadoop

On Friday, November 18, 2011, Mike Spreitzer <ms...@us.ibm.com> wrote:
>  Why is matrix multiplication ill-suited for Hadoop?

IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
specific SIMD instructions and, by extension, optimized BLAS routines.
Running a large MM task using intel's MKL rather than relying on generic
compiler optimization is orders of magnitude faster on a single multicore
processor. I see almost no way that Hadoop could win such a CPU intensive
task against an mpi cluster with even a tenth of the nodes running with a
decently tuned BLAS library. Racing even against a single CPU might be
difficult, given the i/o overhead.

Still, it's a reasonably common problem and we shouldn't murder the good in
favor of the best. I'm certain a MM/LinAlg Hadoop library with even
mediocre performance, wrt C, would get used.

--
Mike Davis


Re: Matrix multiplication in Hadoop

Posted by Mike Davis <xm...@gmail.com>.

On Friday, November 18, 2011, Mike Spreitzer <ms...@us.ibm.com> wrote:
>  Why is matrix multiplication ill-suited for Hadoop?

IMHO, a huge issue here is the JVM's inability to fully support
CPU-vendor-specific SIMD instructions and, by extension, optimized BLAS
routines. Running a large MM task using Intel's MKL rather than relying on
generic compiler optimization is orders of magnitude faster on a single
multicore processor. I see almost no way that Hadoop could win such a
CPU-intensive task against an MPI cluster with even a tenth of the nodes
running a decently tuned BLAS library. Racing even against a single CPU
might be difficult, given the I/O overhead.

Still, it's a reasonably common problem and we shouldn't murder the good in
favor of the best. I'm certain a MM/LinAlg Hadoop library with even
mediocre performance, wrt C, would get used.

--
Mike Davis

RE: Matrix multiplication in Hadoop

Posted by Mike Spreitzer <ms...@us.ibm.com>.

Well, this mismatch may tell me something interesting about Hadoop. Matrix 
multiplication has a lot of inherent parallelism, so from very crude 
considerations it is not obvious that there should be a mismatch.  Why is 
matrix multiplication ill-suited for Hadoop?

BTW, I looked into the Mahout documentation some, and did not find matrix 
multiplication there.  It might be hidden inside one of the advertised 
algorithms; I looked at the documentation for a few, but did not notice 
mention of MM.

Thanks,
Mike



From:   Michael Segel <mi...@hotmail.com>
To:     <co...@hadoop.apache.org>
Date:   11/18/2011 01:49 PM
Subject:        RE: Matrix multiplication in Hadoop




Ok Mike,

First I admire that you are studying Hadoop. 

To answer your question... not well.

Might I suggest that if you want to learn Hadoop, you try and find a 
problem which can easily be broken in to a series of parallel tasks where 
there is minimal communication requirements between each task?

No offense, but if I could make a parallel... what you're asking is akin 
to taking a normalized relational model and trying to run it as is in 
HBase.
Yes it can be done. But not the best use of resources.

> To: common-user@hadoop.apache.org
> CC: common-user@hadoop.apache.org
> Subject: Re: Matrix multiplication in Hadoop
> From: mspreitz@us.ibm.com
> Date: Fri, 18 Nov 2011 12:39:00 -0500
> 
> That's also an interesting question, but right now I am studying Hadoop 
> and want to know how well dense MM can be done in Hadoop.
> 
> Thanks,
> Mike
> 
> 
> 
> From:   Michel Segel <mi...@hotmail.com>
> To:     "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> Date:   11/18/2011 12:34 PM
> Subject:        Re: Matrix multiplication in Hadoop
> 
> 
> 
> Is Hadoop the best tool for doing large matrix math. 
> Sure you can do it, but, aren't there better tools for these types of 
> problems?
> 
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
>

RE: Matrix multiplication in Hadoop

Posted by Michael Segel <mi...@hotmail.com>.

Ok Mike,

First I admire that you are studying Hadoop. 

To answer your question... not well.

Might I suggest that if you want to learn Hadoop, you try to find a problem which can easily be broken into a series of parallel tasks where there are minimal communication requirements between the tasks?

No offense, but if I could make a parallel... what you're asking is akin to taking a normalized relational model and trying to run it as is in HBase.
Yes it can be done. But not the best use of resources.

> To: common-user@hadoop.apache.org
> CC: common-user@hadoop.apache.org
> Subject: Re: Matrix multiplication in Hadoop
> From: mspreitz@us.ibm.com
> Date: Fri, 18 Nov 2011 12:39:00 -0500
> 
> That's also an interesting question, but right now I am studying Hadoop 
> and want to know how well dense MM can be done in Hadoop.
> 
> Thanks,
> Mike
> 
> 
> 
> From:   Michel Segel <mi...@hotmail.com>
> To:     "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> Date:   11/18/2011 12:34 PM
> Subject:        Re: Matrix multiplication in Hadoop
> 
> 
> 
> Is Hadoop the best tool for doing large matrix math. 
> Sure you can do it, but, aren't there better tools for these types of 
> problems?
> 
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
>

Re: Matrix multiplication in Hadoop

Posted by Mike Spreitzer <ms...@us.ibm.com>.

That's also an interesting question, but right now I am studying Hadoop 
and want to know how well dense MM can be done in Hadoop.

Thanks,
Mike

From:   Michel Segel <mi...@hotmail.com>
To:     "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Date:   11/18/2011 12:34 PM
Subject:        Re: Matrix multiplication in Hadoop

Is Hadoop the best tool for doing large matrix math?
Sure you can do it, but, aren't there better tools for these types of 
problems?

Sent from a remote device. Please excuse any typos...

Mike Segel

Re: Matrix multiplication in Hadoop

Posted by Edward Capriolo <ed...@gmail.com>.

A problem with matrix multiplication in Hadoop is that Hadoop is row
oriented for the most part. I have thought about this use case, however, and
you can theoretically flatten a 2D matrix into a 1D array, which then fits
the row-oriented nature of Hadoop. Also, since a typical mapper can have a
fairly large chunk of memory, such as 1024MB, I have done work like this
before where I loaded such datasets into memory to process them. That
usage does not really fit the MapReduce model.
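
To illustrate the flattening idea, here is a minimal sketch in plain Python (not Hadoop code). It assumes row-major order, so element (i, j) of a matrix with n_cols columns lives at index i * n_cols + j. The function names flatten and multiply_flat are made up for this example.

```python
def flatten(matrix):
    """Flatten a 2D list of rows into a 1D list in row-major order."""
    n_cols = len(matrix[0])
    flat = [value for row in matrix for value in row]
    return flat, n_cols


def multiply_flat(a, a_cols, b, b_cols):
    """Multiply two flattened matrices held entirely in memory.

    a is (a_rows x a_cols) and b is (a_cols x b_cols), both row-major.
    Returns the flattened (a_rows x b_cols) product.
    """
    a_rows = len(a) // a_cols
    c = [0.0] * (a_rows * b_cols)
    for i in range(a_rows):
        for k in range(a_cols):
            a_ik = a[i * a_cols + k]
            for j in range(b_cols):
                # c[i][j] += a[i][k] * b[k][j], using 1D indexing
                c[i * b_cols + j] += a_ik * b[k * b_cols + j]
    return c


a, a_cols = flatten([[1, 2], [3, 4]])
b, b_cols = flatten([[5, 6], [7, 8]])
product = multiply_flat(a, a_cols, b, b_cols)
```

This only works while both operands fit in one process's memory, which is the limitation noted above.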

I have been wanting to look at:
http://www.scidb.org/

Edward

On Fri, Nov 18, 2011 at 1:48 PM, Ayon Sinha <ay...@yahoo.com> wrote:

> I'd be really interested in a comparison of NumPy/Octave/MATLAB-style
> tools with a Hadoop implementation (let's say 4-10 large cloud servers) as
> the size of the matrix grows. I want to know the scale at which Hadoop really
> starts to pull away.
>
> -Ayon
>
>
>
> ________________________________
> From: Michel Segel <mi...@hotmail.com>
> To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
> Sent: Friday, November 18, 2011 9:33 AM
> Subject: Re: Matrix multiplication in Hadoop
>
> Is Hadoop the best tool for doing large matrix math?
> Sure you can do it, but, aren't there better tools for these types of
> problems?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Nov 18, 2011, at 10:59 AM, Mike Spreitzer <ms...@us.ibm.com> wrote:
>
> > Who is doing multiplication of large dense matrices using Hadoop?  What is
> > a good way to do that computation using Hadoop?
> >
> > Thanks,
> > Mike

Re: Matrix multiplication in Hadoop

Posted by Ayon Sinha <ay...@yahoo.com>.

I'd be really interested in a comparison of NumPy/Octave/MATLAB-style tools with a Hadoop implementation (let's say 4-10 large cloud servers) as the size of the matrix grows. I want to know the scale at which Hadoop really starts to pull away.
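
One rough way to frame that scale question is to ask when a dense multiplication stops fitting in a single machine's memory. The sketch below is only a back-of-the-envelope estimate. It assumes 8-byte double-precision elements and that three square matrices (A, B, and the result C) must be resident at once, and it ignores all runtime overhead. The function names are made up for this example.

```python
import math


def dense_matrix_bytes(n, bytes_per_element=8):
    """Bytes needed to hold one dense n-by-n matrix."""
    return n * n * bytes_per_element


def max_square_dimension(memory_bytes, matrices=3, bytes_per_element=8):
    """Largest n such that `matrices` dense n-by-n matrices fit in memory_bytes."""
    return math.isqrt(memory_bytes // (matrices * bytes_per_element))


# A single 10,000 x 10,000 matrix of doubles
one_matrix = dense_matrix_bytes(10_000)

# Largest square problem that fits in 1 GiB under the assumptions above
limit_1gib = max_square_dimension(2 ** 30)
```

This says nothing about speed. It only marks the size beyond which a single-process tool would need out-of-core or distributed techniques.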

-Ayon

________________________________
From: Michel Segel <mi...@hotmail.com>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Sent: Friday, November 18, 2011 9:33 AM
Subject: Re: Matrix multiplication in Hadoop

Is Hadoop the best tool for doing large matrix math?
Sure you can do it, but, aren't there better tools for these types of problems?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 18, 2011, at 10:59 AM, Mike Spreitzer <ms...@us.ibm.com> wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is 
> a good way to do that computation using Hadoop?
> 
> Thanks,
> Mike

Re: Matrix multiplication in Hadoop

Posted by Michel Segel <mi...@hotmail.com>.

Is Hadoop the best tool for doing large matrix math?
Sure you can do it, but, aren't there better tools for these types of problems?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 18, 2011, at 10:59 AM, Mike Spreitzer <ms...@us.ibm.com> wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is 
> a good way to do that computation using Hadoop?
> 
> Thanks,
> Mike

Re: Matrix multiplication in Hadoop

Posted by John Conwell <jo...@iamjohn.me>.

I'm not sure, but I suspect that Mahout has some low-level MapReduce
jobs for this.  You might start there.

On Fri, Nov 18, 2011 at 8:59 AM, Mike Spreitzer <ms...@us.ibm.com> wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is
> a good way to do that computation using Hadoop?
>
> Thanks,
> Mike

-- 

Thanks,
John C