Posted to common-user@hadoop.apache.org by Milan Simonovic <am...@gmail.com> on 2007/12/28 21:52:27 UTC

Matrix multiplication

Hi all,

  I'm trying to implement a clustering algorithm on hadoop. Among
  other things, there're a lot of matrix multiplications. LINA
  (http://wiki.apache.org/lucene-hadoop/Lina) is probably going to be
  a perfect fit here, but I can't afford to wait. Btw, I can't find
  HADOOP-1655 any more, what's going on?

  Using the ordinary matrix product (the sum of row-by-column products
  gives one element of the resulting matrix), the easiest way to
  formulate this computation is to have one row and one column sent to
  a mapper, with one element of the resulting matrix as the output.
  The reducer can take this element and put it into the correct
  position in the output file.
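
  A minimal in-memory sketch of that keying scheme (plain Java,
  simulating the map/reduce flow rather than using the hadoop API;
  class and method names are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;

    // One map call per (row i of A, column j of B) pair; the map output
    // is the single element C[i][j], keyed by its position in the result.
    public class NaiveMatrixProduct {
        static void map(int i, double[] aRow, int j, double[] bCol,
                        Map<String, Double> output) {
            double dot = 0.0;
            for (int k = 0; k < aRow.length; k++)
                dot += aRow[k] * bCol[k];
            output.put(i + "," + j, dot);  // key = position in the result
        }

        public static void main(String[] args) {
            double[][] a    = {{1, 2}, {3, 4}};  // rows of A
            double[][] bCol = {{5, 7}, {6, 8}};  // columns of B
            Map<String, Double> output = new HashMap<>();
            for (int i = 0; i < a.length; i++)       // n^2 map calls
                for (int j = 0; j < bCol.length; j++)
                    map(i, a[i], j, bCol[j], output);
            // the "reduce" only places each value at its key's position
            output.forEach((p, v) -> System.out.println("C[" + p + "] = " + v));
        }
    }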

  I need your advice on how to design input file(s) and how to make
  input splits then. I'd like to have matrices in separate files
  (they'll be used for more than one multiplication, and it's cleaner
  to have them separate).

  I guess then I'd have to use MultiFileSplit and MultiFileInputFormat
  somehow. Is it possible at all to send two records (one row and one
  column, or two rows if the other matrix is stored column-oriented)
  from two input splits to a single mapper? Or should I look for an
  alternative way to multiply matrices?
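
  To make the question concrete, here is a minimal in-memory sketch of
  one standard alternative: key both matrices by the shared inner index
  k, so that the framework (rather than a custom InputFormat) brings
  the matching records together, then sum the partial products per
  output cell (illustrative names, not an existing API):

    import java.util.HashMap;
    import java.util.Map;

    // "Join": for each shared index k, every A[i][k] meets every B[k][j],
    // emitting a partial product keyed by the output cell (i,j).
    // "Sum": partial products are summed per key; the merge() call below
    // stands in for that summing reducer.
    public class JoinMatrixProduct {
        public static void main(String[] args) {
            double[][] a = {{1, 2}, {3, 4}};
            double[][] b = {{5, 6}, {7, 8}};
            Map<String, Double> c = new HashMap<>();
            for (int k = 0; k < b.length; k++)             // join key
                for (int i = 0; i < a.length; i++)         // A records for k
                    for (int j = 0; j < b[k].length; j++)  // B records for k
                        c.merge(i + "," + j, a[i][k] * b[k][j], Double::sum);
            c.forEach((p, v) -> System.out.println("C[" + p + "] = " + v));
        }
    }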


-- 
regards,
Milan


RE: Re[2]: Matrix multiplication

Posted by edward yoon <we...@udanax.org>.
My team and I made an API as a common matrix toolkit.
(It will perform the matrix operations using map/reduce on hadoop + javaspace.)
A test was done on a large cluster.

Let's start on Monday.

B. Regards,
Edward yoon.


> Date: Sat, 29 Dec 2007 11:08:17 +0100
> From: amavisto@gmail.com
> To: hadoop-user@lucene.apache.org
> Subject: Re[2]: Matrix multiplication
>
>
> Good morning Ed,
>
> thanks for your reply. I'm sorry to hear your contrib was not
> accepted. Please send me a link to your svn, I'd like to try it.
> Btw what's the current status, do you have a working (proof-of-concept
> as you said) version?
>
> --
> regards,
> Milan
>


Re[2]: Matrix multiplication

Posted by Milan Simonovic <am...@gmail.com>.
Good morning Ed,

thanks for your reply. I'm sorry to hear your contrib was not
accepted. Please send me a link to your svn, I'd like to try it.
Btw what's the current status, do you have a working (proof-of-concept
as you said) version?

-- 
regards,
 Milan         


RE: Matrix multiplication

Posted by edward yoon <we...@udanax.org>.
Hello, this is Ed Yoon.

I'm afraid the hbase committers felt that a Matrix & Linear Algebra proposal was not a good fit for the hbase project.
So, I'd like to move the Matrix & Linear Algebra proposal over to hadoop after reworking the architecture.

If it is more complete and, at the same time, better organized,
I believe we'll get another chance to keep making good contributions.

I'm not a hadoop committer, so I use a separate svn.
If you want to see my svn, please mail me via private e-mail.

B. Regards,
Edward yoon @ NHN, corp.
Home : http://www.udanax.org


> Date: Fri, 28 Dec 2007 21:52:27 +0100
> From: amavisto@gmail.com
> To: hadoop-user@lucene.apache.org
> Subject: Matrix multiplication
>
> Hi all,
>
> I'm trying to implement a clustering algorithm on hadoop. Among
> other things, there're a lot of matrix multiplications. LINA
> (http://wiki.apache.org/lucene-hadoop/Lina) is probably going to be
> a perfect fit here, but I can't afford to wait. Btw, I can't find
> HADOOP-1655 any more, what's going on?
>
> Using the ordinary matrix product (the sum of row-by-column products
> gives one element of the resulting matrix), the easiest way to
> formulate this computation is to have one row and one column sent to
> a mapper, with one element of the resulting matrix as the output.
> The reducer can take this element and put it into the correct
> position in the output file.
>
> I need your advice on how to design input file(s) and how to make
> input splits then. I'd like to have matrices in separate files
> (they'll be used for more than one multiplication, and it's cleaner
> to have them separate).
>
> I guess then I'd have to use MultiFileSplit and MultiFileInputFormat
> somehow. Is it possible at all to send two records (one row and one
> column, or two rows if the other matrix is stored column-oriented)
> from two input splits to a single mapper? Or should I look for an
> alternative way to multiply matrices?
>
>
> --
> regards,
> Milan
>


RE: Re[4]: Matrix multiplication

Posted by edward yoon <we...@udanax.org>.
Dear Ted,

I'd like to contribute the large-scale matrix system project to hadoop; it uses MapFile or Hbase storage + Map/Reduce to perform the matrix operations.
Can it become a contrib project?

Please let me know.



B. Regards,
Edward yoon.

----------------------------------------
> Date: Sat, 29 Dec 2007 14:37:01 -0800
> Subject: Re: Re[4]: Matrix multiplication
> From: tdunning@veoh.com
> To: hadoop-user@lucene.apache.org
> 
> 
> I figured.
> 
> 
> On 12/29/07 7:03 AM, "Milan Simonovic"  wrote:
> 
>> 
>> That's what I wanted to say :) my mistake
>> 
>> Saturday, December 29, 2007, 3:55:31 PM, you wrote:
>> 
>>> Actually, this isn't true (you must know this).  Each element is multiplied
>>> by every element of the corresponding row or column of the other matrix.
>>> This is (thankfully) much less communication.
> 


Re: Re[4]: Matrix multiplication

Posted by Ted Dunning <td...@veoh.com>.
I figured.


On 12/29/07 7:03 AM, "Milan Simonovic" <am...@gmail.com> wrote:

> 
> That's what I wanted to say :) my mistake
> 
> Saturday, December 29, 2007, 3:55:31 PM, you wrote:
> 
>> Actually, this isn't true (you must know this).  Each element is multiplied
>> by every element of the corresponding row or column of the other matrix.
>> This is (thankfully) much less communication.


Re[4]: Matrix multiplication

Posted by Milan Simonovic <am...@gmail.com>.
That's what I wanted to say :) my mistake

Saturday, December 29, 2007, 3:55:31 PM, you wrote:

> Actually, this isn't true (you must know this).  Each element is multiplied
> by every element of the corresponding row or column of the other matrix.
> This is (thankfully) much less communication.

-- 
regards,
Milan


Re: Re[2]: Matrix multiplication

Posted by Ted Dunning <td...@veoh.com>.
Actually, this isn't true (you must know this).  Each element is multiplied
by every element of the corresponding row or column of the other matrix.
This is (thankfully) much less communication.


On 12/29/07 6:48 AM, "Milan Simonovic" <am...@gmail.com> wrote:

> Ordinary matrix multiplication makes this difficult because each
> element from one matrix will be multiplied by all the elements from
> the other matrix. Unfortunately, not all problems are splittable the
> way word counting is, where moving data between nodes is not required.


Re: Re[2]: Matrix multiplication

Posted by Ted Dunning <td...@veoh.com>.
The most surprising thing about hadoop is the degree to which you are
exactly correct.

My feeling is that what is really happening is that the pain is moving (and
moderating) to the process of adopting map-reduce as a programming paradigm.
Once you do that, the pain is largely over.


On 12/29/07 6:48 AM, "Milan Simonovic" <am...@gmail.com> wrote:

> I was thinking that most of the time
> hadoop should protect me from worrying about data access time,
> bandwidth, etc.


Re[2]: Matrix multiplication

Posted by Milan Simonovic <am...@gmail.com>.
Ted,

thanks for your reply. This is still in an early phase of research, so
I wouldn't like to spend much time on the infrastructure (I need to
sleep too ;). The simplest possible solution that works is OK for now.
I'll wait for Ed's implementation.

Your mail actually made me think about my perception of the map-reduce
model and the hadoop implementation. I was thinking that, most of the
time, hadoop should protect me from worrying about data access time,
bandwidth, etc., even if that means the computation will be, let's say,
several times slower than it would be in the optimal implementation. I
assume you're probably talking about the optimal one, or at least a
good one, and I agree with you.

Of course, hadoop can't hide this completely; I'd still have to follow
some guidelines (use an optimal number of mappers/reducers, use
combiners, make splits large enough that a mapper can work for a
couple of minutes, and so on). Hadoop should try to cut down the
bandwidth (by spawning a mapper close to the data, etc.).
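
For the matrix case, the combiner guideline would look roughly like
this: sum the partial products for each output cell on the map side,
before they cross the network. A minimal in-memory stand-in (not the
hadoop Combiner API; names are illustrative):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A mapper computing partial products emits many (cell, value) pairs;
    // a summing combiner collapses them so that at most one pair per cell
    // leaves the node.
    public class SummingCombiner {
        static Map<String, Double> combine(
                List<Map.Entry<String, Double>> mapOutput) {
            Map<String, Double> combined = new HashMap<>();
            for (Map.Entry<String, Double> e : mapOutput)
                combined.merge(e.getKey(), e.getValue(), Double::sum);
            return combined;
        }
    }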

Ordinary matrix multiplication makes this difficult because each
element from one matrix will be multiplied by all the elements from
the other matrix. Unfortunately, not all problems are splittable the
way word counting is, where moving data between nodes is not required.

This is probably the single-machine developer inside me complaining :)
I'll have to consider better ways to partition my problem(s) eventually...

Again, thanks for your mail. I have a few more words for you privately.

-- 
regards,
 Milan


Re: Matrix multiplication

Posted by Ted Dunning <td...@veoh.com>.
For dense matrix multiplication, the key problem is that you have O(n^3)
arithmetic operations but only O(n^2) element fetches.  Most conventional
machines now have a ratio of nearly 10^2 or more between the speed of the
arithmetic processor and memory, so for n > 100 you should be able to
saturate the arithmetic unit of the processor with carefully written code
(for n = 1000, that is roughly 2 x 10^9 multiply-adds against only
3 x 10^6 matrix elements).

Unfortunately, naively written code changes matters by requiring O(n^3)
memory operations, with an attendant slowdown of multiple orders of
magnitude.

This problem can become vastly worse on multi-processors and when you start
talking about networked clusters, then things are worse still because the
communication bandwidth is so limited.

The key, therefore, is to use data elements as many times as possible.
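
To make that concrete, here is a sketch of what carefully written code
does on a single node: a tiled multiply performs on the order of TILE
arithmetic operations per element fetched, instead of one (the tile
size is illustrative; in practice it is chosen so that three tiles fit
in cache):

    // Tiled (blocked) dense multiply: C += A * B, all n x n.
    public class TiledMultiply {
        static final int TILE = 64;  // illustrative tile size

        static void multiply(double[][] a, double[][] b, double[][] c) {
            int n = a.length;
            for (int ii = 0; ii < n; ii += TILE)
                for (int kk = 0; kk < n; kk += TILE)
                    for (int jj = 0; jj < n; jj += TILE)
                        // one TILE x TILE block pair stays cache-resident
                        // across ~TILE^3 arithmetic operations
                        for (int i = ii; i < Math.min(ii + TILE, n); i++)
                            for (int k = kk; k < Math.min(kk + TILE, n); k++) {
                                double aik = a[i][k];
                                for (int j = jj; j < Math.min(jj + TILE, n); j++)
                                    c[i][j] += aik * b[k][j];
                            }
        }
    }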

If you use row and column decomposition and dot products as you suggest,
there is some potential for re-use, but it can be difficult.  The basic
intuition is that you have to multiply each row of the left-hand matrix
against every column.  This gives you reuse of the left-hand matrix
elements, but it increases the number of copies of the right-hand matrix
that you have to send around, which defeats the gain for the left-hand
matrix.

An easier way to actually get the gain is to use block decomposition of
the input matrices.  If the input blocks are sufficiently large, then you
will get your re-use just by using an efficient machine-level multiply on
each node.  The blocks can be nearly square or can be strips, but nearly
square blocks are probably a bit easier to get full speed on.
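
A sketch of that decomposition as a keying scheme (in-memory simulation,
illustrative names): each pair of blocks A[I][K] and B[K][J] is multiplied
with an efficient local routine, and the partial block products are summed
under the output key (I,J):

    import java.util.HashMap;
    import java.util.Map;

    public class BlockProduct {
        // local multiply stands in for the efficient per-node routine
        static double[][] localMultiply(double[][] x, double[][] y) {
            double[][] z = new double[x.length][y[0].length];
            for (int i = 0; i < x.length; i++)
                for (int k = 0; k < y.length; k++)
                    for (int j = 0; j < y[0].length; j++)
                        z[i][j] += x[i][k] * y[k][j];
            return z;
        }

        static double[][] add(double[][] acc, double[][] part) {
            for (int i = 0; i < acc.length; i++)
                for (int j = 0; j < acc[i].length; j++)
                    acc[i][j] += part[i][j];
            return acc;
        }

        // aBlocks[I][K] and bBlocks[K][J] hold the sub-matrices
        static Map<String, double[][]> multiply(double[][][][] aBlocks,
                                                double[][][][] bBlocks) {
            Map<String, double[][]> out = new HashMap<>();
            for (int I = 0; I < aBlocks.length; I++)
                for (int K = 0; K < aBlocks[I].length; K++)
                    for (int J = 0; J < bBlocks[K].length; J++)
                        out.merge(I + "," + J,                 // reduce key
                                  localMultiply(aBlocks[I][K], bBlocks[K][J]),
                                  BlockProduct::add);          // summing reduce
            return out;
        }
    }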

A more serious question is whether the inherent n^2 nature of the problem
will defeat you before you really run out of parallel processing steam.  Are
you sure that you need dense matrices?

If it turns out that you don't need dense matrices, then you have a very
different kind of problem.  The same sort of block decomposition strategy
that applied to dense problems can be applied to sparse problems, but you
have to be careful to balance the number of non-zero elements in the
blocks.  For many information retrieval and social algorithm problems, the
data is subject to something like Zipf's law, which can mean that you have
very serious non-uniformity in the distribution of non-zero elements.  The
trick for dealing with this is often to do the block decomposition on a
random permutation of the matrices, which avoids the non-uniformity.
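
A minimal sketch of that permutation trick, assuming square matrices and
a single shared permutation P (with A' = P A P^T and B' = P B P^T, the
dense rows are scattered across blocks, and A'B' = P (A B) P^T can be
mapped back afterwards):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Relabel indices with one shared random permutation so the dense
    // "head" rows of a Zipf-distributed matrix are scattered across blocks.
    // Illustrative code, not from any existing toolkit.
    public class PermuteIndices {
        public static void main(String[] args) {
            int n = 8;
            List<Integer> p = new ArrayList<>();
            for (int i = 0; i < n; i++) p.add(i);
            Collections.shuffle(p, new Random(42)); // same seed for A and B
            // a non-zero stored at (i, j) moves to (p.get(i), p.get(j))
            System.out.println("(0,3) -> (" + p.get(0) + "," + p.get(3) + ")");
        }
    }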

You should also remember that text retrieval engines are essentially
sparse matrix multiplication engines.  You may be able to "cheat" a bit by
using an engine such as Lucene with custom scoring functions, or you might
gain some benefit or inspiration from the code.  Retrieval engines are
particularly useful when it comes to the maintenance of the large
matrices, since that becomes particularly difficult with scale (probably
harder than doing the multiplication, in fact).

I would love to hear more of what you are trying to do and how things turn
out for you.  Please feel free to contact me off-line for more detailed
discussions that would be of less interest to the entire group.


On 12/28/07 12:52 PM, "Milan Simonovic" <am...@gmail.com> wrote:

> Hi all,
> 
>   I'm trying to implement a clustering algorithm on hadoop. Among
>   other things, there're a lot of matrix multiplications. LINA
>   (http://wiki.apache.org/lucene-hadoop/Lina) is probably going to be
>   a perfect fit here, but I can't afford to wait. Btw, I can't find
>   HADOOP-1655 any more, what's going on?
> 
>   Using the ordinary matrix product (the sum of row-by-column products
>   gives one element of the resulting matrix), the easiest way to
>   formulate this computation is to have one row and one column sent to
>   a mapper, with one element of the resulting matrix as the output.
>   The reducer can take this element and put it into the correct
>   position in the output file.
> 
>   I need your advice on how to design input file(s) and how to make
>   input splits then. I'd like to have matrices in separate files
>   (they'll be used for more than one multiplication, and it's cleaner
>   to have them separate).
> 
>   I guess then I'd have to use MultiFileSplit and MultiFileInputFormat
>   somehow. Is it possible at all to send two records (one row and one
>   column, or two rows if the other matrix is stored column-oriented)
>   from two input splits to a single mapper? Or should I look for an
>   alternative way to multiply matrices?
>