Posted to user@mahout.apache.org by Marc Sturlese <ma...@gmail.com> on 2011/10/01 12:18:34 UTC

Re: about DistributedRowMatrix implementation

Well, after digging into the code and doing some tests, I've seen that what I
was asking for is not possible. Mahout will only let you do a distributed
matrix multiplication of 2 sparse matrices, as the representation of a whole
row or column has to fit in memory. Actually, a whole row and a whole column
have to fit in memory each time (as it uses the CompositeInputFormat).
To do dense matrix multiplication with Hadoop I just found this:
http://homepage.mac.com/j.norstad/matrix-multiply/index.html
But the data generated by the maps will be extremely huge and the job will
take ages (depending, of course, on the number of nodes).
I've seen around that Hama and R are possible solutions too. Any advice,
comment or experience?
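
For a rough sense of scale (purely illustrative numbers): a single dense row
of 10^8 double-precision columns is already about

    10^8 columns * 8 bytes/double  ~=  800 MB

so holding a whole row plus a whole column per map call stops being workable
well before the matrices get very large.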


--
View this message in context: http://lucene.472066.n3.nabble.com/about-DistributedRowMatrix-implementation-tp3375372p3384669.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: about DistributedRowMatrix implementation

Posted by Dmitriy Lyubimov <dl...@apache.org>.
Although I sense the discussion is really about a bit more than just reading
inputs one element at a time.

Yes, I guess multiplication is generally 2 passes unless it is a map-side
join, which I think has more demanding prerequisites for the input than a
general DRM assumes. I thought map-side joins require the same sort order and
partitioning on both inputs, and a DRM doesn't guarantee that in the most
general case? Although I have a pretty vague idea of how exactly that
particular input format does what it does. It is not supported in the new
API, and I didn't want to go back to the deprecated stuff just to have it.
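
For reference, a minimal sketch of the kind of map-side join setup that input
format implies, using the deprecated mapred join package. The class name and
paths below are placeholders, not Mahout's actual job code:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideJoinSketch {
      public static void main(String[] args) {
        // Prerequisite discussed above: both inputs must be sorted and
        // partitioned identically for the join to happen map-side.
        JobConf conf = new JobConf(MapSideJoinSketch.class);
        Path a = new Path(args[0]);   // e.g. rows of matrix A
        Path b = new Path(args[1]);   // e.g. rows of matrix B
        conf.set("mapred.join.expr",
            CompositeInputFormat.compose("inner",
                SequenceFileInputFormat.class, a, b));
        conf.setInputFormat(CompositeInputFormat.class);
        // Each map() call then receives a TupleWritable holding the two
        // records that share a key, i.e. the matching rows of A and B.
      }
    }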

Alright, please never mind.

Re: about DistributedRowMatrix implementation

Posted by Dmitriy Lyubimov <dl...@apache.org>.
I have a branch on GitHub that equips VectorWritable with a preprocessor via
a Configurable Hadoop interface and happily preprocesses input element by
element without creating any heap object in memory.

I proposed to contribute that approach a year ago but it was rejected, afaik,
on the grounds that a push-style preprocessor is a "bad" or "confusing"
pattern to have.

If you want, I can dig that patch out for judgement again.

The benefits of this patch are significant. For one, unbounding the width of
the input with respect to memory, reducing garbage collector pressure, not
having to have a lot of memory (actually, any extra heap memory) for wide
matrices... it makes sense all around anywhere you look at it. Except for the
"bad" pattern.

One thing is certain, though: it is totally possible (and actually the
version of ssvd we were using ran exactly on that
projection-as-a-single-element-preprocessor pattern).
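
The patch itself isn't shown here, so the following is only a guess at the
shape of such a push-style preprocessor: a callback that a streaming variant
of VectorWritable would invoke per element during deserialization, so no
Vector is ever materialized on the heap. All names below are made up for
illustration:

    import java.io.DataInput;
    import java.io.IOException;

    // Hypothetical sketch; none of these types exist in Mahout. The point is
    // that readFields() pushes elements to a callback instead of building a
    // Vector object in memory.
    interface ElementCallback {
      void element(int index, double value);
    }

    class StreamingVectorReader {
      // Assumes a simple (count, then index/value pairs) layout purely for
      // illustration; the real VectorWritable wire format is more involved.
      void readFields(DataInput in, ElementCallback cb) throws IOException {
        int nonZeros = in.readInt();
        for (int i = 0; i < nonZeros; i++) {
          cb.element(in.readInt(), in.readDouble());   // no heap vector built
        }
      }
    }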

Re: about DistributedRowMatrix implementation

Posted by Jake Mannix <ja...@gmail.com>.
Marc,

  If you want to do element-at-a-time multiplication, without putting both
row and
column in memory at a time, this is totally doable, but just not
implemented
in Mahout yet.  The current implementation manages to do it in one
map-reduce
pass by doing a map-side join (the CompositeInputFormat thing), but in
general
if you don't do a map-side join, it's 2 passes.  In which case, doing this
element at a time instead of row/column at a time is also 2 passes, and
has no restrictions on how much is in memory at a time.
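
To make the two passes concrete, here is the same dataflow as a tiny
in-memory sketch (plain Java, no Hadoop): pass one would join A(i,k) with
B(k,j) on the shared index k and emit partial products keyed by (i,j); pass
two would sum the partial products per key. The merge() call below stands in
for "emit in pass one, sum in pass two":

    import java.util.HashMap;
    import java.util.Map;

    // In-memory sketch of two-pass, element-at-a-time C = A * B.
    public class TwoPassMultiplySketch {
      public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        Map<String, Double> c = new HashMap<>();
        for (int k = 0; k < b.length; k++) {        // join on shared index k
          for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b[k].length; j++) {
              // one record per partial product a[i][k] * b[k][j], keyed (i,j)
              c.merge(i + "," + j, a[i][k] * b[k][j], Double::sum);
            }
          }
        }
        // prints the four sums, e.g. 0,0 -> 19.0, 0,1 -> 22.0, 1,0 -> 43.0, 1,1 -> 50.0
        System.out.println(c);
      }
    }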

  I've had some code lying around which started on doing this, but never
had a need just yet.  If you open up a JIRA ticket for this, I could post
my code fragments so far, and maybe you (or someone else) could help
finish it off.

  Can you describe a bit how big your matrices are?  Dense matrix
multiplication is an O(N^3) operation, so if N is so large that even
one row or column cannot fit in memory, then N^3 is not going to finish
any time this year or next, from what I can tell.
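
To put rough, purely illustrative numbers on that: if one dense row of
doubles does not fit in, say, a 4 GB mapper heap, then

    N    >  4 GB / 8 bytes per double  =  5 * 10^8
    N^3  ~  1.25 * 10^26 multiply-adds

and even a cluster sustaining 10^12 multiply-adds per second would need on
the order of 10^14 seconds, i.e. a few million years.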

  -jake
