Posted to user@mahout.apache.org by kalmohsen <ka...@ahlia.edu.bh> on 2014/09/21 12:34:21 UTC

Spark RDD and Mahout DRM

I have been reading continuously about Mahout, Hadoop, Spark and Scala, hoping
to be able to add value to them. However, I am confused by two things:
Spark's RDD and Mahout's DRM.
I know that Spark's RDD is used while working with Mahout. However, I have
come across some Scala code which uses a Mahout DRM or wraps an RDD into a
DRM.

Thus, could anyone clarify the difference between them?

Thanks in advance
Regards 



Re: Spark RDD and Mahout DRM

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Pardon me; the above was an OLS example, of course. Ridge would require a small
modification: introducing a regularization-rate correction on the main diagonal
of the self-squared X

val w = solve (drmX.t %*% drmX + diag(lambda, drmX.ncol), drmX.t %*% y)
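As a purely local illustration (plain Scala, no Mahout; the closed-form 2x2
solve below is a made-up stand-in for the DSL's distributed solve), adding
lambda to the main diagonal of X'X is all that separates ridge from OLS:

```scala
// Local sketch (no Mahout): ridge vs OLS on a toy normal-equations system.
// X'X and X'y below correspond to a small 3x2 matrix X with an intercept
// column, so the system is 2x2 and can be inverted in closed form.
val xtx = Array(Array(3.0, 6.0), Array(6.0, 14.0)) // X'X
val xty = Array(6.0, 14.0)                          // X'y

// Solve (X'X + lambda*I) w = X'y in closed form for the 2x2 case.
def ridge(lambda: Double): Array[Double] = {
  val a = xtx(0)(0) + lambda; val b = xtx(0)(1)
  val c = xtx(1)(0);          val d = xtx(1)(1) + lambda
  val det = a * d - b * c
  Array((d * xty(0) - b * xty(1)) / det,
        (a * xty(1) - c * xty(0)) / det)
}

val wOls   = ridge(0.0) // lambda = 0 recovers plain OLS
val wRidge = ridge(0.1) // lambda > 0: the overall norm of w shrinks
```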

On Sun, Sep 21, 2014 at 5:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> There are a few things going on with DRM.
>
> First, the Hadoop/MapReduce DRM in Mahout is pretty much constrained to its
> persistent format on HDFS (row-wise pairs of row key and vector).
>
> When we moved to Scala, this notion was expanded further: the DRM became one
> of the types governed by an R-like DSL and by an algebraic optimizer of such
> expressions. E.g., a distributed ridge regression solution under this DSL,
> for a dataset represented by a tall-and-skinny matrix X, would look
> something like this:
>
> val drmX = drmFromHdfs("X")
> val y = .. (y observation vector)
>
> val w = solve (drmX.t %*% drmX, drmX.t %*% y)
>
> Next, the algebraic optimizer optimizes the execution plan for a particular
> engine, one of them being Spark's RDDs. Mahout RDDs in their checkpoint
> format (i.e. a fully formed intermediate RDD result) have a dual
> representation: either row-wise (tuples of key and row vector) or block-wise
> (array of keys -> vertical/horizontal matrix block).
>
> Finally, assuming the back-end engine is Spark's RDDs, it is possible to
> wrap certain RDD types into the DRM type and, vice versa, to get access to
> the checkpoint RDD (e.g. drmX.rdd automatically creates a checkpoint and
> exports the matrix data as an RDD).
>
> For further details, I would hope the Mahout/Spark page makes this a bit
> clearer. There are also a talk and slides from the last Mahout meetup
> discussing the main ideas.
>
> -d

Re: Spark RDD and Mahout DRM

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
There are a few things going on with DRM.

First, the Hadoop/MapReduce DRM in Mahout is pretty much constrained to its
persistent format on HDFS (row-wise pairs of row key and vector).

When we moved to Scala, this notion was expanded further: the DRM became one
of the types governed by an R-like DSL and by an algebraic optimizer of such
expressions. E.g., a distributed ridge regression solution under this DSL,
for a dataset represented by a tall-and-skinny matrix X, would look
something like this:

val drmX = drmFromHdfs("X")
val y = .. (y observation vector)

val w = solve (drmX.t %*% drmX, drmX.t %*% y)
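To make the math behind that one-liner concrete, here is a hypothetical,
purely local sketch in plain Scala (no Mahout or Spark; transpose, matMul and
solve are made-up helpers, whereas in Mahout the products run distributed):
for a tall-and-skinny X, X'X is only ncol x ncol, so the final solve is a
small in-core operation no matter how many rows X has.

```scala
// Local, plain-Scala sketch (no Mahout/Spark) of the normal equations
// w = solve(X'X, X'y). Helper names here are made up for illustration.

type Mat = Array[Array[Double]]

def transpose(a: Mat): Mat =
  Array.tabulate(a(0).length, a.length)((i, j) => a(j)(i))

def matMul(a: Mat, b: Mat): Mat =
  Array.tabulate(a.length, b(0).length) { (i, j) =>
    a(i).indices.map(k => a(i)(k) * b(k)(j)).sum
  }

// Gaussian elimination with partial pivoting: solves A w = b for small A.
def solve(a0: Mat, b0: Array[Double]): Array[Double] = {
  val n = b0.length
  val a = a0.map(_.clone)
  val b = b0.clone
  for (col <- 0 until n) {
    val p = (col until n).maxBy(r => math.abs(a(r)(col)))
    val tmpRow = a(col); a(col) = a(p); a(p) = tmpRow
    val tmpB = b(col); b(col) = b(p); b(p) = tmpB
    for (r <- col + 1 until n) {
      val f = a(r)(col) / a(col)(col)
      for (c <- col until n) a(r)(c) -= f * a(col)(c)
      b(r) -= f * b(col)
    }
  }
  val w = new Array[Double](n)
  for (r <- n - 1 to 0 by -1)
    w(r) = (b(r) - (r + 1 until n).map(c => a(r)(c) * w(c)).sum) / a(r)(r)
  w
}

// Tall-and-skinny X (1000 rows, 2 columns) with observations y = 2*x + 1.
val X = Array.tabulate(1000, 2)((i, j) => if (j == 0) 1.0 else i.toDouble)
val y = Array.tabulate(1000)(i => 2.0 * i + 1.0)

val xt  = transpose(X)            // 2 x 1000
val xtx = matMul(xt, X)           // only 2 x 2, regardless of row count
val xty = xt.map(_.zip(y).map { case (u, v) => u * v }.sum)

val w = solve(xtx, xty)           // fits intercept 1, slope 2
```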

Next, the algebraic optimizer optimizes the execution plan for a particular
engine, one of them being Spark's RDDs. Mahout RDDs in their checkpoint
format (i.e. a fully formed intermediate RDD result) have a dual
representation: either row-wise (tuples of key and row vector) or block-wise
(array of keys -> vertical/horizontal matrix block).
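A hypothetical plain-Scala model of those two layouts (the type and function
names below are illustrative, not Mahout's actual API):

```scala
// Sketch: the two checkpoint representations of a distributed matrix.
// Row-wise is a collection of (key, row vector) tuples; block-wise packs a
// run of rows into one (keys, vertical block) pair per partition.

type RowWise   = Seq[(Long, Array[Double])]
type BlockWise = Seq[(Array[Long], Array[Array[Double]])]

def toBlocks(rows: RowWise, blockHeight: Int): BlockWise =
  rows.grouped(blockHeight).map { g =>
    (g.map(_._1).toArray, g.map(_._2).toArray)
  }.toSeq

def toRows(blocks: BlockWise): RowWise =
  blocks.flatMap { case (keys, block) => keys.zip(block) }

val rowWise: RowWise =
  Seq(0L -> Array(1.0, 2.0), 1L -> Array(3.0, 4.0), 2L -> Array(5.0, 6.0))

val blockWise = toBlocks(rowWise, blockHeight = 2) // blocks of 2 rows + 1 row
val roundTrip = toRows(blockWise)                  // same matrix again
```

The block-wise layout lets per-partition operations work on a whole matrix
block at once instead of row by row.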

Finally, assuming the back-end engine is Spark's RDDs, it is possible to
wrap certain RDD types into the DRM type and, vice versa, to get access to
the checkpoint RDD (e.g. drmX.rdd automatically creates a checkpoint and
exports the matrix data as an RDD).

For further details, I would hope the Mahout/Spark page makes this a bit
clearer. There are also a talk and slides from the last Mahout meetup
discussing the main ideas.

-d





Re: Spark RDD and Mahout DRM

Posted by Ted Dunning <te...@gmail.com>.
An RDD is a Spark structure that holds data in memory, partitioned across a
number of machines.

A DRM is Mahout's concept of a distributed row matrix.  This is mostly an
on-disk concept.


