Posted to dev@mahout.apache.org by Anand Avati <av...@gluster.org> on 2014/04/28 04:57:55 UTC

Mahout DSL vs Spark

Hi Ted, Dmitry,
Background: I am exploring the feasibility of providing H2O distributed
"backend" to the DSL. At a high level it appears that implementing physical
operators for DrmLike over H2O does not seem extremely challenging. All the
operators in the DSL seem to have at least an approximate equivalent in
H2O's own (R-like) DSL, and wiring one operator with another's
implementation seems like a tractable problem.

The reason I write is to better understand the split between the Mahout
DSL and Spark (both current and future). As of today, the DSL seems to be
pretty tightly coupled with Spark.

E.g.:

- DSSVD.scala imports o.a.spark.storage.StorageLevel
- drm.plan.CheckpointAction: the result of exec() and checkpoint() is
DrmRddInput (instead of, say, DrmLike)
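
(To make the second example concrete: the kind of engine-neutral shape I have
in mind, as a sketch with stub types - these are not the actual Mahout
signatures:)

    // Sketch only; minimal stubs so this compiles standalone.
    trait DrmLike[K]                              // logical distributed row matrix
    trait CheckpointedDrm[K] extends DrmLike[K]   // materialized, engine-owned

    // If exec()/checkpoint() returned an engine-neutral trait like this,
    // DrmRddInput (and RDDs in general) could stay inside a Spark backend:
    trait PhysicalTranslator {
      def exec[K](plan: DrmLike[K]): CheckpointedDrm[K]
    }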

Firstly, I don't think I am presenting some new revelation you guys don't
already know - I'm sure you know that the logical vs physical "split" in
the DSL is not absolute (yet).

That being said, I would like to understand whether there are plans or efforts
already underway to make the DSL (i.e. how DSSVD would be written) and the
logical layer (i.e. the drm.plan.* optimizer etc.) more "pure", and to move
the Spark-specific code entirely into the physical domain. I recall Dmitry
mentioning that a new engine other than Spark was also being planned, so I
deduce that some thought has already been given to such "purification".

It would be nice to see changes approximately like:

Rename ./spark => ./dsl
Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
./dsl/src/main/scala/org/apache/mahout/dsl
Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend

along with appropriately renaming packages and imports, and confining
references to RDD and SparkContext completely within spark-backend.
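
As an example of the kind of confinement I mean - a hypothetical sketch, not
existing code - the StorageLevel import in DSSVD could be replaced by an
engine-neutral caching hint that only the Spark backend translates:

    // Hypothetical engine-neutral cache hint (illustrative name):
    object CacheHint extends Enumeration {
      val NONE, MEMORY_ONLY, MEMORY_AND_DISK = Value
    }

    // Only the Spark backend would map the hint onto Spark's StorageLevel:
    object SparkBackendCaching {
      import org.apache.spark.storage.StorageLevel
      def toStorageLevel(h: CacheHint.Value): StorageLevel = h match {
        case CacheHint.MEMORY_ONLY     => StorageLevel.MEMORY_ONLY
        case CacheHint.MEMORY_AND_DISK => StorageLevel.MEMORY_AND_DISK
        case _                         => StorageLevel.NONE
      }
    }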

I think such a clean split would be necessary to introduce more backend
engines. If no efforts are already underway, I would be glad to take on the
DSL "purification" task.

Thanks,
Avati

Re: Mahout DSL vs Spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
yes, it is just one more variable to abstract away, to wrap into something
like MahoutContext.
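
Sketching that idea - hypothetical names, not existing Mahout code:

    // An engine-neutral context handle for the DSL; only the Spark backend
    // ever unwraps it back to a SparkContext.
    trait MahoutContext
    class SparkMahoutContext(val sc: org.apache.spark.SparkContext)
        extends MahoutContext

    // DSL entry points would then take `implicit ctx: MahoutContext`
    // instead of `implicit sc: SparkContext`.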


On Mon, Apr 28, 2014 at 2:43 PM, Anand Avati <av...@gluster.org> wrote:

> Sebastian,
> I'm still not sure how big or small a problem the implicit val is. But I will
> keep the point in the back of my mind as I explore further. I agree that
> the flexibility of backends is a powerful feature making Mahout unique and
> attractive indeed. I hope to see the flexibility exercised purely at
> runtime.
>
>
>
> On Sun, Apr 27, 2014 at 11:11 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > Anand,
> >
> > I'd also love to see work on a cleaner separation between the DSL and
> > Spark. Another thing that should be tackled in the current code is that
> the
> > SparkContext has to be present as implicit val in some methods.
> >
> > Making the DSL run on different systems will be a powerful feature that
> > will make Mahout unique and attractive to a lot of users, as it doesn't
> > enforce a lock-in to a particular system. I've talked to a company
> > recently that had exactly this requirement; they decided against using
> > Spark, but would still be highly interested in running new Mahout
> > recommenders built using the DSL.
> >
> > --sebastian
> >
> >
> >
> > On 04/28/2014 05:39 AM, Anand Avati wrote:
> >
> >> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >>
> >>
> >>>
> >>>
> >>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <av...@gluster.org>
> wrote:
> >>>
> >>>  Hi Ted, Dmitry,
> >>>> Background: I am exploring the feasibility of providing H2O
> distributed
> >>>> "backend" to the DSL.
> >>>>
> >>>>
> >>> Very cool. that actually was one of my initial proposals on how to
> >>> approach this. Got pushed back on this though.
> >>>
> >>>
> >> We are exploring various means of integration. The Jira mentioned
> >> providing
> >> Matrix and Vector implementations as an initial exploration. That task
> by
> >> itself had a lot of value in terms of reconciling some ground level
> issues
> >> (build/mvn compatibility, highlighting some classloader related
> challenges
> >> etc. on the H2O side.) Plugging behind a common DSL makes sense, though
> >> there may be value in other points of integration too, to exploit H2O's
> >> strengths.
> >>
> >>
> >>
> >>>
> >>>  At a high level it appears that implementing physical operators for
> >>>> DrmLike over H2O does not seem extremely challenging. All the
> operators
> >>>> in
> >>>> the DSL seem to have at least an approximate equivalent in H2O's own
> >>>> (R-like) DSL, and wiring one operator with another's implementation
> >>>> seems
> >>>> like a tractable problem.
> >>>>
> >>>>
> >>> It should be tractable, sure, even for map reduce. The question is
> >>> whether
> >>> there's enough diversity to do certain optimizations in a certain way.
> >>> E.g.
> >>> if two matrices are identically partitioned, then do map-side zip
> instead
> >>> of actual parallel join etc.
> >>>
> >>> But it should be tractable, indeed.
> >>>
> >>>
> >>
> >> Yes, H2O has ways to do such things - a single map/reduce task on two
> >> matrices "side by side" which are similarly partitioned (i.e, sharing
> the
> >> same VectorGroup in H2O terminology)
> >>
> >>
> >>
> >>  The reason I write is to better understand the split between the
> Mahout
> >>>
> >>>> DSL and Spark (both current and future). As of today, the DSL seems to
> >>>> be
> >>>> pretty tightly coupled with Spark.
> >>>>
> >>>> E.g:
> >>>>
> >>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
> >>>>
> >>>>
> >>> This is a known thing, I think i noted it somewhere in jira. That, and
> >>> rdd
> >>> property of CheckpointedDRM. This needs to be abstracted away.
> >>>
> >>>
> >>>  - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
> >>>> DrmRddInput (instead of, say, DrmLike)
> >>>>
> >>>>
> >>> CheckpointAction is part of physical layer. This is something that
> would
> >>> have to be completely re-written for a new engine. This is the "plugin"
> >>> api, but it is never user-facing (logical plan facing).
> >>>
> >>>
> >> It somehow felt that the optimizer was logical-ish. Do you mean the
> >> optimizations in CheckpointAction are specific to Spark and cannot in
> >> general be inherited by other backends (not that I think that is wrong)?
> >>
> >>
> >>
> >>>  Firstly, I don't think I am presenting some new revelation you guys
> >>>> don't
> >>>> already know - I'm sure you know that the logical vs physical "split"
> in
> >>>> the DSL is not absolute (yet).
> >>>>
> >>>>
> >>> Aha. Exactly
> >>>
> >>>
> >>>
> >>>> That being said, I would like to understand if there are plans, or
> >>>> efforts already underway to make the DSL (i.e how DSSVD would be
> >>>> written)
> >>>> and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and
> >>>> move
> >>>> the Spark specific code entirely into the physical domain. I recall
> >>>> Dmitry
> >>>> mentioning that a new engine other than Spark was also being planned,
> >>>> therefore I deduce some thought for such "purification" has already
> been
> >>>> applied.
> >>>>
> >>>>
> >>> Aha. The hope is for Stratosphere. But there are a few items that need to
> >>> be
> >>> done by Stratosphere folks before we can leverage it fully. Or, let's
> >>> say,
> >>> leverage it much better than we otherwise could. Makes sense to wait a
> >>> bit.
> >>>
> >>>
> >>>
> >>>> It would be nice to see changes approximately like:
> >>>>
> >>>> Rename ./spark => ./dsl
> >>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
> >>>> ./dsl/src/main/scala/org/apache/mahout/dsl
> >>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
> >>>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
> >>>>
> >>>>
> >>> i was thinking along the lines of factoring out public traits and logical
> >>> operators (DRMLike etc.)  out of spark module into independent module
> >>> without particular engine dependencies. Exactly. It just hasn't come to
> >>> that yet.
> >>>
> >>>
> >>>  along with appropriately renaming packages and imports, and confining
> >>>> references to RDD and SparkContext completely within spark-backend.
> >>>>
> >>>> I think such a clean split would be necessary to introduce more
> backend
> >>>> engines. If no efforts are already underway, I would be glad to take
> on
> >>>> the
> >>>> DSL "purification" task.
> >>>>
> >>>>
> >>> i think you got very close to my thinking about further steps here.
> Like
> >>> i
> >>> said, i was just idling in wait for something like Stratosphere to
> become
> >>> closer to our orbit.
> >>>
> >>>
> >> OK, I think there is reasonable alignment on the goal. But you were not
> >> clear on whether you are going to be doing the purification split in the
> >> near future, or is that still an "unassigned task" which I can pick up?
> >>
> >> Avati
> >>
> >>
> >
>

Re: Mahout DSL vs Spark

Posted by Anand Avati <av...@gluster.org>.
Sebastian,
I'm still not sure how big or small a problem the implicit val is. But I will
keep the point in the back of my mind as I explore further. I agree that
the flexibility of backends is a powerful feature making Mahout unique and
attractive indeed. I hope to see the flexibility exercised purely at
runtime.



On Sun, Apr 27, 2014 at 11:11 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Anand,
>
> I'd also love to see work on a cleaner separation between the DSL and
> Spark. Another thing that should be tackled in the current code is that the
> SparkContext has to be present as implicit val in some methods.
>
> Making the DSL run on different systems will be a powerful feature that
> will make Mahout unique and attractive to a lot of users, as it doesn't
> enforce a lock-in to a particular system. I've talked to a company recently
> that had exactly this requirement; they decided against using Spark, but
> would still be highly interested in running new Mahout recommenders built
> using the DSL.
>
> --sebastian
>
>
>
> On 04/28/2014 05:39 AM, Anand Avati wrote:
>
>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>
>>
>>>
>>>
>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <av...@gluster.org> wrote:
>>>
>>>  Hi Ted, Dmitry,
>>>> Background: I am exploring the feasibility of providing H2O distributed
>>>> "backend" to the DSL.
>>>>
>>>>
>>> Very cool. that actually was one of my initial proposals on how to
>>> approach this. Got pushed back on this though.
>>>
>>>
>> We are exploring various means of integration. The Jira mentioned
>> providing
>> Matrix and Vector implementations as an initial exploration. That task by
>> itself had a lot of value in terms of reconciling some ground level issues
>> (build/mvn compatibility, highlighting some classloader related challenges
>> etc. on the H2O side.) Plugging behind a common DSL makes sense, though
>> there may be value in other points of integration too, to exploit H2O's
>> strengths.
>>
>>
>>
>>>
>>>  At a high level it appears that implementing physical operators for
>>>> DrmLike over H2O does not seem extremely challenging. All the operators
>>>> in
>>>> the DSL seem to have at least an approximate equivalent in H2O's own
>>>> (R-like) DSL, and wiring one operator with another's implementation
>>>> seems
>>>> like a tractable problem.
>>>>
>>>>
>>> It should be tractable, sure, even for map reduce. The question is
>>> whether
>>> there's enough diversity to do certain optimizations in a certain way.
>>> E.g.
>>> if two matrices are identically partitioned, then do map-side zip instead
>>> of actual parallel join etc.
>>>
>>> But it should be tractable, indeed.
>>>
>>>
>>
>> Yes, H2O has ways to do such things - a single map/reduce task on two
>> matrices "side by side" which are similarly partitioned (i.e, sharing the
>> same VectorGroup in H2O terminology)
>>
>>
>>
>>  The reason I write is to better understand the split between the Mahout
>>>
>>>> DSL and Spark (both current and future). As of today, the DSL seems to
>>>> be
>>>> pretty tightly coupled with Spark.
>>>>
>>>> E.g:
>>>>
>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>>
>>>>
>>> This is a known thing, I think i noted it somewhere in jira. That, and
>>> rdd
>>> property of CheckpointedDRM. This needs to be abstracted away.
>>>
>>>
>>>  - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>>>> DrmRddInput (instead of, say, DrmLike)
>>>>
>>>>
>>> CheckpointAction is part of physical layer. This is something that would
>>> have to be completely re-written for a new engine. This is the "plugin"
>>> api, but it is never user-facing (logical plan facing).
>>>
>>>
>> It somehow felt that the optimizer was logical-ish. Do you mean the
>> optimizations in CheckpointAction are specific to Spark and cannot in
>> general be inherited by other backends (not that I think that is wrong)?
>>
>>
>>
>>>  Firstly, I don't think I am presenting some new revelation you guys
>>>> don't
>>>> already know - I'm sure you know that the logical vs physical "split" in
>>>> the DSL is not absolute (yet).
>>>>
>>>>
>>> Aha. Exactly
>>>
>>>
>>>
>>>> That being said, I would like to understand if there are plans, or
>>>> efforts already underway to make the DSL (i.e how DSSVD would be
>>>> written)
>>>> and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and
>>>> move
>>>> the Spark specific code entirely into the physical domain. I recall
>>>> Dmitry
>>>> mentioning that a new engine other than Spark was also being planned,
>>>> therefore I deduce some thought for such "purification" has already been
>>>> applied.
>>>>
>>>>
>>> Aha. The hope is for Stratosphere. But there are a few items that need to
>>> be
>>> done by Stratosphere folks before we can leverage it fully. Or, let's
>>> say,
>>> leverage it much better than we otherwise could. Makes sense to wait a
>>> bit.
>>>
>>>
>>>
>>>> It would be nice to see changes approximately like:
>>>>
>>>> Rename ./spark => ./dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
>>>>
>>>>
>>> i was thinking along the lines of factoring out public traits and logical
>>> operators (DRMLike etc.)  out of spark module into independent module
>>> without particular engine dependencies. Exactly. It just hasn't come to
>>> that yet.
>>>
>>>
>>>  along with appropriately renaming packages and imports, and confining
>>>> references to RDD and SparkContext completely within spark-backend.
>>>>
>>>> I think such a clean split would be necessary to introduce more backend
>>>> engines. If no efforts are already underway, I would be glad to take on
>>>> the
>>>> DSL "purification" task.
>>>>
>>>>
>>> i think you got very close to my thinking about further steps here. Like
>>> i
>>> said, i was just idling in wait for something like Stratosphere to become
>>> closer to our orbit.
>>>
>>>
>> OK, I think there is reasonable alignment on the goal. But you were not
>> clear on whether you are going to be doing the purification split in the
>> near future, or is that still an "unassigned task" which I can pick up?
>>
>> Avati
>>
>>
>

Re: Mahout DSL vs Spark

Posted by Sebastian Schelter <ss...@apache.org>.
Anand,

I'd also love to see work on a cleaner separation between the DSL and 
Spark. Another thing that should be tackled in the current code is that 
the SparkContext has to be present as implicit val in some methods.
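
Roughly the current pattern, as a sketch - stub types, not the exact Mahout
signatures:

    object ImplicitCoupling {
      class SparkContext          // stand-in for org.apache.spark.SparkContext
      trait CheckpointedDrm[K]

      // Entry points shaped like this force a SparkContext into the caller's
      // implicit scope, tying otherwise engine-free code to Spark:
      def drmFromHdfs(path: String)(implicit sc: SparkContext): CheckpointedDrm[Int] = ???
    }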

Making the DSL run on different systems will be a powerful feature that 
will make Mahout unique and attractive to a lot of users, as it doesn't 
enforce a lock-in to a particular system. I've talked to a company
recently that had exactly this requirement; they decided against using
Spark, but would still be highly interested in running new Mahout
recommenders built using the DSL.

--sebastian


On 04/28/2014 05:39 AM, Anand Avati wrote:
> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>>
>>
>>
>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <av...@gluster.org> wrote:
>>
>>> Hi Ted, Dmitry,
>>> Background: I am exploring the feasibility of providing H2O distributed
>>> "backend" to the DSL.
>>>
>>
>> Very cool. that actually was one of my initial proposals on how to
>> approach this. Got pushed back on this though.
>>
>
> We are exploring various means of integration. The Jira mentioned providing
> Matrix and Vector implementations as an initial exploration. That task by
> itself had a lot of value in terms of reconciling some ground level issues
> (build/mvn compatibility, highlighting some classloader related challenges
> etc. on the H2O side.) Plugging behind a common DSL makes sense, though
> there may be value in other points of integration too, to exploit H2O's
> strengths.
>
>
>>
>>
>>> At a high level it appears that implementing physical operators for
>>> DrmLike over H2O does not seem extremely challenging. All the operators in
>>> the DSL seem to have at least an approximate equivalent in H2O's own
>>> (R-like) DSL, and wiring one operator with another's implementation seems
>>> like a tractable problem.
>>>
>>
>> It should be tractable, sure, even for map reduce. The question is whether
>> there's enough diversity to do certain optimizations in a certain way. E.g.
>> if two matrices are identically partitioned, then do map-side zip instead
>> of actual parallel join etc.
>>
>> But it should be tractable, indeed.
>>
>
>
> Yes, H2O has ways to do such things - a single map/reduce task on two
> matrices "side by side" which are similarly partitioned (i.e, sharing the
> same VectorGroup in H2O terminology)
>
>
>
>> The reason I write is to better understand the split between the Mahout
>>> DSL and Spark (both current and future). As of today, the DSL seems to be
>>> pretty tightly coupled with Spark.
>>>
>>> E.g:
>>>
>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>
>>
>> This is a known thing, I think i noted it somewhere in jira. That, and rdd
>> property of CheckpointedDRM. This needs to be abstracted away.
>>
>>
>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>>> DrmRddInput (instead of, say, DrmLike)
>>>
>>
>> CheckpointAction is part of physical layer. This is something that would
>> have to be completely re-written for a new engine. This is the "plugin"
>> api, but it is never user-facing (logical plan facing).
>>
>
> It somehow felt that the optimizer was logical-ish. Do you mean the
> optimizations in CheckpointAction are specific to Spark and cannot in
> general be inherited by other backends (not that I think that is wrong)?
>
>
>>
>>> Firstly, I don't think I am presenting some new revelation you guys don't
>>> already know - I'm sure you know that the logical vs physical "split" in
>>> the DSL is not absolute (yet).
>>>
>>
>> Aha. Exactly
>>
>>
>>>
>>> That being said, I would like to understand if there are plans, or
>>> efforts already underway to make the DSL (i.e how DSSVD would be written)
>>> and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and move
>>> the Spark specific code entirely into the physical domain. I recall Dmitry
>>> mentioning that a new engine other than Spark was also being planned,
>>> therefore I deduce some thought for such "purification" has already been
>>> applied.
>>>
>>
>> Aha. The hope is for Stratosphere. But there are a few items that need to be
>> done by Stratosphere folks before we can leverage it fully. Or, let's say,
>> leverage it much better than we otherwise could. Makes sense to wait a bit.
>>
>>
>>>
>>> It would be nice to see changes approximately like:
>>>
>>> Rename ./spark => ./dsl
>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>> ./dsl/src/main/scala/org/apache/mahout/dsl
>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
>>>
>>
>> i was thinking along the lines of factoring out public traits and logical
>> operators (DRMLike etc.)  out of spark module into independent module
>> without particular engine dependencies. Exactly. It just hasn't come to
>> that yet.
>>
>>
>>> along with appropriately renaming packages and imports, and confining
>>> references to RDD and SparkContext completely within spark-backend.
>>>
>>> I think such a clean split would be necessary to introduce more backend
>>> engines. If no efforts are already underway, I would be glad to take on the
>>> DSL "purification" task.
>>>
>>
>> i think you got very close to my thinking about further steps here. Like i
>> said, i was just idling in wait for something like Stratosphere to become
>> closer to our orbit.
>>
>
> OK, I think there is reasonable alignment on the goal. But you were not
> clear on whether you are going to be doing the purification split in the
> near future, or is that still an "unassigned task" which I can pick up?
>
> Avati
>


Re: Mahout DSL vs Spark

Posted by Anand Avati <av...@gluster.org>.
On Mon, Apr 28, 2014 at 9:48 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
>
> On Sun, Apr 27, 2014 at 8:39 PM, Anand Avati <av...@gluster.org> wrote:
>
>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>>
>>>
>>>
>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <av...@gluster.org> wrote:
>>>
>>>> Hi Ted, Dmitry,
>>>> Background: I am exploring the feasibility of providing H2O distributed
>>>> "backend" to the DSL.
>>>>
>>>
>>>
>>>
>>
>> Yes, H2O has ways to do such things - a single map/reduce task on two
>> matrices "side by side" which are similarly partitioned (i.e, sharing the
>> same VectorGroup in H2O terminology)
>>
>
> Ok. I guess another question i had was about internal data representation.
> First, a distributed architecture assumes the engines are agnostic of the
> type of payload as long as external serialization is provided. The way it
> was explained so far, h2o is tightly bound to a particular data
> representation in the back.
> Second, what we do here in Mahout is we assume the back end can make data
> available in the form of vertical Matrix blocks to user closures running in
> the backend.
>

H2O's natural orientation is column-optimized. So the user closures running
in the backend would encounter horizontal Matrix blocks.


> Again, it was repeatedly explained that h2o has no matrix representation
> for backend things
>

H2O has a strong 2-D "Frame". The Matrix abstraction over it is what was
built in github.com/tdunning/h2o-matrix, which mostly provides Matrix-ish
sounding names for functionality that already existed on an H2O Frame.
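
The facade idea, roughly - a simplified sketch from memory, not the real
h2o-matrix code, and the Frame/Vec method names here should be treated as
assumptions:

    // Matrix-ish names over an existing H2O Frame (illustrative only):
    class FrameAsMatrix(frame: water.fvec.Frame) {
      def columnSize(): Int = frame.numCols()           // Matrix-speak for #cols
      def rowSize(): Long = frame.numRows()             // ... and #rows
      def get(row: Long, col: Int): Double = frame.vecs()(col).at(row)
    }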


> So it looks like we cannot plug in mahout-math as the backend blockwise
> matrix representation, nor do we have access to an alternative Matrix-based
> vertical blocking. How is that to be resolved, in your opinion?
>

We can trivially provide (sub-)Matrix access with horizontal blocking in
H2O's mapreduce() - i.e., the mapper method in H2O's map/reduce API gets
access to a batch of rows, local to the compute node, one batch per mapper
call. This is almost natural to H2O. The per-row mapper API in H2OMatrix is
a wrapper around the per-rowbatch internal API. And I think the horizontal
vs. vertical orientation is an arbitrary choice and a reconcilable problem
(transparently transposing the matrix in the H2OMatrix layer).
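
For instance, a per-rowbatch bridge could hand the closure a mahout-math
matrix - a sketch under the assumption that a batch is materialized as dense
rows; this is not actual H2O or Mahout code:

    import org.apache.mahout.math.{DenseMatrix, Matrix}

    object RowBatchBridge {
      // Wrap one node-local batch of rows into an in-core matrix for a
      // block-wise user closure:
      def rowBatchToMatrix(rows: Array[Array[Double]]): Matrix =
        new DenseMatrix(rows)

      // The closure would then see (rowKeys, block) pairs:
      type BlockFn = (Array[Int], Matrix) => (Array[Int], Matrix)
    }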


>>
>>> The reason I write is to better understand the split between the Mahout
>>>> DSL and Spark (both current and future). As of today, the DSL seems to be
>>>> pretty tightly coupled with Spark.
>>>>
>>>>  E.g:
>>>>
>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>>
>>>
>>> This is a known thing, I think i noted it somewhere in jira. That, and
>>> rdd property of CheckpointedDRM. This needs to be abstracted away.
>>>
>>>
>>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>>>> DrmRddInput (instead of, say, DrmLike)
>>>>
>>>
>>> CheckpointAction is part of physical layer. This is something that would
>>> have to be completely re-written for a new engine. This is the "plugin"
>>> api, but it is never user-facing (logical plan facing).
>>>
>>
>> It somehow felt that the optimizer was logical-ish. Do you mean the
>> optimizations in CheckpointAction are specific to Spark and cannot in
>> general be inherited by other backends (not that I think that is wrong)?
>>
>
> Well there are 3 things. There's the logical plan (operator DAG), there's a
> physical DAG, and there's the optimizer rewrite & cost logic that transforms
> logical into physical.
>
> The logical DAG is user-facing and the top level. However, IMO the logic
> that rewrites it into a physical DAG should be engine-specific in order to
> be able to capitalize on engine-specific things. It probably would share a
> lot of commonalities (e.g. we could just maintain a common pool of physical
> operators, assuming some commonalities between physical engine
> implementations), but the cost-rewriting part should still be specific even
> if it is very similar to an existing one.
>
> I also want to reserve future work for a spark optimizer exclusively that
> calls upon advanced dynamic load scheduling techniques that were
> thoroughly investigated in the SystemML project.
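
Restating the three layers as a standalone sketch, with illustrative names
(not the actual Mahout types):

    trait DrmLike[K]                      // 1. logical DAG node (user-facing)
    trait PhysicalOp[K]                   // 2. physical DAG node (engine-owned)
    trait EngineOptimizer {               // 3. per-engine cost-based rewriting
      def rewrite[K](logical: DrmLike[K]): PhysicalOp[K]
    }
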
>
>
>>
>>
>>>
>>>> Firstly, I don't think I am presenting some new revelation you guys
>>>> don't already know - I'm sure you know that the logical vs physical "split"
>>>> in the DSL is not absolute (yet).
>>>>
>>>
>>> Aha. Exactly
>>>
>>>
>>>>
>>>> That being said, I would like to understand if there are plans, or
>>>> efforts already underway to make the DSL (i.e how DSSVD would be written)
>>>> and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and move
>>>> the Spark specific code entirely into the physical domain. I recall Dmitry
>>>> mentioning that a new engine other than Spark was also being planned,
>>>> therefore I deduce some thought for such "purification" has already been
>>>> applied.
>>>>
>>>
>>> Aha. The hope is for Stratosphere. But there are a few items that need to
>>> be done by Stratosphere folks before we can leverage it fully. Or, let's
>>> say, leverage it much better than we otherwise could. Makes sense to wait a
>>> bit.
>>>
>>>
>>>>
>>>> It would be nice to see changes approximately like:
>>>>
>>>> Rename ./spark => ./dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
>>>>
>>>
>>> i was thinking along the lines of factoring out public traits and logical
>>> operators (DRMLike etc.)  out of spark module into independent module
>>> without particular engine dependencies. Exactly. It just hasn't come to
>>> that yet.
>>>
>>>
>>>> along with appropriately renaming packages and imports, and confining
>>>> references to RDD and SparkContext completely within spark-backend.
>>>>
>>>> I think such a clean split would be necessary to introduce more backend
>>>> engines. If no efforts are already underway, I would be glad to take on the
>>>> DSL "purification" task.
>>>>
>>>
>>> i think you got very close to my thinking about further steps here. Like
>>> i said, i was just idling in wait for something like Stratosphere to become
>>> closer to our orbit.
>>>
>>
>> OK, I think there is reasonable alignment on the goal. But you were not
>> clear on whether you are going to be doing the purification split in the
>> near future, or is that still an "unassigned task" which I can pick up?
>>
> yes it is unassigned and frankly i thought i might want to continue
> working on this separation. However, you are welcome to take a stab,
> especially if you see a clear path for implementing the mapBlock() operator
> in h2o per my questions above without changing its signatures.
>

Subject to my understanding that mapBlock() slices a matrix into batches of
columns, I am tempted to believe the signature need not change. I am not
too concerned about the row vs column orientation just yet.
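
The shape I believe needs preserving, approximately - a standalone sketch with
stub types, not the actual Mahout sources:

    trait Matrix                // stand-in for mahout-math's in-core Matrix
    trait DrmLike[K] {
      // Keys ride along with their block; ncol lets the closure change the
      // column count of the result:
      def mapBlock[R](ncol: Int = -1)
                     (bmf: (Array[K], Matrix) => (Array[R], Matrix)): DrmLike[R]
    }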

Avati

Re: Mahout DSL vs Spark

Posted by Anand Avati <av...@gluster.org>.
On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
>
> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <av...@gluster.org> wrote:
>
>> Hi Ted, Dmitry,
>> Background: I am exploring the feasibility of providing H2O distributed
>> "backend" to the DSL.
>>
>
> Very cool. that actually was one of my initial proposals on how to
> approach this. Got pushed back on this though.
>

We are exploring various means of integration. The Jira mentioned providing
Matrix and Vector implementations as an initial exploration. That task by
itself had a lot of value in terms of reconciling some ground level issues
(build/mvn compatibility, highlighting some classloader related challenges
etc. on the H2O side.) Plugging behind a common DSL makes sense, though
there may be value in other points of integration too, to exploit H2O's
strengths.


>
>
>> At a high level it appears that implementing physical operators for
>> DrmLike over H2O does not seem extremely challenging. All the operators in
>> the DSL seem to have at least an approximate equivalent in H2O's own
>> (R-like) DSL, and wiring one operator with another's implementation seems
>> like a tractable problem.
>>
>
> It should be tractable, sure, even for map reduce. The question is whether
> there's enough diversity to do certain optimizations in a certain way. E.g.
> if two matrices are identically partitioned, then do map-side zip instead
> of actual parallel join etc.
>
> But it should be tractable, indeed.
>


Yes, H2O has ways to do such things - a single map/reduce task on two
matrices "side by side" which are similarly partitioned (i.e., sharing the
same VectorGroup in H2O terminology).
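
Conceptually - a plain Scala sketch of the no-shuffle idea, not H2O API: when
two matrices share a partitioning, an elementwise op is one local pass over
co-located pairs of rows, with no join:

    object MapSideZip {
      // Each (ra, rb) pair is a co-located row of A and B on the same node;
      // identical partitioning means no shuffle or join is needed:
      def zipAdd(a: Iterator[Array[Double]],
                 b: Iterator[Array[Double]]): Iterator[Array[Double]] =
        a.zip(b).map { case (ra, rb) =>
          Array.tabulate(ra.length)(i => ra(i) + rb(i))
        }
    }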



> The reason I write is to better understand the split between the Mahout
>> DSL and Spark (both current and future). As of today, the DSL seems to be
>> pretty tightly coupled with Spark.
>>
>> E.g:
>>
>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>
>
> This is a known thing, I think i noted it somewhere in jira. That, and rdd
> property of CheckpointedDRM. This needs to be abstracted away.
>
>
>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>> DrmRddInput (instead of, say, DrmLike)
>>
>
> CheckpointAction is part of physical layer. This is something that would
> have to be completely re-written for a new engine. This is the "plugin"
> api, but it is never user-facing (logical plan facing).
>

It somehow felt that the optimizer was logical-ish. Do you mean the
optimizations in CheckpointAction are specific to Spark and cannot in
general be inherited by other backends (not that I think that is wrong)?


>
>> Firstly, I don't think I am presenting some new revelation you guys don't
>> already know - I'm sure you know that the logical vs physical "split" in
>> the DSL is not absolute (yet).
>>
>
> Aha. Exactly
>
>
>>
>> That being said, I would like to understand if there are plans, or
>> efforts already underway to make the DSL (i.e how DSSVD would be written)
>> and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and move
>> the Spark specific code entirely into the physical domain. I recall Dmitry
>> mentioning that a new engine other than Spark was also being planned,
>> therefore I deduce some thought for such "purification" has already been
>> applied.
>>
>
> Aha. The hope is for Stratosphere. But there are a few items that need to be
> done by Stratosphere folks before we can leverage it fully. Or, let's say,
> leverage it much better than we otherwise could. Makes sense to wait a bit.
>
>
>>
>> It would be nice to see changes approximately like:
>>
>> Rename ./spark => ./dsl
>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>> ./dsl/src/main/scala/org/apache/mahout/dsl
>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
>>
>
> i was thinking along the lines of factoring out public traits and logical
> operators (DRMLike etc.)  out of spark module into independent module
> without particular engine dependencies. Exactly. It just hasn't come to
> that yet.
>
>
>> along with appropriately renaming packages and imports, and confining
>> references to RDD and SparkContext completely within spark-backend.
>>
>> I think such a clean split would be necessary to introduce more backend
>> engines. If no efforts are already underway, I would be glad to take on the
>> DSL "purification" task.
>>
>
> i think you got very close to my thinking about further steps here. Like i
> said, i was just idling in wait for something like Stratosphere to become
> closer to our orbit.
>

OK, I think there is reasonable alignment on the goal. But you were not
clear on whether you are going to be doing the purification split in the
near future, or is that still an "unassigned task" which I can pick up?

Avati