Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2014/04/28 11:18:08 UTC

Straw poll re: H2O ?

On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
<ji...@apache.org> wrote:
> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what is encouraging the h2o work to be done a bit apart. It simply isn't efficient to have to answer so many off-topic points whenever any reports on work in progress are given.
>
> I think this has been the off-topic here.
>
> Calling my comments "emotional" or "non-technical", or _loosely_ paraphrasing me.

Yes, the personal finger-pointing parts don't belong and don't
convince anyone, let's skip those.

From the sidelines, I see a bunch of work intended for Mahout
proceeding outside the community such as it is, and even Apache. Of
course, contributions are always prepped externally to some degree. I
create, debug, change patches before posting them, maybe checking in
early on choices that others may want input on.

This is a large-ish change being proposed, IIUC. I can see one person
who publicly, and at least two who privately, have clear reservations
about this direction. It certainly appears funny vis-a-vis the "Apache
way" to work on a contribution *because* one (or more) other
committers aren't convinced.

I don't think that's important to dither about. What is, is this: if a
big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
Nobody can pre-judge his/her opinion on a proposal that's not tabled
yet, but it seems like a quite possible outcome.

Would be a shame to do a lot of work, intending it for a commit, and
then find there is not consensus.

So is it better to figure out earlier than later whether these 2+
parallel tracks have enough commonality to coexist? is there anything
to VOTE on? is there any baby-step change that everyone agrees on that
unifies the efforts?

Tangent:

A post to the incubator list yesterday noted that many new incubator
projects overlapped a whole lot with other projects. Redundancy is OK
per se; Apache has no "product portfolio" to rationalize. But if
people are finding it easier to set up a new separate project rather
than try to participate in an existing one, that's bad for
communities.

I am not so against new projects. Let a thousand flowers bloom and see
which survive. There's already a powerful incentive to join forces to
gain critical mass. Are there really separate projects here? there may
not be much overlap anyway!

Would it be better to model it that way? what about Apache H2O
incubating? what about Apache, um, "Spathi" incubating or something
for the Spark + Math efforts? It's almost less confusing to treat
either of these as "Mahout 2" as they are quite different from the
just-retired codebase.

Put another way: if any such effort could *not* be an incubating
project, can it really be the replacement future for Mahout? or do
these need to start again in the incubator?

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Mon, Apr 28, 2014 at 3:01 PM, Anand Avati <av...@gluster.org> wrote:

>
>
> The initial approach of allowing implementation of algo in Java language
> did follow this approach. There is an example KMeans implemented in Java in
> "Mahout style" -
>
> https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java
> - that works, even in a distributed way. But the focus now has been to do
> something similar through the DSL rather than Java lang - which is really
> the interesting part of the integration. I can only hope that the entire
> effort does not get prematurely dismissed or killed purely on a matter of
> principle!
>
>
Ok, Anand, if I am allowed to make a suggestion, try to use those exact
concerns (and all smaller ones that may lead to individual mahout commits)
to form a few smaller issues and file them separately with clearly defined
intent as you just did.

Also if you intend to file an exploratory issue for the sole purpose of
exchanging opinions, rather than as final contribution work, please
clearly say so in the name and description. By default all issues are
considered as having a contribution intent, and given consideration from
that point of view.

When one files an issue called "h2o integration" with the description as it was,
it is too loose, too bulky and too open to interpretation. Even the Mahout PMC
Chair inquired about a software grant, let alone all the rest of the spectrum
of conjectures that resulted. This (a software grant) does not happen for a
simple contribution; it implies moving the entire project code base. Your
intent has clearly been benign, but this is an example of a not so well
stated intent in the issue, and it resulted in too much energy channeled into
wishful thinking. Hope this did not come off as lecturing on jira -- truly
it was not my intent.


Re: Straw poll re: H2O ?

Posted by Anand Avati <av...@gluster.org>.

On Mon, Apr 28, 2014 at 2:45 PM, Pat Ferrel <pa...@gmail.com> wrote:

> I favor iterative dev (implement quick then refactor to perfection) and so
> no I would not try to make the DSL or specific imports support more than
> one backend _until_ we support one. Doing two at once? I’ve seen too many
> projects that fail because of this. Why not tackle equal support for
> multiple backends as a refactoring task after one backend is fully
> implemented?
>
> Dmitriy is free to do as he wishes, of course, but he is spending an awful
> lot of time dealing with this.
>
> Multiple backends sounds fine in principle but the need has not been
> demonstrated, therefore it seems like a question to be deferred.
>

I'm explicitly staying away from responding to these points before we
rat-hole down a discussion of best principles and opinions. As you say, it
is best deferred till we have demonstrable work.


> Have you considered doing a much scaled back, single algo implementation
> as an example? It wouldn’t require changing the DSL or any code already in
> Mahout and would add something useful quickly. All you need is file level
> compatibility. Besides it might go towards demonstrating the need.


The initial approach of allowing implementation of algo in Java language
did follow this approach. There is an example KMeans implemented in Java in
"Mahout style" -
https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java
- that works, even in a distributed way. But the focus now has been to do
something similar through the DSL rather than Java lang - which is really
the interesting part of the integration. I can only hope that the entire
effort does not get prematurely dismissed or killed purely on a matter of
principle!

Thanks

On Apr 28, 2014, at 2:13 PM, Anand Avati <av...@gluster.org> wrote:
>
> Saikat, Pat,
>
> For background, please refer to the "Mahout DSL vs Spark" discussion
> for the general direction in which the integration is being explored. With
> that background, I would like to present some counter questions:
>
> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> separation if not for providing multiple compute backends?
>
> 2. Does the proposal of having a new DSL backend in the future (e.g.
> stratosphere, as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it
> also supports stratosphere"?
> -- worry that as a user/committer/contributor you have to worry about a new
> framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may
> not support it?
>
> Obviously no, since they are all just different DSL backends.
>
> Have you guys embraced the idea that the DSL allows for multiple backends
> (Spark being the first to get implemented)? Or not? Hence I do not
> understand the "problem" here.
>
> Thanks
>
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <sxk1969@hotmail.com> wrote:
>
> > I would echo Pat's sentiments spot on related to the goal of supporting
> > both spark and H2O confusing folks that are interested in using, committing
> > to and trying to understand where Mahout is headed small to medium term.
> > I hate to throw this out but given the amount of "sometimes not so nice
> > back and forths I've seen on issue 1500" I really wonder whether we should
> > have mahout-spark and mahout-h2o as two different top level projects
> > potentially supporting a different set of algorithms underneath, yes I know
> > tying mahout to a particular technology goes against the initial vision
> > but given the churn I'm seeing I'm not sure I understand what the current
> > vision even is :)
> >
> >> Subject: Re: Straw poll re: H2O ?
> >> From: pat@occamsmachete.com
> >> Date: Mon, 28 Apr 2014 13:17:03 -0700
> >> CC: ssc@apache.org
> >> To: dev@mahout.apache.org
> >>
> >> I haven’t heard a good explanation of what this project is. There should
> >> be some small step like implementing an algo on h2o that takes the same
> >> input as a current Hadoop Mahout job and produces the same result, or doing
> >> one not already in Mahout. At least it will answer some technical questions
> >> and shouldn’t take a lot of support from current committers to produce.
> >>
> >> I’m still not convinced that this is the primary thing that should drive
> >> making it a Mahout dependency.
> >>
> >> I’m highly dubious of actively supporting and working on Mahout for
> >> Spark and h2o. Not for technical reasons but because rebooting Mahout on
> >> two platforms seems a non-starter. No project manager in the commercial
> >> world would allow that sort of thing. And rightly so, it confuses users,
> >> committers, contributors. You shouldn’t have a great deal of redundancy or
> >> competing efforts _inside_ a project even an open source one. That’s for
> >> separate projects and the incubator, right? There are plenty of examples of
> >> going that route, Spark itself is redundant with Hadoop in many ways. Would
> >> Apache accept h2o as a parallel project to Spark, if so why not do that?
> >>
> >> Question: Where do we (Mahout user, committer, contributor) invest
> >> extremely precious time learning new languages, frameworks, architecture,
> >> configurations, optimizations?
> >>
> >> Answer: Many will simply not choose but wait and see, or go elsewhere.
> >>
> >> Why? Because we fail to communicate “the future of Mahout is Spark
> >> first, period”. It keeps coming out as “Spark and, well, h2o too”.
> >>
> >> That is a momentum killer. If we’re agreed on “Spark first” then
> >> there’s no need to incubate Mahout 2, Spark and Mahout have already gone
> >> through that and though Dmitriy’s DSL and Scala shell work is entirely new,
> >> to the end user the jobs, input and output, and functionality will look
> >> like a v2. People dealing with internals will see a different world but
> >> they should be a minority of users and will hopefully like what they see.
> >>
> >>
> >> Somewhat off subject notes on external politics:
> >>
> >> We really need to make sure Mahout stays in all the big distros. That
> >> means Sebastian’s comments are spot on: “The best way to help Mahout is to
> >> pick up some of the work that needs to be done with regards to
> >> documentation, examples, Hadoop 2 compatibility and designing the future,
> >> especially with regards to dataframes”. All the distros are Hadoop 2.
> >>
> >> Incubating Mahout 2 as another project is surely a way out of the
> >> distros, another momentum killer.
> >>
> >> Another political question is whether an h2o dependency would be an
> >> issue to the distros. If we are going to put big efforts into h2o let’s see
> >> how that plays out first. Spark is already supported by them, even
> >> Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
> >> distro will be asked to support it, that’s what they are paid for. Do they
> >> want to support h2o? I have no idea how they would react to that but it
> >> affects Mahout.
> >>
> >>
> >> For all these reasons I’d be -1 to any big-bang integration.
> >>
> >>
> >>> On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >>>
> >>> +1. I don't think anyone said anything, privately or publicly, about h2o
> >>> integration being a bad idea. It's just there's more than one way to do it,
> >>> so debate is focusing on exploration of pluses and minuses of each
> >>> individual proposal (as they come to light). Part of the difficulty here was
> >>> that the expertise intersection of all parts being connected and integrated
> >>> has been pretty poor on an individual basis. So we have to go by scenarios
> >>> where a group of specialized experts tries to figure out the solution.
> >>>
> >>> W.r.t. incubation proposals, they seem dubious for a number of reasons.
> >>>
> >>> Reason 1 is that these projects are the primary factor moving Mahout
> >>> anywhere forward. Without them, given the "bye-bye mapreduce" jira, there's
> >>> frankly not much left in Mahout, so it is a reflection of a more or less
> >>> common opinion that the project would just spiral down on its own if
> >>> things stay status quo.
> >>>
> >>> Reason 2 is that there are good (not irreplaceable, but good) components in
> >>> Mahout that these efforts depend on. Therefore, incubation would be faced
> >>> with the prospect of having dependencies on a project that on its own is
> >>> winding down. Not good for the incubation side.
> >>>
> >>> Reason 3 is that the current effort is (IMO) minimalistic enough not to
> >>> warrant a new project. It simply doesn't, and can't, have the scale of
> >>> things like Spark or the Hadoop eco. There would be just not enough
> >>> substance for a new project at this point. I don't feel very strongly
> >>> about this point though.
> >>>
> >>>
> >>> On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org> wrote:
> >>>
> >>>> We all should calm down here and remind ourselves why we are doing this
> >>>> whole thing: Because we love open source and want to have a vibrant
> >>>> community and a great piece of software.
> >>>>
> >>>> Mahout has come a long way and is at a crossroads right now, so it's only
> >>>> natural that there are heated discussions. But we should immediately stop
> >>>> the fingerpointing and related stuff, we have managed to avoid this since
> >>>> Mahout's inception and we should continue to do so.
> >>>>
> >>>> The best way to help Mahout is to pick up some of the work that needs to
> >>>> be done with regards to documentation, examples, Hadoop 2 compatibility and
> >>>> designing the future, especially with regards to dataframes e.g.
> >>>>
> >>>> We agreed to give the H2O guys a shot for exploration of a possible
> >>>> integration into Mahout. We should be grateful that they are investing a
> >>>> lot of time into this, and should help wherever we can. Once they come up
> >>>> with a concrete proposal or patch, we will have a look at it, have a deep,
> >>>> technical and polite discussion, and make a decision afterwards.
> >>>>
> >>>> --sebastian
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 04/28/2014 07:42 PM, Anand Avati wrote:
> >>>>
> >>>>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
> >>>>>
> >>>>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
> >>>>>> <ji...@apache.org> wrote:
> >>>>>>
> >>>>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly
> >>>>>>> what is encouraging the h2o work to be done a bit apart. It simply
> >>>>>>> isn't efficient to have to answer so many off-topic points whenever
> >>>>>>> any reports on work in progress are given.
> >>>>>>>
> >>>>>>> I think this has been the off-topic here.
> >>>>>>>
> >>>>>>> Calling my comments "emotional" or "non-technical", or _loosely_
> >>>>>>> paraphrasing me.
> >>>>>>
> >>>>>> Yes, the personal finger-pointing parts don't belong and don't
> >>>>>> convince anyone, let's skip those.
> >>>>>
> >>>>> +1. Let's skip those.
> >>>>>
> >>>>>> From the sidelines, I see a bunch of work intended for Mahout
> >>>>>> proceeding outside the community such as it is, and even Apache. Of
> >>>>>> course, contributions are always prepped externally to some degree. I
> >>>>>> create, debug, change patches before posting them, maybe checking in
> >>>>>> early on choices that others may want input on.
> >>>>>>
> >>>>>> This is a large-ish change being proposed, IIUC. I can see one person
> >>>>>> who publicly, and at least two who privately, have clear reservations
> >>>>>> about this direction.
> >>>>>>
> >>>>>
> >>>>>
> >>>>> It will probably be a large-ish change, indeed. But my personal take is
> >>>>> that non-technical aspects of the debate are unfortunately taking
> >>>>> precedence over real technical parts. Please refer to email thread
> >>>>> "Mahout DSL vs Spark".
> >>>>>
> >>>>>> It certainly appears funny vis-a-vis the "Apache
> >>>>>> way" to work on a contribution *because* one (or more) other
> >>>>>> committers aren't convinced.
> >>>>>
> >>>>> As mentioned in the referred email thread, a lot of the technical issues
> >>>>> which got addressed in the work which was carried out outside of Apache
> >>>>> were really sorting out and highlighting build and classloader related
> >>>>> challenges on the H2O side. There was little motivation to carry out those
> >>>>> discussions on the Mahout lists as it was really ~99% H2O specific
> >>>>> discussions and noise/spam to the Mahout community.
> >>>>>
> >>>>>> I don't think that's important to dither about. What is, is this: if a
> >>>>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
> >>>>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
> >>>>>> yet, but it seems like a quite possible outcome.
> >>>>>
> >>>>> As an outsider, my opinion is that the proposed need for a VOTE is a
> >>>>> largely masqueraded problem built around the perception of disagreement
> >>>>> over something vague, abstract and inaccurate. And therefore premature.
> >>>>> That being said the PMC may vote on any issues/non-issues it may please.
> >>>>>
> >>>>>> Would be a shame to do a lot of work, intending it for a commit, and
> >>>>>> then find there is not consensus.
> >>>>>
> >>>>> Exactly the kind of inaccurate perception I meant. While we are (at least
> >>>>> I am) exploring the best fit model for integration, and exploration by
> >>>>> definition involves taking potentially wrong steps and backtracking if
> >>>>> necessary, the perception unfortunately seems to be that the proposed
> >>>>> intermediate (potentially wrong) steps are some kind of pre-decided plan
> >>>>> of action. So, no, there WOULDN'T be a lot of work intended for a commit
> >>>>> against consensus.
> >>>>>
> >>>>>> So is it better to figure out earlier than later whether these 2+
> >>>>>> parallel tracks have enough commonality to coexist?
> >>>>>
> >>>>> Whether two parallel tracks (I assume the spark track and the H2O track?)
> >>>>> have enough commonality to exist - one way you surely cannot get the right
> >>>>> answer for this (except by co-incidence) is by taking a vote from a group
> >>>>> who are experts in only either one of those tracks. From what I see, most
> >>>>> of the opposition has been due to a combination of lack of understanding
> >>>>> of H2O and (welcome) skepticism. If, as a contributor, I find there is no
> >>>>> natural or beneficial way to co-exist with Spark, I wouldn't waste my time
> >>>>> writing code, and for sure am not dependent on another group's vote to
> >>>>> make that decision for me.
> >>>>>
> >>>>> Avati
> >>>>>
> >>>>>
> >>>>
> >>>
> >
> >
>
>

Re: Straw poll re: H2O ?

Posted by Frank Scholten <sc...@gmail.com>.

On Apr 28, 2014, at 23:45, Pat Ferrel <pa...@gmail.com> wrote:

> I favor iterative dev (implement quick then refactor to perfection) and so no I would not try to make the DSL or specific imports support more than one backend _until_ we support one.
> Doing two at once? I’ve seen too many projects that fail because of this. Why not tackle equal support for multiple backends as a refactoring task after one backend is fully implemented?
> 
> Dmitriy is free to do as he wishes, of course, but he is spending an awful lot of time dealing with this. 
> 
> Multiple backends sounds fine in principle but the need has not been demonstrated, therefore it seems like a question to be deferred.
> 
> Have you considered doing a much scaled back, single algo implementation as an example? It wouldn’t require changing the DSL or any code already in Mahout and would add something useful quickly. All you need is file level compatibility. Besides it might go towards demonstrating the need.

+1

> 
> 
> On Apr 28, 2014, at 2:13 PM, Anand Avati <av...@gluster.org> wrote:
> 
> Saikat, Pat,
> 
> For background, please refer to the "Mahout DSL vs Spark" discussion
> for the general direction in which the integration is being explored. With
> that background, I would like to present some counter questions:
> 
> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> separation if not for providing multiple compute backends?
> 
> 2. Does the proposal of having a new DSL backend in the future (e.g.
> stratosphere, as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it
> also supports stratosphere"?
> -- worry that as a user/committer/contributor you have to worry about a new
> framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may
> not support it?
> 
> Obviously no, since they are all just different DSL backends.
> 
> Have you guys embraced the idea that the DSL allows for multiple backends
> (Spark being the first to get implemented)? or Not? Hence I do not
> understand the "problem" here.
> 
> Thanks
> 
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> 
>> I would echo Pat's sentiments spot on related to the goal of supporting
>> both spark and H2O confusing folks that are interested in using, committing
>> to and trying to understand where Mahout is headed small to medium term.
>> I hate to throw this out but given the amount of "sometimes not so nice
>> back and forths I've seen on issue 1500" I really wonder whether we should
>> have mahout-spark and mahout-h2o as two different top level projects
>> potentially supporting a different set of algorithms underneath, yes I know
>> tying mahout to a particular technology goes against the initial vision
>> but given the churn I'm seeing I'm not sure I understand what the current
>> vision even is :)
>> 
>>> Subject: Re: Straw poll re: H2O ?
>>> From: pat@occamsmachete.com
>>> Date: Mon, 28 Apr 2014 13:17:03 -0700
>>> CC: ssc@apache.org
>>> To: dev@mahout.apache.org
>>> 
>>> I haven’t heard a good explanation of what this project is. There should
>>> be some small step like implementing an algo on h2o that takes the same input
>>> as a current Hadoop Mahout job and produces the same result, or doing one not
>>> already in Mahout. At least it will answer some technical questions and
>>> shouldn’t take a lot of support from current committers to produce.
>>> 
>>> I’m still not convinced that this is the primary thing that should drive
>> making it a Mahout dependency.
>>> 
>>> I’m highly dubious of actively supporting and working on Mahout for
>> Spark and h2o. Not for technical reasons but because rebooting Mahout on
>> two platforms seems a non-starter. No project manager in the commercial
>> world would allow that sort of thing. And rightly so, it confuses users,
>> committers, contributors. You shouldn’t have a great deal of redundancy or
>> competing efforts _inside_ a project even an open source one. That’s for
>> separate projects and the incubator, right? There are plenty of examples of
>> going that route, Spark itself is redundant with Hadoop in many ways. Would
>> Apache accept h2o as a parallel project to Spark, if so why not do that?
>>> 
>>> Question: Where do we (Mahout user, committer, contributor) invest
>> extremely precious time learning new languages, frameworks, architecture,
>> configurations, optimizations?
>>> 
>>> Answer: Many will simply not choose but wait and see, or go elsewhere.
>>> 
>>> Why? Because we fail to communicate “the future of Mahout is Spark
>> first—period” It keeps coming out "Spark and, well, h2o too”
>>> 
>>> That is a momentum killer.  If we’re agreed on “Spark first” then
>> there’s no need to incubate Mahout 2, Spark and Mahout have already gone
>> through that and though Dmitriy’s DSL and Scala shell work is entirely new,
>> to the end user the jobs, input and output, and functionality will look
>> like a v2. People dealing with internals will see a different world but
>> they should be a minority of users and will hopefully like what they see.
>>> 
>>> 
>>> Somewhat off subject notes on external politics:
>>> 
>>> We really need to make sure Mahout stays in all the big distros. That
>> means Sebastian’s comments are spot on: "The best way to help Mahout is to
>> pick up some of the work that needs to be done with regards to
>> documentation, examples, Hadoop 2 compatibility and designing the future,
>> especially with regards to dataframes”  All the distros are hadoop 2.
>>> 
>>> Incubating Mahout 2 as another project is surely a way out of the
>> distros, another momentum killer.
>>> 
>>> Another political question is whether an h2o dependency would be an
>> issue to the distros. If we are going to put big efforts into h2o let’s see
>> how that plays out first. Spark is already supported by them, even
>> Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
>> distro will be asked to support it, that’s what they are paid for. Do they
>> want to support h2o? I have no idea how they would react to that but it
>> affects Mahout.
>>> 
>>> 
>>> For all these reasons I’d be -1 to any big-bang integration.
>>> 
>>> 
>>>> On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>>> 
>>>> +1. I don't think anyone said anything, privately or publicly, about
>> h20
>>>> integration being a bad idea. It's just there's more than one way to
>> do it,
>>>> so debate is focusing on exploration of pluses and minuses of each
>>>> individual proposal (as they come to light). Part of difficulty here
>> was
>>>> that the expertise intersection of all parts being connected and
>> integrated
>>>> has been pretty poor on individual basis. So we have to go by scenarios
>>>> where a group of specialized experts tries to figure out the solution.
>>>> 
>>>> W.r.t. incubation proposals, they seem dubious for a number of reasons.
>>>> 
>>>> Reason 1 is that these projects are the primary factor moving Mahout
>>>> anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
>>>> frankly not much left in Mahout, so it is reflection of more or less
>> common
>>>> opinion that the project would just spiral down on its own if the
>> things
>>>> stay status-quo.
>>>> 
>>>> Reason 2 is that there are good (not irreplaceable, but good)
>> components in
>>>> Mahout that these efforts depend on. Therefore, incubation would be
>> faced
>>>> with a perspective of having dependencies on project that on its own is
>>>> winding down. Not good for incubation side.
>>>> 
>>>> Reason 3 is that current effort is (IMO) minimalistic enough not to
>> warrant
>>>> a new project. It simply doesn't, and can't have the scale of things
>> like
>>>> Spark or Hadoop eco. There would be just not enough substance for a new
>>>> project at this point. I don't feel very strong about this point
>> though.
>>>> 
>>>> 
>>>> On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>>> 
>>>>> We all should calm down here and remind ourselves why we are doing
>> this
>>>>> whole thing: Because we love open source and want to have a vibrant
>>>>> community and a great piece of software.
>>>>> 
>>>>> Mahout has come a long way and is at a crossroads right now, so it's only
>>>>> natural that there are heated discussions. But, we should immediately
>> stop
>>>>> the fingerpointing and related stuff, we have managed to avoid this
>> since
>>>>> Mahout's inception and we should continue to do so.
>>>>> 
>>>>> The best way to help Mahout is to pick up some of the work that needs
>> to
>>>>> be done with regards to documentation, examples, Hadoop 2
>> compatibility and
>>>>> designing the future, especially with regards to dataframes e.g.
>>>>> 
>>>>> We agreed to give the h2O guys a shot for exploration of a possible
>>>>> integration into Mahout. We should be grateful that they are
>> investing a
>>>>> lot of time into this, and should help wherever we can. Once they come up
>>>>> with a concrete proposal or patch, we will have a look at it, have a
>> deep,
>>>>> technical and polite discussion, and make a decision afterwards.
>>>>> 
>>>>> --sebastian
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 04/28/2014 07:42 PM, Anand Avati wrote:
>>>>> 
>>>>>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>> 
>>>>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
>>>>>>> <ji...@apache.org> wrote:
>>>>>>> 
>>>>>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly
>> what
>>>>>>> is encouraging the h2o work to be done a bit apart. It simply isn't
>>>>>>> efficient to have to answer so many off-topic points whenever any
>> reports
>>>>>>> on work in progress are given.
>>>>>>> 
>>>>>>>> 
>>>>>>>> I think this has been the off-topic here.
>>>>>>>> 
>>>>>>>> Calling my comments "emotional" or "non-technical", or _loosely_
>>>>>>> paraphrasing me.
>>>>>>> 
>>>>>>> Yes, the personal finger-pointing parts don't belong and don't
>>>>>>> convince anyone, let's skip those.
>>>>>> +1. Let's skip those.
>>>>>> 
>>>>>> 
>>>>>> From the sidelines, I see a bunch of work intended for Mahout
>>>>>> 
>>>>>>> proceeding outside the community such as it is, and even Apache. Of
>>>>>>> course, contributions are always prepped externally to some degree.
>> I
>>>>>>> create, debug, change patches before posting them, maybe checking in
>>>>>>> early on choices that others may want input on.
>>>>>>> 
>>>>>>> This is a large-ish change being proposed, IIUC. I can see one
>> person
>>>>>>> who publicly, and at least two who privately, have clear
>> reservations
>>>>>>> about this direction.
>>>>>> 
>>>>>> 
>>>>>> It will probably be a large-ish change, indeed. But my personal take
>> is
>>>>>> that, non-technical aspects of the debate is unfortunately taking
>>>>>> precedence over real technical parts. Please refer to email thread
>> "Mahout
>>>>>> DSL vs Spark".
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> It certainly appears funny vis-a-vis the "Apache
>>>>>>> way" to work on a contribution *because* one (or more) other
>>>>>>> committers aren't convinced.
>>>>>> As mentioned in the referred email thread, a lot of the technical
>> issues
>>>>>> which got addressed in the work which was carried out outside of
>> Apache,
>>>>>> was really sorting out and highlighting build and classloader related
>>>>>> challenges on the H2O side. There was little motivation to carry out
>> those
>>>>>> discussions on the Mahout lists as it was really ~99% H2O specific
>>>>>> discussions and noise/spam to the Mahout community.
>>>>>> 
>>>>>> I don't think that's important to dither about. What is, is this: if
>> a
>>>>>> 
>>>>>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
>>>>>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
>>>>>>> yet, but it seems like a quite possible outcome.
>>>>>> As an outsider, my opinion is that the proposed need for a VOTE is a
>>>>>> largely masqueraded problem built around the perception of
>> disagreement
>>>>>> over something vague, abstract and inaccurate. And therefore
>> premature.
>>>>>> That being said the PMC may vote on any issues/non-issues it may
>> please.
>>>>>> 
>>>>>> Would be a shame to do a lot of work, intending it for a commit, and
>>>>>> 
>>>>>>> then find there is not consensus.
>>>>>> Exactly the kind of inaccurate perception I meant. While we are (at
>> least
>>>>>> I
>>>>>> am) exploring the best fit model for integration, and exploration by
>>>>>> definition involves taking potentially wrong steps and backtracking
>> if
>>>>>> necessary, the perception unfortunately seems to be that the proposed
>>>>>> intermediate (potentially wrong) steps are some kind of pre-decided
>> plan
>>>>>> of
>>>>>> action. So, no, there WOULDN'T be a lot of work intended for a commit
>>>>>> against consensus.
>>>>>> 
>>>>>> So is it better to figure out earlier than later whether these 2+
>>>>>> 
>>>>>>> parallel tracks have enough commonality to coexist?
>>>>>> 
>>>>>> 
>>>>>> Whether two parallel tracks (I assume the spark track and the H2O
>> track?)
>>>>>> have enough commonality to exist - one way you surely cannot get the
>> right
>>>>>> answer for this (except by co-incidence) is by taking a vote from a
>> group
>>>>>> who are experts in only either one of those tracks. From what I see,
>> most
>>>>>> of the opposition has been due to a combination of lack of
>> understanding
>>>>>> of
>>>>>> H2O and (welcome) skepticism. If, as a contributor, I find there is
>> no
>>>>>> natural or beneficial way to co-exist with Spark, I wouldn't waste
>> my time
>>>>>> writing code, and for sure am not dependent on another group's vote
>> to
>>>>>> make
>>>>>> that decision for me.
>>>>>> 
>>>>>> Avati
>

Re: Straw poll re: H2O ?

Posted by Pat Ferrel <pa...@gmail.com>.

I favor iterative dev (implement quickly, then refactor to perfection), so no, I would not try to make the DSL or specific imports support more than one backend _until_ we support one. Doing two at once? I’ve seen too many projects fail because of this. Why not tackle equal support for multiple backends as a refactoring task after one backend is fully implemented?

Dmitriy is free to do as he wishes, of course, but he is spending an awful lot of time dealing with this. 

Multiple backends sounds fine in principle, but the need has not been demonstrated, so it seems like a question to be deferred.

Have you considered doing a much scaled-back, single-algo implementation as an example? It wouldn’t require changing the DSL or any code already in Mahout, and it would add something useful quickly. All you need is file-level compatibility. Besides, it might go toward demonstrating the need.


On Apr 28, 2014, at 2:13 PM, Anand Avati <av...@gluster.org> wrote:

Saikat, Pat,

For background, please refer to the "Mahout DSL vs Spark" discussion
for the general direction in which the integration is being explored. With
that background, I would like to present some counter questions:

1. Why is the DSL claiming to have (in its vision) logical vs physical
separation if not for providing multiple compute backends?

2. Does the proposal of having a new DSL backend in the future (e.g.
stratosphere as suggested elsewhere) make you:
-- propose mahout-stratosphere as a different top level project?
-- worry that stratosphere would be a dependency to Mahout?
-- worry that you won't be able to say "Future of Mahout is Spark .. but it
also supports stratosphere"?
-- worry that as a user/committer/contributor you have to worry about a new
framework?
-- resist having a DSL backend for stratosphere because Hadoop vendors may
not support it?

Obviously no, since they are all just different DSL backends.

Have you guys embraced the idea that the DSL allows for multiple backends
(Spark being the first to get implemented), or not? Hence I do not
understand the "problem" here.

Thanks

On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:

> I would echo Pat's sentiments: the goal of supporting both Spark and H2O
> confuses folks who are interested in using, committing to, and trying to
> understand where Mahout is headed in the small to medium term.
> I hate to throw this out, but given the number of sometimes not-so-nice
> back-and-forths I've seen on issue 1500, I really wonder whether we should
> have mahout-spark and mahout-h2o as two different top-level projects,
> potentially supporting a different set of algorithms underneath. Yes, I know
> tying Mahout to a particular technology goes against the initial vision,
> but given the churn I'm seeing, I'm not sure I understand what the current
> vision even is :)
> 
>> Subject: Re: Straw poll re: H2O ?
>> From: pat@occamsmachete.com
>> Date: Mon, 28 Apr 2014 13:17:03 -0700
>> CC: ssc@apache.org
>> To: dev@mahout.apache.org
>> 
>> I haven’t heard a good explanation of what this project is. There should
> be some small step like implementing an algo on h2o to takes the same input
> as a current Hadoop Mahout job and produce the same result or do one not
> already in Mahout. At least it will answer some technical questions and
> shouldn’t take a lot of support from current committers to produce.
>> 
>> I’m still not convinced that this is the primary thing that should drive
> making it a Mahout dependency.
>> 
>> I’m highly dubious of actively supporting and working on Mahout for
> Spark and h2o. Not for technical reasons but because rebooting Mahout on
> two platforms seems a non-starter. No project manager in the commercial
> world would allow that sort of thing. And rightly so, it confuses users,
> committers, contributors. You shouldn’t have a great deal of redundancy or
> competing efforts _inside_ a project even an open source one. That’s for
> separate projects and the incubator, right? There are plenty of examples of
> going that route, Spark itself is redundant with Hadoop in many ways. Would
> Apache accept h2o as a parallel project to Spark, if so why not do that?
>> 
>> Question: Where do we (Mahout user, committer, contributor) invest
> extremely precious time learning new languages, frameworks, architecture,
> configurations, optimizations?
>> 
>> Answer: Many will simply not choose but wait and see, or go elsewhere.
>> 
>> Why? Because we fail to communicate “the future of Mahout is Spark
> first—period” It keeps coming out "Spark and, well, h2o too”
>> 
>> That is a momentum killer.  If we’re agreed on “Spark first” then
> there’s no need to incubate Mahout 2, Spark and Mahout have already gone
> through that and though Dmitriy’s DSL and Scala shell work is entirely new,
> to the end user the jobs, input and output, and functionality will look
> like a v2. People dealing with internals will see a different world but
> they should be a minority of users and will hopefully like what they see.
>> 
>> 
>> Somewhat off subject notes on external politics:
>> 
>> We really need to make sure Mahout stays in all the big distros. That
> means Sebastian’s comments are spot on: "The best way to help Mahout is to
> pick up some of the work that needs to be done with regards to
> documentation, examples, Hadoop 2 compatibility and designing the future,
> especially with regards to dataframes”  All the distros are hadoop 2.
>> 
>> Incubating Mahout 2 as another project is surely a way out of the
> distros, another momentum killer.
>> 
>> Another political question is whether an h2o dependency would be an
> issue to the distros. If we are going to put big efforts into h2o let’s see
> how that plays out first. Spark is already supported by them, even
> Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
> distro will be asked to support it, that’s what they are paid for. Do they
> want to support h2o? I have no idea how they would react to that but it
> affects Mahout.
>> 
>> 
>> For all these reasons I’d be -1 to any big-bang integration.
>> 
>> 
>>> On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>>> 
>>> +1. I don't think anyone said anything, privately or publicly, about
> h20
>>> integration being a bad idea. It's just there's more than one way to
> do it,
>>> so debate is focusing on exploration of pluses and minuses of each
>>> individual proposal (as they come to light). Part of difficulty here
> was
>>> that the expertise intersection of all parts being connected and
> integrated
>>> has been pretty poor on individual basis. So we have to go by scenarios
>>> where a group of specialized experts tries to figure out the solution.
>>> 
>>> w.r.t to incubation proposals, it seems dubious for a number reasons.
>>> 
>>> Reason 1 is that these projects are the primary factor moving Mahout
>>> anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
>>> frankly not much left in Mahout, so it is reflection of more or less
> common
>>> opinion that the project would just spiral down on its own if the
> things
>>> stay status-quo.
>>> 
>>> Reason 2 is that there are good (not irreplaceable, but good)
> components in
>>> Mahout that these efforts depend on. Therefore, incubation would be
> faced
>>> with a perspective of having dependencies on project that on its own is
>>> winding down. Not good for incubation side.
>>> 
>>> Reason 3 is that current effort is (IMO) minimalistic enough not to
> warrant
>>> a new project. It simply doesn't, and can't have the scale of things
> like
>>> Spark or Hadoop eco. There would be just not enough substance for a new
>>> project at this point. I don't feel very strong about this point
> though.
>>> 
>>> 
>>> On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org>
> wrote:
>>> 
>>>> We all should calm down here and remind ourselves why we are doing
> this
>>>> whole thing: Because we love open source and want to have a vibrant
>>>> community and a great piece of software.
>>>> 
>>>> Mahout has come a long way and is at a crossroads right now, so its
> only
>>>> natural that there are heated discussions. But, we should immediately
> stop
>>>> the fingerpointing and related stuff, we have managed to avoid this
> since
>>>> Mahout's inception and we should continue to do so.
>>>> 
>>>> The best way to help Mahout is to pick up some of the work that needs
> to
>>>> be done with regards to documentation, examples, Hadoop 2
> compatibility and
>>>> designing the future, especially with regards to dataframes e.g.
>>>> 
>>>> We agreed to give the h2O guys a shot for exploration of a possible
>>>> integration into Mahout. We should be grateful that they are
> investing a
>>>> lot of time into this, and should help whereever we can. Once they
> come up
>>>> with a concrete proposal or patch, we will have a look at it, have a
> deep,
>>>> technical and polite discussion, and make a decision afterwards.
>>>> 
>>>> --sebastian
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 04/28/2014 07:42 PM, Anand Avati wrote:
>>>> 
>>>>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>> 
>>>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
>>>>>> <ji...@apache.org> wrote:
>>>>>> 
>>>>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly
> what
>>>>>>> 
>>>>>> is encouraging the h2o work to be done a bit apart. It simply isn't
>>>>>> efficient to have to answer so many off-topic points whenever any
> reports
>>>>>> on work in progress are given.
>>>>>> 
>>>>>>> 
>>>>>>> I think this has been the off-topic here.
>>>>>>> 
>>>>>>> Calling my comments "emotional" or "non-technical", or _loosely_
>>>>>>> 
>>>>>> paraphrasing me.
>>>>>> 
>>>>>> Yes, the personal finger-pointing parts don't belong and don't
>>>>>> convince anyone, let's skip those.
>>>>>> 
>>>>>> 
>>>>> +1. Let's skip those.
>>>>> 
>>>>> 
>>>>> From the sidelines, I see a bunch of work intended for Mahout
>>>>> 
>>>>>> proceeding outside the community such as it is, and even Apache. Of
>>>>>> course, contributions are always prepped externally to some degree.
> I
>>>>>> create, debug, change patches before posting them, maybe checking in
>>>>>> early on choices that others may want input on.
>>>>>> 
>>>>>> This is a large-ish change being proposed, IIUC. I can see one
> person
>>>>>> who publicly, and at least two who privately, have clear
> reservations
>>>>>> about this direction.
>>>>>> 
>>>>> 
>>>>> 
>>>>> It will probably be a large-ish change, indeed. But my personal take
> is
>>>>> that, non-technical aspects of the debate is unfortunately taking
>>>>> precedence over real technical parts. Please refer to email thread
> "Mahout
>>>>> DSL vs Spark".
>>>>> 
>>>>> 
>>>>> 
>>>>> It certainly appears funny vis-a-vis the "Apache
>>>>>> way" to work on a contribution *because* one (or more) other
>>>>>> committers aren't convinced.
>>>>>> 
>>>>>> 
>>>>> As mentioned in the referred email thread, a lot of the technical
> issues
>>>>> which got addressed in the work which was carried out outside of
> Apache,
>>>>> was really sorting out and highlighting build and classloader related
>>>>> challenges on the H2O side. There was little motivation to carry out
> those
>>>>> discussions on the Mahout lists as it was really ~99% H2O specific
>>>>> discussions and noise/spam to the Mahout community.
>>>>> 
>>>>> I don't think that's important to dither about. What is, is this: if
> a
>>>>> 
>>>>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
>>>>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
>>>>>> yet, but it seems like a quite possible outcome.
>>>>>> 
>>>>>> 
>>>>> As an outsider, my opinion is that the proposed need for a VOTE is a
>>>>> largely masqueraded problem built around the perception of
> disagreement
>>>>> over something vague, abstract and inaccurate. And therefore
> premature.
>>>>> That being said the PMC may vote on any issues/non-issues it may
> please.
>>>>> 
>>>>> Would be a shame to do a lot of work, intending it for a commit, and
>>>>> 
>>>>>> then find there is not consensus.
>>>>>> 
>>>>>> 
>>>>> Exactly the kind of inaccurate perception I meant. While we are (at
> least
>>>>> I
>>>>> am) exploring the best fit model for integration, and exploration by
>>>>> definition involves taking potentially wrong steps and backtracking
> if
>>>>> necessary, the perception unfortunately seems to be that the proposed
>>>>> intermediate (potentially wrong) steps are some kind of pre-decided
> plan
>>>>> of
>>>>> action. So, no, there WOULDN'T be a lot of work intended for a commit
>>>>> against consensus.
>>>>> 
>>>>> So is it better to figure out earlier than later whether these 2+
>>>>> 
>>>>>> parallel tracks have enough commonality to coexist?
>>>>>> 
>>>>> 
>>>>> 
>>>>> Whether two parallel tracks (I assume the spark track and the H2O
> track?)
>>>>> have enough commonality to exist - one way you surely cannot get the
> right
>>>>> answer for this (except by co-incidence) is by taking a vote from a
> group
>>>>> who are experts in only either one of those tracks. From what I see,
> most
>>>>> of the opposition has been due to a combination of lack of
> understanding
>>>>> of
>>>>> H2O and (welcome) skepticism. If, as a contributor, I find there is
> no
>>>>> natural or beneficial way to co-exist with Spark, I wouldn't waste
> my time
>>>>> writing code, and for sure am not dependent on another group's vote
> to
>>>>> make
>>>>> that decision for me.
>>>>> 
>>>>> Avati
>>>>> 
>>>>> 
>>>> 
>>> 
> 
>

Re: Straw poll re: H2O ?

Posted by Ted Dunning <te...@gmail.com>.

That would be nice, but what I said rests on my own unpublished
evaluation from personal use.  This isn't a formal evaluation.

We should encourage the 0xdata team to show us what they can do.



On Thu, May 1, 2014 at 1:25 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> sure. I assume this should include statements that something crushes
> something without providing a link to a published analysis of what it is
> something that crushes something another and due to what something.
>
>
> On Wed, Apr 30, 2014 at 4:16 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > It seems to me that Sebastian and Ellen have hit on the right tack.
> >
> > Let's get back to work making something cool here.  Let's build this
> > community up instead of having endlessly divisive discussions.
> >
> > Let's get back to the Apache emphasis on do-acracy.
> >
> >
> >
> > On Wed, Apr 30, 2014 at 11:36 AM, Ellen Friedman <b.ellen.friedman@gmail.com> wrote:
> >
> > > I am weighing in here on issues of great concern but non-technical.
> > >
> > > 1. One of the great things about Mahout is the community – not an easy
> > > thing to have achieved given that people are dispersed geographically
> > > and there is no single focus or company backing the project. In short,
> > > the people who make Mahout are doing something cool.
> > >
> > > Suggestions to try to break it into different groups, Mahout-Spark and
> > > Mahout2o, run counter to this success. Why fragment it at exactly the
> > > moment when new contributors (from 0xdata) are coming forward? The
> > > spirit of this project has been inclusive. Let's not change that now.
> > >
> > > 2. Sebastian pointed out:
> > >
> > > "We agreed to give the h2O guys a shot for exploration of a possible
> > > integration into Mahout. We should be grateful that they are investing
> > > a lot of time into this, and should help wherever we can. Once they
> > > come up with a concrete proposal or patch, we will have a look at it,
> > > have a deep, technical and polite discussion, and make a decision
> > > afterwards."
> > >
> > > +1
> > >
> > > We agreed to explore the h2o option. Why use lots of time and
> > > energy in re-visiting and second-guessing that decision? Let it go
> > > forward, likely some great things will emerge for Mahout, and if not,
> > > then we say "thank you" to h2o contributors for giving it a try.
> > >
> > > As the guys from h2o are adding new resources to do this development,
> > > it is not really detracting anything from Mahout's resources except
> > > when someone opens one of these discussions that lead to fragmentation
> > > and distraction. I'm not a coder and not as technical as any of you,
> > > but from my view it seems to be the talk and not the development that
> > > is distracting.
> > >
> > > 3. Over the last year, there has been growing and widespread interest
> > > in Mahout from the outside world, and now, with the new changes to
> > > support Scala, Spark and h2o (possibly Stratosphere later) the growing
> > > interest has turned into excitement. This is a great time for the
> > > project – tons of effort but moving toward a big result.
> > >
> > > Users will have some excellent new choices, all parts of Mahout will
> > > benefit. And if in the future it is seen that some of the new features
> > > are not being widely or successfully used, they will be deprecated, as
> > > was done during the big clean-up of the 0.8 release. New choices, new
> > > ways to use Mahout, new people getting involved – this is excellent.
> > >
> > > 4. My thought is, stick together, embrace change, welcome newcomers
> > > and be very proud to be building the new Mahout.
> > >
> > >
> > >
> > > On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
> > > > For reasons of transparency in this discussion, I should add that I
> am
> > a
> > > > committer on the upcoming Stratosphere ASF podling, co-worker of the
> > > > main developers and have contributed to it as part of my PhD.
> > > >
> > > > On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
> > > >> Anand,
> > > >>
> > > > >> I'm trying to answer some of your questions, and my answers highlight
> > > > >> the points that I would like to see clarified about h2o.
> > > >>
> > > >> On 04/28/2014 11:13 PM, Anand Avati wrote:
> > > >>
> > > >>> 1. Why is the DSL claiming to have (in its vision) logical vs
> > physical
> > > >>> separation if not for providing multiple compute backends?
> > > >>
> > > > >> This is not a claim or a vision; the DSL already has this separation.
> > > > >> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, that's the logical
> > > > >> operator for executing a Transpose-Times-Self matrix multiplication. In
> > > > >> o.a.m.sparkbindings.blas.AtA you will find two physical operator
> > > > >> implementations for that. The choice of which one to use depends on
> > > > >> whether there is enough memory to hold certain intermediate results.
> > > > >>
> > > > >> The primary intention of a separation into logical and physical
> > > > >> operators is to allow for a declarative programming style on the user's
> > > > >> side and for an optimizer on the system side which automatically chooses
> > > > >> the optimal physical operator for the execution of a specific program.
> > > > >>
> > > > >> This choice of the physical operator might depend on the shape and
> > > > >> amount of the data processed as well as on the underlying available
> > > > >> resources. *The separation into logical and physical operators clearly
> > > > >> doesn't imply having multiple backends*. It only makes it very easy to
> > > > >> support them.
> > > >>
> > > >>>
> > > >>> 2. Does the proposal of having a new DSL backend in the future (for
> > e.g
> > > >>> stratosphere as suggested elsewhere) make you:
> > > >>
> > > >>> -- worry that stratosphere would be a dependency to Mahout?
> > > >>
> > > > >> Stratosphere has been accepted as an incubator project in the ASF
> > > > >> recently, so the worry about such a dependency is naturally less than
> > > > >> about an externally managed project like h2o.
> > > >>
> > > >>> -- worry that as a user/commiter/contributor you have to worry
> about
> > a
> > > >>> new
> > > >>> framework?
> > > >>
> > > > >> In my eyes, there is a big difference between Spark/Stratosphere and
> > > > >> h2o. Spark and Stratosphere have a clearly defined programming and
> > > > >> execution model. They execute programs that are composed of a DAG of
> > > > >> operators. The set of operators has clearly defined semantics and
> > > > >> parallelization strategies. If you compare their operators, you will
> > > > >> find that they offer pretty much the same in slightly different flavors.
> > > > >> For both, there are scientific papers that explain all these things in
> > > > >> detail.
> > > > >>
> > > > >> I have asked about a detailed description of h2o's programming model and
> > > > >> execution model and I searched the documentation, but I haven't been
> > > > >> able to find something that clearly describes how things are done. I
> > > > >> would love to read up on this, but until I'm presented with this, I have
> > > > >> to assume that such a principled foundation is missing.
> > > >>
> > > >>
> > > >> --sebastian
> > > >>
> > > >
> > > >
> > >
> >
>
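
To make the logical-versus-physical operator distinction from Sebastian's quoted reply above a bit more concrete, here is a minimal sketch in plain Python. It is not Mahout code: the function names, the memory-budget parameter and the two strategies are all made up for illustration, and the two physical variants that actually live in o.a.m.sparkbindings.blas.AtA are not reproduced here. The only idea carried over is that one logical operator ("compute A'A") can be served by several physical implementations, with a chooser picking one based on available resources.

```python
from dataclasses import dataclass
from typing import Callable, List

Matrix = List[List[float]]


@dataclass(frozen=True)
class OpAtA:
    """Logical operator (hypothetical stand-in): states WHAT to compute, A'A."""
    a: Matrix


def ata_outer_product(a: Matrix) -> Matrix:
    """Physical strategy 1: one pass over the rows of A, adding each row's
    outer product into an n x n accumulator that stays live throughout."""
    n = len(a[0])
    acc = [[0.0] * n for _ in range(n)]
    for row in a:
        for i in range(n):
            for j in range(n):
                acc[i][j] += row[i] * row[j]
    return acc


def ata_cell_by_cell(a: Matrix) -> Matrix:
    """Physical strategy 2: rescan A for every output cell, so the working
    state per cell is a single running sum (more compute, less state)."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]


def choose_physical(op: OpAtA, memory_budget_cells: int) -> Callable[[Matrix], Matrix]:
    """Toy optimizer rule (an assumption, not Mahout's): use the accumulator
    strategy only if the n x n accumulator fits in the given budget."""
    n = len(op.a[0])
    return ata_outer_product if n * n <= memory_budget_cells else ata_cell_by_cell


def execute(op: OpAtA, memory_budget_cells: int) -> Matrix:
    """Run the logical operator with whichever physical strategy was chosen."""
    return choose_physical(op, memory_budget_cells)(op.a)
```

Both strategies return the same matrix; only the resource profile differs, which is why a caller can write the logical operation once and leave the choice to the system. Nothing in this sketch involves a second backend, which matches the point that the separation does not by itself imply multiple backends.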

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Sure. I assume this should include statements that something crushes
something else, made without a link to a published analysis of what
exactly crushes what, and why.


On Wed, Apr 30, 2014 at 4:16 PM, Ted Dunning <te...@gmail.com> wrote:

> It seems to me that Sebastian and Ellen have hit on the right tack.
>
> Let's get back to work making something cool here.  Let's build this
> community up instead of having endlessly divisive discussions.
>
> Let's get back to the Apache emphasis on do-acracy.
>
>
>
> On Wed, Apr 30, 2014 at 11:36 AM, Ellen Friedman <b.ellen.friedman@gmail.com> wrote:
>
> > I am weighing in here on issues of great concern but non-technical.
> >
> > 1. One of the great things about Mahout is the community – not an easy
> > thing to have achieved given that people are dispersed geographically
> > and there is no single focus or company backing the project. In short,
> > the people who make Mahout are doing something cool.
> >
> > Suggestions to try to break it into different groups, Mahout-Spark and
> > Mahout2o, run counter to this success. Why fragment it at exactly the
> > moment when new contributors (from 0xdata) are coming forward? The
> > spirit of this project has been inclusive. Let's not change that now.
> >
> > 2. Sebastian pointed out:
> >
> > "We agreed to give the h2O guys a shot for exploration of a possible
> > integration into Mahout. We should be grateful that they are investing
> > a lot of time into this, and should help wherever we can. Once they
> > come up with a concrete proposal or patch, we will have a look at it,
> > have a deep, technical and polite discussion, and make a decision
> > afterwards."
> >
> > +1
> >
> > We agreed to explore the h2o option. Why use lots of time and
> > energy in re-visiting and second-guessing that decision? Let it go
> > forward, likely some great things will emerge for Mahout, and if not,
> > then we say "thank you" to h2o contributors for giving it a try.
> >
> > As the guys from h2o are adding new resources to do this development,
> > it is not really detracting anything from Mahout's resources except
> > when someone opens one of these discussions that lead to fragmentation
> > and distraction. I'm not a coder and not as technical as any of you,
> > but from my view it seems to be the talk and not the development that
> > is distracting.
> >
> > 3. Over the last year, there has been growing and widespread interest
> > in Mahout from the outside world, and now, with the new changes to
> > support Scala, Spark and h2o (possibly Stratosphere later) the growing
> > interest has turned into excitement. This is a great time for the
> > project – tons of effort but moving toward a big result.
> >
> > Users will have some excellent new choices, all parts of Mahout will
> > benefit. And if in the future it is seen that some of the new features
> > are not being widely or successfully used, they will be deprecated, as
> > was done during the big clean-up of the 0.8 release. New choices, new
> > ways to use Mahout, new people getting involved – this is excellent.
> >
> > 4. My thought is, stick together, embrace change, welcome newcomers
> > and be very proud to be building the new Mahout.
> >
> >
> >
> > On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
> > > For reasons of transparency in this discussion, I should add that I am
> a
> > > committer on the upcoming Stratosphere ASF podling, co-worker of the
> > > main developers and have contributed to it as part of my PhD.
> > >
> > > On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
> > >> Anand,
> > >>
> > >> I'm trying to answer some of your questions, and my answers highlight
> > >> the points that I would like to see clarified about h2o.
> > >>
> > >> On 04/28/2014 11:13 PM, Anand Avati wrote:
> > >>
> > >>> 1. Why is the DSL claiming to have (in its vision) logical vs
> physical
> > >>> separation if not for providing multiple compute backends?
> > >>
> > >> This is not a claim or a vision; the DSL already has this separation.
> > >> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, that's the logical
> > >> operator for executing a Transpose-Times-Self matrix multiplication. In
> > >> o.a.m.sparkbindings.blas.AtA you will find two physical operator
> > >> implementations for that. The choice of which one to use depends on
> > >> whether there is enough memory to hold certain intermediate results.
> > >>
> > >> The primary intention of a separation into logical and physical
> > >> operators is to allow for a declarative programming style on the user's
> > >> side and for an optimizer on the system side which automatically chooses
> > >> the optimal physical operator for the execution of a specific program.
> > >>
> > >> This choice of the physical operator might depend on the shape and
> > >> amount of the data processed as well as on the underlying available
> > >> resources. *The separation into logical and physical operators clearly
> > >> doesn't imply having multiple backends*. It only makes it very easy to
> > >> support them.
> > >>
> > >>>
> > >>> 2. Does the proposal of having a new DSL backend in the future (for
> e.g
> > >>> stratosphere as suggested elsewhere) make you:
> > >>
> > >>> -- worry that stratosphere would be a dependency to Mahout?
> > >>
> > >> Stratosphere has been accepted as an incubator project in the ASF
> > >> recently, so the worry about such a dependency is naturally less than
> > >> about an externally managed project like h2o.
> > >>
> > >>> -- worry that as a user/commiter/contributor you have to worry about
> a
> > >>> new
> > >>> framework?
> > >>
> > >> In my eyes, there is a big difference between Spark/Stratosphere and
> > >> h2o. Spark and Stratosphere have a clearly defined programming and
> > >> execution model. They execute programs that are composed of a DAG of
> > >> operators. The set of operators has clearly defined semantics and
> > >> parallelization strategies. If you compare their operators, you will
> > >> find that they offer pretty much the same in slightly different flavors.
> > >> For both, there are scientific papers that explain all these things in
> > >> detail.
> > >>
> > >> I have asked about a detailed description of h2o's programming model and
> > >> execution model and I searched the documentation, but I haven't been
> > >> able to find something that clearly describes how things are done. I
> > >> would love to read up on this, but until I'm presented with this, I have
> > >> to assume that such a principled foundation is missing.
> > >>
> > >>
> > >> --sebastian
> > >>
> > >
> > >
> >
>

Re: Straw poll re: H2O ?

Posted by Ted Dunning <te...@gmail.com>.

It seems to me that Sebastian and Ellen have hit on the right tack.

Let's get back to work making something cool here.  Let's build this
community up instead of having endlessly divisive discussions.

Let's get back to the Apache emphasis on do-acracy.



On Wed, Apr 30, 2014 at 11:36 AM, Ellen Friedman <b.ellen.friedman@gmail.com> wrote:

> I am weighing in here on issues of great concern but non-technical.
>
> 1. One of the great things about Mahout is the community – not an easy
> thing to have achieved given that people are dispersed geographically
> and there is no single focus or company backing the project. In short,
> the people who make Mahout are doing something cool.
>
> Suggestions to try to break it into different groups, Mahout-Spark and
> Mahout2o, run counter to this success. Why fragment it at exactly the
> moment when new contributors (from 0xdata) are coming forward? The
> spirit of this project has been inclusive. Let's not change that now.
>
> 2. Sebastian pointed out:
>
> "We agreed to give the h2O guys a shot for exploration of a possible
> integration into Mahout. We should be grateful that they are investing
> a lot of time into this, and should help wherever we can. Once they
> come up with a concrete proposal or patch, we will have a look at it,
> have a deep, technical and polite discussion, and make a decision
> afterwards."
>
> +1
>
> We agreed to explore the h2o option. Why use lots of time and
> energy in re-visiting and second-guessing that decision? Let it go
> forward, likely some great things will emerge for Mahout, and if not,
> then we say "thank you" to h2o contributors for giving it a try.
>
> As the guys from h2o are adding new resources to do this development,
> it is not really detracting anything from Mahout's resources except
> when someone opens one of these discussions that lead to fragmentation
> and distraction. I'm not a coder and not as technical as any of you,
> but from my view it seems to be the talk and not the development that
> is distracting.
>
> 3. Over the last year, there has been growing and widespread interest
> in Mahout from the outside world, and now, with the new changes to
> support Scala, Spark and h2o (possibly Stratosphere later) the growing
> interest has turned into excitement. This is a great time for the
> project – tons of effort but moving toward a big result.
>
> Users will have some excellent new choices, all parts of Mahout will
> benefit. And if in the future it is seen that some of the new features
> are not being widely or successfully used, they will be deprecated, as
> was done during the big clean-up of the 0.8 release. New choices, new
> ways to use Mahout, new people getting involved – this is excellent.
>
> 4. My thought is, stick together, embrace change, welcome newcomers
> and be very proud to be building the new Mahout.
>
>
>
> On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
> > For reasons of transparency in this discussion, I should add that I am a
> > committer on the upcoming Stratosphere ASF podling, co-worker of the
> > main developers and have contributed to it as part of my PhD.
> >
> > On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
> >> Anand,
> >>
> >> I'm trying to answer some of your questions, and my answers highlight
> >> the points that I would like to see clarified about h20.
> >>
> >> On 04/28/2014 11:13 PM, Anand Avati wrote:
> >>
> >>> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> >>> separation if not for providing multiple compute backends?
> >>
> >> This is not a claim or a vision, the DSL already has this separation.
> >> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, that's the logical
> >> operator for executing a Transpose-Times-Self matrix multiplication. In
> >> o.a.m.sparkbindings.blas.AtA you will find two physical operator
> >> implementations for that. The choice which one to use depends on whether
> >> there is enough memory to hold certain intermediary results in memory.
> >>
> >> The primary intention of a separation into logical and physical
> >> operators is to allow for a declarative programming style on the users
> >> side and for an optimizer on the system side which automatically chooses
> >> the optimal physical operator for the execution of a specific program.
> >>
> >> This choice of the physical operator might depend on the shape and
> >> amount of the data processed as well as on the underlying available
> >> resources. *The separation into logical and physical operators clearly
> >> doesn't imply to have multiple backends*. It only makes it very easy to
> >> support them.
> >>
> >>>
> >>> 2. Does the proposal of having a new DSL backend in the future (for e.g
> >>> stratosphere as suggested elsewhere) make you:
> >>
> >>> -- worry that stratosphere would be a dependency to Mahout?
> >>
> >> Stratosphere has been accepted as an incubator project in the ASF
> >> recently, so the worry about such a dependency is naturally less than
> >> about an externally managed project like h20.
> >>
> >>> -- worry that as a user/committer/contributor you have to worry about a
> >>> new
> >>> framework?
> >>
> >> In my eyes, there is a big difference between Spark/Stratosphere and
> >> h20. Spark and Stratosphere have a clearly defined programming and
> >> execution model. They execute programs that are composed of a DAG of
> >> operators. The set of operators has clearly defined semantics and
> >> parallelization strategies. If you compare their operators, you will
> >> find that they offer pretty much the same in slightly different flavors.
> >> For both, there are scientific papers that in detail explain all these
> >> things.
> >>
> >> I have asked about a detailed description of h20's programming model and
> >> execution model and I searched the documentation, but I haven't been
> >> able to find something that clearly describes how things are done. I
> >> would love to read up on this, but until I'm presented with this, I have
> >> to assume that such a principled foundation is missing.
> >>
> >>
> >> --sebastian
> >>
> >
> >
>
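
The logical-versus-physical separation that Sebastian describes in the quoted text above can be sketched in a few lines. This is only an illustration in Python, not Mahout code: the names OpAtA-like class `LogicalAtA`, `ata_in_memory`, `ata_streaming`, `choose_physical` and the "memory budget in cells" criterion are all assumptions made up for the example. The point is that the caller states what to compute (A'A) and a small optimizer picks how.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalAtA:
    """Logical operator: 'compute A-transpose times A'. Says nothing about how."""
    rows: tuple  # the matrix A as a tuple of row tuples


def ata_in_memory(rows):
    """Physical operator 1: build each result cell from full column dot products."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def ata_streaming(rows):
    """Physical operator 2: accumulate one outer product per row."""
    n = len(rows[0])
    acc = [[0.0] * n for _ in range(n)]
    for r in rows:
        for i in range(n):
            for j in range(n):
                acc[i][j] += r[i] * r[j]
    return acc


def choose_physical(op, memory_budget_cells):
    """Hypothetical optimizer rule: use the in-memory plan only if A fits the budget."""
    cells = len(op.rows) * len(op.rows[0])
    return ata_in_memory if cells <= memory_budget_cells else ata_streaming


def execute(op, memory_budget_cells):
    """Run the logical operator with whichever physical operator was chosen."""
    return choose_physical(op, memory_budget_cells)(op.rows)
```

Both physical operators return the same matrix; only the execution strategy differs, which is why the separation does not by itself imply multiple backends.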

Re: Straw poll re: H2O ?

Posted by Cliff Click <cc...@gmail.com>.

Wow, the mahout mailing list is busy.  Pardon me if I don't keep up with 
all that flows thru....

The compression/decompression stuff is tightly coupled with a few parts 
of H2O.
The CSV file parser does compression on data-inhale, and this is crucial 
in parsing datasets that fit in DRAM - after compression, but not before 
compression - without requiring an intermediate spill-to-disk.  (Also 
"CSV" is loosely defined to cover any line/field oriented text file, 
including e.g. Hive files, tab-separated, etc)

The data is aligned across the cluster in a way to allow fast parallel 
access in the common cases of e.g. R data-frame-like access.  The 
column count is expected to be modest - a few thousand to a max 
of perhaps a million.  No limit on rows (other than what fits in DRAM).  
The compression follows the alignment in DRAM.

If you want "just the compression", you'll have to take the 
column-alignment and distribution also.
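
As a rough picture of why compression is tied to the column layout, here is a minimal Python sketch of per-chunk column compression. It is not H2O's actual scheme: the function names and the "bias plus narrowest unsigned width" rule are assumptions chosen for illustration. Each chunk of one column picks its own encoding, so the encoding only makes sense together with the way the column is cut into chunks.

```python
from array import array


def compress_chunk(values):
    """Store (value - bias) in the narrowest unsigned width that fits the chunk."""
    bias = min(values)
    span = max(values) - bias
    for typecode in ("B", "H", "I"):
        # Largest value representable in this typecode on the current platform.
        limit = (1 << (8 * array(typecode).itemsize)) - 1
        if span <= limit:
            return bias, array(typecode, (v - bias for v in values))
    # Fall back to plain 64-bit signed storage with no bias.
    return 0, array("q", values)


def decompress_chunk(bias, data):
    """Recover the original integers from a compressed chunk."""
    return [bias + d for d in data]
```

A chunk such as [1000000, 1000003, 1000250] then needs one byte per value instead of eight, at the cost of one add per element on the way out.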

Ted mentioned byte-code injection - H2O does that for serialization of 
POJOs (Plain Old Java Objects).  I'll warrant that H2O has the fastest 
serialize/deserialize path on the planet (with heavy compression of POJO 
primitive arrays specifically for ML algorithms)... but it's not 
directly tied to the compression of Big Data in RAM.  The next step up 
in speed for decompressing the Big Data WOULD be to do byte-code 
injection, however.  So far we're outrunning memory bandwidth and don't 
need BCI for the Big Data (but clearly need it for the POJOs).

Cliff

On 4/30/2014 4:49 PM, Ted Dunning wrote:
> I couldn't say.
>
> Let's invite the 0xdata to show us what can happen.
>
>
>
>
> On Thu, May 1, 2014 at 1:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> This is interesting. And this happens in one node. Can it be decoupled from
>> parallelization concerns and re-used? (proposal D)
>>
>>
>> On Wed, Apr 30, 2014 at 4:09 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>> I should add that the way that the compression is done is pretty cool for
>>> speed.  The basic idea is that byte code engineering is used to directly
>>> inject the decompression and compression code into the user code.  This
>>> allows format conditionals to be hoisted outside the parallel loop
>>> entirely.  This drops decompression overhead to just a few cycles.  This
>> is
>>> necessary because the point is to allow the inner loop to proceed at L1
>>> speeds instead of L3 speeds (really L3 / compression ratio).
>>>
>>>
>>>
>>> On Thu, May 1, 2014 at 12:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
>>> wrote:
>>>
>>>> On Wed, Apr 30, 2014 at 3:24 PM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>>
>>>>> Inline
>>>>>
>>>>>
>>>>> On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>> wrote:
>>>>>
>>>>>> On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <
>> ted.dunning@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> My motivation to accept comes from the fact that they have
>> machine
>>>>>> learning
>>>>>>> codes that are as fast as what google has internally.  They
>>>> completely
>>>>>>> crush all of the spark efforts on speed.
>>>>>>>
>>>>>> correct me if i am wrong. h20 performance strengths come from speed
>>> of
>>>>>> in-core computations and efficient compression (that's what i heard
>>> at
>>>>>> least).
>>>>>>
>>>>> Those two factors are key.  In addition, the ability to dispatch
>>> parallel
>>>>> computations with microsecond latencies is also important as well as
>>> the
>>>>> ability to transparently communicate at high speeds between processes
>>>> both
>>>>> local and remote.
>>>>>
>>>> This is kind of an old news.  They all do, for years now. I've been
>>>> building a system that does real time distributed pipelines (~30 ms to
>>>> start all steps in pipeline + in-core complexity)  for years.  Note
>> that
>>>> node-to-node hop in clouds are usually mean at about 10ms so
>> microseconds
>>>> are kind of out of question for network performance reasons in real
>> life
>>>> except for private racks.
>>>>
>>>> The only thing that doesn't do this is the MR variety of Hadoop.
>>>>

Re: Straw poll re: H2O ?

Posted by Ted Dunning <te...@gmail.com>.

I couldn't say.

Let's invite the 0xdata to show us what can happen.




On Thu, May 1, 2014 at 1:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> This is interesting. And this happens in one node. Can it be decoupled from
> parallelization concerns and re-used? (proposal D)
>
>
> On Wed, Apr 30, 2014 at 4:09 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I should add that the way that the compression is done is pretty cool for
> > speed.  The basic idea is that byte code engineering is used to directly
> > inject the decompression and compression code into the user code.  This
> > allows format conditionals to be hoisted outside the parallel loop
> > entirely.  This drops decompression overhead to just a few cycles.  This
> is
> > necessary because the point is to allow the inner loop to proceed at L1
> > speeds instead of L3 speeds (really L3 / compression ratio).
> >
> >
> >
> > On Thu, May 1, 2014 at 12:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > On Wed, Apr 30, 2014 at 3:24 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > Inline
> > > >
> > > >
> > > > On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > > > wrote:
> > > >
> > > > > On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <
> ted.dunning@gmail.com>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > My motivation to accept comes from the fact that they have
> machine
> > > > > learning
> > > > > > codes that are as fast as what google has internally.  They
> > > completely
> > > > > > crush all of the spark efforts on speed.
> > > > > >
> > > > >
> > > > > correct me if i am wrong. h20 performance strengths come from speed
> > of
> > > > > in-core computations and efficient compression (that's what i heard
> > at
> > > > > least).
> > > > >
> > > >
> > > > Those two factors are key.  In addition, the ability to dispatch
> > parallel
> > > > computations with microsecond latencies is also important as well as
> > the
> > > > ability to transparently communicate at high speeds between processes
> > > both
> > > > local and remote.
> > > >
> > >
> > > This is kind of an old news.  They all do, for years now. I've been
> > > building a system that does real time distributed pipelines (~30 ms to
> > > start all steps in pipeline + in-core complexity)  for years.  Note
> that
> > > node-to-node hop in clouds are usually mean at about 10ms so
> microseconds
> > > are kind of out of question for network performance reasons in real
> life
> > > except for private racks.
> > >
> > > The only thing that doesn't do this is the MR variety of Hadoop.
> > >
> >
>

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

This is interesting. And this happens in one node. Can it be decoupled from
parallelization concerns and re-used? (proposal D)


On Wed, Apr 30, 2014 at 4:09 PM, Ted Dunning <te...@gmail.com> wrote:

> I should add that the way that the compression is done is pretty cool for
> speed.  The basic idea is that byte code engineering is used to directly
> inject the decompression and compression code into the user code.  This
> allows format conditionals to be hoisted outside the parallel loop
> entirely.  This drops decompression overhead to just a few cycles.  This is
> necessary because the point is to allow the inner loop to proceed at L1
> speeds instead of L3 speeds (really L3 / compression ratio).
>
>
>
> On Thu, May 1, 2014 at 12:35 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > On Wed, Apr 30, 2014 at 3:24 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Inline
> > >
> > >
> > > On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > >
> > > > > My motivation to accept comes from the fact that they have machine
> > > > learning
> > > > > codes that are as fast as what google has internally.  They
> > completely
> > > > > crush all of the spark efforts on speed.
> > > > >
> > > >
> > > > correct me if i am wrong. h20 performance strengths come from speed
> of
> > > > in-core computations and efficient compression (that's what i heard
> at
> > > > least).
> > > >
> > >
> > > Those two factors are key.  In addition, the ability to dispatch
> parallel
> > > computations with microsecond latencies is also important as well as
> the
> > > ability to transparently communicate at high speeds between processes
> > both
> > > local and remote.
> > >
> >
> > This is kind of an old news.  They all do, for years now. I've been
> > building a system that does real time distributed pipelines (~30 ms to
> > start all steps in pipeline + in-core complexity)  for years.  Note that
> > node-to-node hop in clouds are usually mean at about 10ms so microseconds
> > are kind of out of question for network performance reasons in real life
> > except for private racks.
> >
> > The only thing that doesn't do this is the MR variety of Hadoop.
> >
>

Re: Straw poll re: H2O ?

Posted by Ted Dunning <te...@gmail.com>.

I should add that the way that the compression is done is pretty cool for
speed.  The basic idea is that byte code engineering is used to directly
inject the decompression and compression code into the user code.  This
allows format conditionals to be hoisted outside the parallel loop
entirely.  This drops decompression overhead to just a few cycles.  This is
necessary because the point is to allow the inner loop to proceed at L1
speeds instead of L3 speeds (really L3 / compression ratio).
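
The "hoisting" part of this can be shown without any byte-code work. The sketch below is plain Python and only illustrates the loop transformation, not how H2O implements it; the `Chunk` class and the three format names are assumptions invented for the example. The first function tests the chunk format on every element, the second tests it once and then runs a branch-free inner loop.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    fmt: str            # "raw", "biased" or "scaled" (hypothetical formats)
    data: list          # stored (compressed) values
    bias: float = 0.0
    scale: float = 1.0


def total_per_element(chunk):
    """Format conditional evaluated inside the loop, once per element."""
    total = 0.0
    for raw in chunk.data:
        if chunk.fmt == "raw":
            total += raw
        elif chunk.fmt == "biased":
            total += raw + chunk.bias
        else:  # "scaled"
            total += raw * chunk.scale + chunk.bias
    return total


def total_hoisted(chunk):
    """Format conditional hoisted: decided once, then a branch-free inner loop."""
    total = 0.0
    if chunk.fmt == "raw":
        for raw in chunk.data:
            total += raw
    elif chunk.fmt == "biased":
        bias = chunk.bias
        for raw in chunk.data:
            total += raw + bias
    else:  # "scaled"
        bias, scale = chunk.bias, chunk.scale
        for raw in chunk.data:
            total += raw * scale + bias
    return total
```

Both functions perform the same arithmetic in the same order, so they return identical results; the second simply does the format test once per chunk instead of once per value.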



On Thu, May 1, 2014 at 12:35 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Wed, Apr 30, 2014 at 3:24 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Inline
> >
> >
> > On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > >
> > > > My motivation to accept comes from the fact that they have machine
> > > learning
> > > > codes that are as fast as what google has internally.  They
> completely
> > > > crush all of the spark efforts on speed.
> > > >
> > >
> > > correct me if i am wrong. h20 performance strengths come from speed of
> > > in-core computations and efficient compression (that's what i heard at
> > > least).
> > >
> >
> > Those two factors are key.  In addition, the ability to dispatch parallel
> > computations with microsecond latencies is also important as well as the
> > ability to transparently communicate at high speeds between processes
> both
> > local and remote.
> >
>
> This is kind of an old news.  They all do, for years now. I've been
> building a system that does real time distributed pipelines (~30 ms to
> start all steps in pipeline + in-core complexity)  for years.  Note that
> node-to-node hop in clouds are usually mean at about 10ms so microseconds
> are kind of out of question for network performance reasons in real life
> except for private racks.
>
> The only thing that doesn't do this is the MR variety of Hadoop.
>

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Fri, May 2, 2014 at 2:29 PM, Cliff Click <cc...@gmail.com> wrote:

> Exactly what we're trying to do is scope out that effort, with the goal of
> being able to reuse Mahout algos.
> Mahout's well-known for e.g. its recommendation engine; we'd like to see
> that (and other algos) leveraged.
>

Leveraged -- in generic form or H2O-specific form (a ported form)?


> So yes, "integrate" with Mahout - and help people already using Mahout get
> something better without having to drop Mahout altogether.
>
> Cliff
>
>
>
>
> On 5/1/2014 9:21 AM, Dmitriy Lyubimov wrote:
>
>> What i really want to ask you is this. I asked this question before, and
>> did not get answer. But I assume now you know a bit more about Mahout.
>>
>> What is your opinion/vision on actually _integrating_ with Mahout?
>>
>> Integration effort in my definition would be (1) reusing some of Mahout
>> implementations, and/or (2) helping some of mahout algorithms/components
>> to
>> do their job better.
>>
>> What you have been doing to date was something roughly amounting to
>>   building (porting) Mahout (or non-Mahout) algorithms for H20. This, by
>> definition, is not an integration effort and could happily run forever
>> without ever requiring a Mahout commit.
>>
>> I would be interested to hear your thoughts again on what you think it
>> means to _integrate_ with Mahout.
>>
>

Re: Straw poll re: H2O ?

Posted by Cliff Click <cc...@gmail.com>.

Exactly what we're trying to do is scope out that effort, with the goal 
of being able to reuse Mahout algos.
Mahout's well-known for e.g. its recommendation engine; we'd like to 
see that (and other algos) leveraged.
So yes, "integrate" with Mahout - and help people already using Mahout 
get something better without having to drop Mahout altogether.

Cliff



On 5/1/2014 9:21 AM, Dmitriy Lyubimov wrote:
> What i really want to ask you is this. I asked this question before, and
> did not get answer. But I assume now you know a bit more about Mahout.
>
> What is your opinion/vision on actually _integrating_ with Mahout?
>
> Integration effort in my definition would be (1) reusing some of Mahout
> implementations, and/or (2) helping some of mahout algorithms/components to
> do their job better.
>
> What you have been doing to date was something roughly amounting to
>   building (porting) Mahout (or non-Mahout) algorithms for H20. This, by
> definition, is not an integration effort and could happily run forever
> without ever requiring a Mahout commit.
>
> I would be interested to hear your thoughts again on what you think it
> means to _integrate_ with Mahout.

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Thank you Cliff, this is excellent. Obviously you guys invested a lot of
time and private money in engineering aspects of vector serialization and
instrumenting byte codes.

What I really want to ask you is this. I asked this question before, and
did not get an answer. But I assume now you know a bit more about Mahout.

What is your opinion/vision on actually _integrating_ with Mahout?

Integration effort in my definition would be (1) reusing some of Mahout's
implementations, and/or (2) helping some of Mahout's algorithms/components to
do their job better.

What you have been doing to date was something roughly amounting to
building (porting) Mahout (or non-Mahout) algorithms for H2O. This, by
definition, is not an integration effort and could happily run forever
without ever requiring a Mahout commit.

I would be interested to hear your thoughts again on what you think it
means to _integrate_ with Mahout.

On Thu, May 1, 2014 at 8:40 AM, Cliff Click <cc...@gmail.com> wrote:

> H2O will launch an internal Task in the single-digit microsecond range.
>  Because of this, we can launch 100,000's (millions?) a second... leading
> to fine-grained data parallelism, and high CPU utilization.  This is a big
> piece of our single-node speed.  Some other distributed Task-launching
> solutions I've seen tend to require a network-hop per-task... leading to
> your 10ms to launch as task requirement, leading to a limit of a few 1000
> Tasks/sec requiring tasks that are much larger and coarser than H2O's...
> leading to much lower CPU utilization.
>
> Also, I'm getting 200micro-second ping's between my datacenter
> machines.... down from 10msec.  It's decent commodity hardware, nothing
> special.  Meaning: H2O can launch task on an entire 32-node cluster in
> about 1msec, starting from a single driving node (log-tree fanout, depth 5,
> 200micro-second single UDP packet launch, 1micro-second internal task
> launch).
>
> And this latency matters when the work itself is lots and lots "small"
> jobs, as is common when a DSL such as Mahout or Spark/Scala or R is driving
> simple operators over bulk data.
>
> Cliff
>
>
>
> On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
>
>> This is kind of an old news. They all do, for years now. I've been
>> building a system that does real time distributed pipelines (~30 ms to
>> start all steps in pipeline + in-core complexity) for years. Note that
>> node-to-node hop in clouds are usually mean at about 10ms so microseconds
>> are kind of out of question for network performance reasons in real life
>> except for private racks. The only thing that doesn't do this is the MR
>> variety of Hadoop.
>>
>
>

Re: Straw poll re: H2O ?

Posted by Pat Ferrel <pa...@gmail.com>.

OK, then how about h2o kmeans v MLlib?

On May 1, 2014, at 9:20 AM, Cliff Click <cc...@gmail.com> wrote:

Naw - the KMeans was strictly as a "sample" attempt to blend the H2O & Mahout coding.

Goal really is to get feedback from this group on how well that attempt is working.
Is there a better API?  What is it?
What can be improved?
How clumsy is the current marriage of H2OMatrix vs Matrix?
What's the mental cost of H2O's "tall skinny data" vs Mahout's All-The-Worlds-A-(squarish)-Matrix model?

Right now we're working on cleaning up the H2O internal DSL to make it better support either Spark/Scala and/or Dmitriy's DSL - plus also our commitment to running R.  I'm hoping Mahout volunteers will peek at it

 https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java

and comment before we go further down this path.

If - ultimately - Mahout decides to drop the current Matrix API for something more bulk/scale/parallel friendly - we're happy to go along with that also.

Cliff

On 5/1/2014 9:01 AM, Pat Ferrel wrote:
> Odd that the Kmeans implementation isn’t a way to demonstrate performance. Seems like anyone could grab that and try it with the same data on MLlib and perform a principled analysis. Or just run the same data through h2o and MLlib. This seems like a good way to look at the forrest instead of the trees.
> 
> BTW any generalization effort to support two execution engines will have to abstract away the SparkContext. This is where IO, job control, and engine tuning happens. Abstracting the DSL is not sufficient. Any hypothetical MahoutContext (a good idea for sure) if it deviated significantly from a SparkContext will have broad impact.
> 
> http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
> 
>

Re: Straw poll re: H2O ?

Posted by Cliff Click <cc...@gmail.com>.

"detailed description of h20's programming and execution model."

No *formal* documentation for this exists; there has been no time to 
write such a thing.
There are easy-to-find SlideShare & video talks.  Here are two:
  - http://www.infoq.com/presentations/api-memory-analytics
  - http://www.infoq.com/interviews/click-0xdata

Summary:
- A high-performance in-memory K/V store (cache hits are 150 ns, 
misses depend on network transfer times).  Supports full JMM exact 
semantics & transactions.  Used to hold the Big Data & to control 
computations.
- Big Data support via Frames/Vecs/Chunks - see the above slides for a 
graphical overview; compression "is an implementation feature" but not 
visible in the execution model except as speed or size constraints.
- A well-tuned data-ingestion system.
- Map/Reduce coding style, uses Java 1.7's Fork/Join on a single node, 
but distributed across nodes.  Maps are fine-grained F/J tasks and can 
produce both a Big output (distributed parallel writing to Frames/Vecs) 
and a Small output (anything in a POJO).  Reductions are also 
fine-grained, and happen anytime 2 maps are done... so no separate 
"reduction" phase.  Not the Hadoop M/R - no sort or shuffle steps, 
everything in DRAM.
- REST/JSON access to most algos & coding.  Web browser/HTML over that.
- Internal DSL - a work in progress.  Right now it converts a subset of 
the R language to ASTs, then executes the ASTs.  Covers a fairly large 
subset of the bulk/array operators in R, and expressions built thereof.  
Includes 1st-class functions and e.g. GroupBy (ddply in R lingo).  
Expressions like

  apply(someFrame,2,function(x){ ifelse(is.na(x),mean(x),x)})

will replace NA's in "someFrame" with the mean of the column.  It's R 
syntax (or very close to R), not Scala.
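
For readers who do not speak R, the effect of that last expression can be sketched in Python. This is an illustration only, not H2O's DSL: the function name is made up, missing values are modelled as None, and it assumes the column mean is taken over the non-missing values (the message does not say how the DSL treats NA's inside mean()).

```python
def impute_column_means(columns):
    """Replace missing values (None) in each column with that column's mean.

    `columns` maps a column name to a list of numbers or None.
    Assumption: the mean is computed over the non-missing values only.
    """
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # A column with no observed values has no mean; leave it untouched.
        mean = sum(present) / len(present) if present else None
        result[name] = [mean if v is None else v for v in values]
    return result
```

The work is column-at-a-time and each column is independent of the others, which is the shape of computation the Frames/Vecs layout described above is built for.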

Cliff



On 5/1/2014 10:13 AM, Dmitriy Lyubimov wrote:
>
>> I'd be happy to see a concept of how to bring the operations of the DSL
>> onto h20, as well as a detailed description of h20's programming and
>> execution model.
> +1.
>
>>
>> --sebastian
>>

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Thu, May 1, 2014 at 10:08 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Goal really is to get feedback from this group on how well that attempt
>> is working.
>> Is there a better API?  What is it?
>> What can be improved?
>> How clumsy is the current marriage of H2OMatrix vs Matrix?
>> What's the mental cost of H2O's "tall skinny data" vs Mahout's
>> All-The-Worlds-A-(squarish)-Matrix model?
>>
>
> I think Mahout's matrix was never intended to be an abstraction for
> running distributed computations. I don't understand why we would want to
> have an h20 backed matrix which offers methods that users should not call
> because they break the underlying partitioning scheme such as assignRow(),
> which has the comment "// Calling this likely indicates a huge performance
> bug." To me this indicates that the underlying design is broken.
>
> This is exactly the reason why there is a separation between in-core
> matrices and distributed matrices in the DSL.

+1. This captures the very essence of my sole objection in M-1500 exactly.

>
>
>  Right now we're working on cleaning up the H2O internal DSL to make it
>> better support either Spark/Scala and/or Dmitriy's DSL - plus also our
>> commitment to running R.  I'm hoping Mahout volunteers will peek at it
>>
>
> I'd be happy to see a concept of how to bring the operations of the DSL
> onto h20, as well as a detailed description of h20's programming and
> execution model.

+1.

>
>
> --sebastian
>

Re: Straw poll re: H2O ?

Posted by Sebastian Schelter <ss...@apache.org>.

> Goal really is to get feedback from this group on how well that attempt
> is working.
> Is there a better API?  What is it?
> What can be improved?
> How clumsy is the current marriage of H2OMatrix vs Matrix?
> What's the mental cost of H2O's "tall skinny data" vs Mahout's
> All-The-Worlds-A-(squarish)-Matrix model?

I think Mahout's matrix was never intended to be an abstraction for 
running distributed computations. I don't understand why we would want 
to have an H2O-backed matrix which offers methods that users should not 
call because they break the underlying partitioning scheme, such as 
assignRow(), which has the comment "// Calling this likely indicates a 
huge performance bug." To me this indicates that the underlying design 
is broken.

This is exactly the reason why there is a separation between in-core 
matrices and distributed matrices in the DSL.

> Right now we're working on cleaning up the H2O internal DSL to make it
> better support either Spark/Scala and/or Dmitriy's DSL - plus also our
> commitment to running R.  I'm hoping Mahout volunteers will peek at it

I'd be happy to see a concept of how to bring the operations of the DSL 
onto H2O, as well as a detailed description of H2O's programming and 
execution model.

--sebastian

Re: Straw poll re: H2O ?

Posted by Cliff Click <cc...@gmail.com>.

Naw - the KMeans was strictly as a "sample" attempt to blend the H2O & 
Mahout coding.

Goal really is to get feedback from this group on how well that attempt 
is working.
Is there a better API?  What is it?
What can be improved?
How clumsy is the current marriage of H2OMatrix vs Matrix?
What's the mental cost of H2O's "tall skinny data" vs Mahout's 
All-The-Worlds-A-(squarish)-Matrix model?

Right now we're working on cleaning up the H2O internal DSL to make it 
better support either Spark/Scala and/or Dmitriy's DSL - plus also our 
commitment to running R.  I'm hoping Mahout volunteers will peek at it

   https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java

and comment before we go further down this path.

If - ultimately - Mahout decides to drop the current Matrix API for 
something more bulk/scale/parallel friendly - we're happy to go along 
with that also.

Cliff

On 5/1/2014 9:01 AM, Pat Ferrel wrote:
> Odd that the Kmeans implementation isn’t a way to demonstrate performance. Seems like anyone could grab that and try it with the same data on MLlib and perform a principled analysis. Or just run the same data through h2o and MLlib. This seems like a good way to look at the forrest instead of the trees.
>
> BTW any generalization effort to support two execution engines will have to abstract away the SparkContext. This is where IO, job control, and engine tuning happens. Abstracting the DSL is not sufficient. Any hypothetical MahoutContext (a good idea for sure) if it deviated significantly from a SparkContext will have broad impact.
>
> http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
>
>

Re: Straw poll re: H2O ?

Posted by Pat Ferrel <pa...@gmail.com>.

Odd that the KMeans implementation isn’t a way to demonstrate performance. Seems like anyone could grab that and try it with the same data on MLlib and perform a principled analysis. Or just run the same data through h2o and MLlib. This seems like a good way to look at the forest instead of the trees.

BTW any generalization effort to support two execution engines will have to abstract away the SparkContext. This is where IO, job control, and engine tuning happen. Abstracting the DSL is not sufficient. Any hypothetical MahoutContext (a good idea for sure), if it deviated significantly from a SparkContext, will have broad impact.

http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
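
To make the point about abstracting the context concrete, here is a toy Python sketch. Everything in it is hypothetical: `EngineContext`, `InMemoryContext`, `read_rows` and `map_reduce` are names invented for the example and correspond to no existing Mahout, Spark or H2O API. It only shows the shape of the idea: algorithm code talks to an engine-neutral context, and each engine supplies its own implementation of IO and job execution.

```python
import functools
from abc import ABC, abstractmethod


class EngineContext(ABC):
    """Hypothetical engine-neutral context: owns IO and job execution."""

    @abstractmethod
    def read_rows(self, source):
        """Load a dataset as a sequence of rows."""

    @abstractmethod
    def map_reduce(self, rows, map_fn, reduce_fn):
        """Apply map_fn to every row and combine the results with reduce_fn."""


class InMemoryContext(EngineContext):
    """Toy single-process backend, standing in for a real engine binding."""

    def __init__(self, datasets):
        self._datasets = datasets

    def read_rows(self, source):
        return list(self._datasets[source])

    def map_reduce(self, rows, map_fn, reduce_fn):
        return functools.reduce(reduce_fn, map(map_fn, rows))


def column_sums(ctx, source):
    """Algorithm code: written against the context only, never a specific engine."""
    rows = ctx.read_rows(source)
    return ctx.map_reduce(
        rows,
        lambda row: list(row),
        lambda a, b: [x + y for x, y in zip(a, b)],
    )
```

Anything an algorithm currently reaches for directly on the SparkContext would have to move behind an interface like this, which is where the broad impact comes from.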

On May 1, 2014, at 8:40 AM, Cliff Click <cc...@gmail.com> wrote:

H2O will launch an internal Task in the single-digit microsecond range. Because of this, we can launch 100,000's (millions?) a second... leading to fine-grained data parallelism, and high CPU utilization. This is a big piece of our single-node speed. Some other distributed Task-launching solutions I've seen tend to require a network-hop per-task... leading to your 10ms-to-launch-a-task requirement, leading to a limit of a few 1000 Tasks/sec requiring tasks that are much larger and coarser than H2O's... leading to much lower CPU utilization.

Also, I'm getting 200-microsecond pings between my datacenter machines.... down from 10msec. It's decent commodity hardware, nothing special. Meaning: H2O can launch a task on an entire 32-node cluster in about 1msec, starting from a single driving node (log-tree fanout, depth 5, 200-microsecond single UDP packet launch, 1-microsecond internal task launch).

And this latency matters when the work itself is lots and lots of "small" jobs, as is common when a DSL such as Mahout or Spark/Scala or R is driving simple operators over bulk data.

Cliff

On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
> This is kind of an old news. They all do, for years now. I've been building a system that does real time distributed pipelines (~30 ms to start all steps in pipeline + in-core complexity) for years. Note that node-to-node hop in clouds are usually mean at about 10ms so microseconds are kind of out of question for network performance reasons in real life except for private racks. The only thing that doesn't do this is the MR variety of Hadoop.

Re: Straw poll re: H2O ?

Posted by Cliff Click <cc...@gmail.com>.

H2O will launch an internal Task in the single-digit microsecond range.  
Because of this, we can launch 100,000's (millions?) a second... leading 
to fine-grained data parallelism, and high CPU utilization.  This is a 
big piece of our single-node speed.  Some other distributed 
Task-launching solutions I've seen tend to require a network-hop 
per-task... leading to your 10ms-to-launch-a-task requirement, leading 
to a limit of a few 1000 Tasks/sec requiring tasks that are much larger 
and coarser than H2O's... leading to much lower CPU utilization.

Also, I'm getting 200-microsecond pings between my datacenter 
machines... down from 10msec.  It's decent commodity hardware, nothing 
special.  Meaning: H2O can launch a task on an entire 32-node cluster in 
about 1msec, starting from a single driving node (log-tree fanout, depth 
5, 200-microsecond single UDP packet launch, 1-microsecond internal task 
launch).
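
A rough sketch of the arithmetic above, for anyone who wants to check it.
It assumes a binary fanout tree (inferred from "depth 5" for 32 nodes,
since 2^5 = 32), one 200-microsecond hop per tree level, and one
1-microsecond local launch at the end. The function names are made up for
illustration and are not part of H2O.

```python
def fanout_depth(nodes: int, branching: int = 2) -> int:
    """Tree levels needed for one driver to reach `nodes` machines."""
    depth = 0
    reached = 1  # the driving node itself
    while reached < nodes:
        reached *= branching  # every reached node forwards to `branching` nodes
        depth += 1
    return depth


def cluster_launch_us(nodes: int, hop_us: float = 200.0,
                      local_launch_us: float = 1.0, branching: int = 2) -> float:
    """Estimated time until the farthest node has started its task, in microseconds."""
    return fanout_depth(nodes, branching) * hop_us + local_launch_us


def serial_launches_per_sec(launch_us: float) -> float:
    """Upper bound on launches per second for a single serial launcher."""
    return 1_000_000 / launch_us


print(fanout_depth(32))              # tree depth for 32 nodes
print(cluster_launch_us(32))         # microseconds, about 1 msec
print(serial_launches_per_sec(5.0))  # launches/sec at 5 microseconds each
```

With these assumptions the 32-node launch comes to 5 hops plus one local
launch, i.e. roughly 1 msec, and a single-digit-microsecond launch cost
allows 100,000's of launches per second from one launcher.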

And this latency matters when the work itself is lots and lots of "small" 
jobs, as is common when a DSL such as Mahout or Spark/Scala or R is 
driving simple operators over bulk data.

Cliff

On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
> This is kind of an old news. They all do, for years now. I've been 
> building a system that does real time distributed pipelines (~30 ms to 
> start all steps in pipeline + in-core complexity) for years. Note that 
> node-to-node hop in clouds are usually mean at about 10ms so 
> microseconds are kind of out of question for network performance 
> reasons in real life except for private racks. The only thing that 
> doesn't do this is the MR variety of Hadoop.

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Wed, Apr 30, 2014 at 3:24 PM, Ted Dunning <te...@gmail.com> wrote:

> Inline
>
>
> On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > >
> > > My motivation to accept comes from the fact that they have machine
> > learning
> > > codes that are as fast as what google has internally.  They completely
> > > crush all of the spark efforts on speed.
> > >
> >
> > correct me if i am wrong. h20 performance strengths come from speed of
> > in-core computations and efficient compression (that's what i heard at
> > least).
> >
>
> Those two factors are key.  In addition, the ability to dispatch parallel
> computations with microsecond latencies is also important as well as the
> ability to transparently communicate at high speeds between processes both
> local and remote.
>

This is kind of old news.  They all do, and have for years now. I've been
building a system that does real-time distributed pipelines (~30 ms to
start all steps in pipeline + in-core complexity) for years.  Note that
node-to-node hops in clouds usually average about 10ms, so microseconds
are kind of out of the question for network performance reasons in real
life, except for private racks.

The only thing that doesn't do this is the MR variety of Hadoop.

Re: Straw poll re: H2O ?

Posted by Ted Dunning <te...@gmail.com>.

Inline


On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> >
> > My motivation to accept comes from the fact that they have machine
> learning
> > codes that are as fast as what google has internally.  They completely
> > crush all of the spark efforts on speed.
> >
>
> correct me if i am wrong. h20 performance strengths come from speed of
> in-core computations and efficient compression (that's what i heard at
> least).
>

Those two factors are key.  In addition, the ability to dispatch parallel
computations with microsecond latencies is also important as well as the
ability to transparently communicate at high speeds between processes both
local and remote.


> in DSL effort these are managed by Mahout-math (in-core vector and matrix
> implementations and speed of their serialization, respectively), regardless
> of the distributed model.
>
> I am not aware of any benchmark done of say in-core sparse matrix by
> in-core sparse matrix multiplication between Mahout-math and h2o (probably
> because h2o doesn't have in-core matrices as it stands), but assuming there
> were, the above statement should be corrected in a sense that h2o beats the
> dust out of Mahout-math in speed of computation and serialization.
>

h2o does have parallel in-core matrix operations.  They are really fast.
That isn't the same as having a real benchmark against Mahout ops.

> That statement, as it stands today, is found to be easily agreeable with
> by me.
>
> However, the point is to provide proper decoupling of programming model,
> distributed block management, and in-core computation/serialization
> concerns. If I have a proper abstraction and decoupling for in-core
> operations and serialization, I am free to plug in any in-core math and
> serialization of thereof, including  h2o. Therefore, this becomes secondary
> issue as opposed to general architecture.
>

This is the goal.  Reality is bound to be different.  But that is what
attempting to plug in different modules is all about.

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <te...@gmail.com> wrote:

>
> My motivation to accept comes from the fact that they have machine learning
> codes that are as fast as what google has internally.  They completely
> crush all of the spark efforts on speed.
>

Correct me if I am wrong: h2o's performance strengths come from the speed of
in-core computations and efficient compression (that's what I heard, at
least).

In the DSL effort these are managed by Mahout-math (in-core vector and matrix
implementations and speed of their serialization, respectively), regardless
of the distributed model.

I am not aware of any benchmark of, say, in-core sparse matrix by
in-core sparse matrix multiplication between Mahout-math and h2o (probably
because h2o doesn't have in-core matrices as it stands), but assuming there
were one, the above statement should be corrected in the sense that h2o beats
the dust out of Mahout-math in speed of computation and serialization.

I readily agree with that statement as it stands today.

However, the point is to provide proper decoupling of programming model,
distributed block management, and in-core computation/serialization
concerns. If I have a proper abstraction and decoupling for in-core
operations and serialization, I am free to plug in any in-core math and
serialization thereof, including h2o. Therefore, this becomes a secondary
issue compared to the general architecture.
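
A minimal sketch of the kind of decoupling described above, in Python for
brevity. Every name here (InCoreBackend, PlainListBackend, block_product)
is hypothetical and invented for illustration; none of it is Mahout or h2o
API. The point is only that the code managing distributed blocks sees
bytes plus an interface, so the in-core math and serialization behind that
interface can be swapped.

```python
from abc import ABC, abstractmethod
import json


class InCoreBackend(ABC):
    """Hypothetical plug-in point for in-core math and block serialization."""

    @abstractmethod
    def multiply(self, a, b):
        """Return the matrix product of two in-core blocks."""

    @abstractmethod
    def serialize(self, m) -> bytes:
        """Encode an in-core block for transfer between workers."""

    @abstractmethod
    def deserialize(self, data: bytes):
        """Decode a block previously produced by serialize()."""


class PlainListBackend(InCoreBackend):
    """Toy backend: dense lists of lists, JSON on the wire."""

    def multiply(self, a, b):
        cols_b = list(zip(*b))  # columns of b
        return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
                for row in a]

    def serialize(self, m) -> bytes:
        return json.dumps(m).encode("utf-8")

    def deserialize(self, data: bytes):
        return json.loads(data.decode("utf-8"))


def block_product(backend: InCoreBackend, a_bytes: bytes, b_bytes: bytes) -> bytes:
    """Stand-in for distributed-layer code: it never touches matrix internals."""
    a = backend.deserialize(a_bytes)
    b = backend.deserialize(b_bytes)
    return backend.serialize(backend.multiply(a, b))


backend = PlainListBackend()
out = block_product(backend,
                    backend.serialize([[1, 2], [3, 4]]),
                    backend.serialize([[5, 6], [7, 8]]))
print(backend.deserialize(out))
```

Replacing PlainListBackend with a faster implementation would not require
any change to block_product, which is the sense in which in-core speed
becomes secondary to the overall architecture.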


>
>
>
> On Wed, Apr 30, 2014 at 3:52 PM, Pat Ferrel <pa...@gmail.com> wrote:
>
> > Where is the motivation to integrate now coming from? A sample
> > bit—Kmeans—was integrated with Mahout-ish input? How did that stack up to
> > say MLlib?
> >
> >
> > On Apr 30, 2014, at 2:36 AM, Ellen Friedman <b....@gmail.com>
> > wrote:
> >
> > I am weighing in here on issues of great concern but non-technical.
> >
> > 1. One of the great things about Mahout is the community – not an easy
> > thing to have achieved given that people are dispersed geographically
> > and there is no single focus or company backing the project. In short,
> > the people who make Mahout are doing something cool.
> >
> > Suggestions to try to break it into different groups, Mahout-Spark and
> > Mahout2o, run counter to this success. Why fragment it at exactly the
> > moment when new contributors (from 0xdata) are coming forward ?  The
> > spirit of this project has been inclusive. Let's not  change that now.
> >
> > 2. Sebastian pointed out:
> >
> > "We agreed to give the h2O guys a shot for exploration of a possible
> > integration into Mahout. We should be grateful that they are investing
> > a lot of time into this, and should help whereever we can. Once they
> > come up with a concrete proposal or patch, we will have a look at it,
> > have a deep, technical and polite discussion, and make a decision
> > afterwards."
> >
> > +1
> >
> > We agreed to explore the h2o option. Why use of lots of time and
> > energy in re-visiting and second guessing that decision? Let it go
> > forward, likely some great things will emerge for Mahout, and if not,
> > then we say "thank you" to h2o contributors for giving it a try.
> >
> > As the guys from h2o are adding new resources to do this development,
> > it is not really detracting anything from Mahout's resources except
> > when someone opens one of these discussions that lead to fragmentation
> > and distraction. I'm not a coder and not as technical as any of you,
> > but from my view It seems to be the talk and not the development that
> > is distracting.
> >
> > 3. Over the last year, there has been growing and widespread interest
> > in Mahout from the outside world, and now, with the new changes to
> > support Scala, Spark and h2o (possibly Stratosphere later) the growing
> > interest has turned into excitement. This is a great time for the
> > project – tons of effort but moving toward a big result.
> >
> > Users will have some excellent new choices, all parts of Mahout will
> > benefit. And if in the future it is seen that some of the new features
> > are not being widely or successfully used, they will be deprecated, as
> > was done during the big clean-up of the 0.8 release. New choices, new
> > ways to use Mahout, new people getting involved – this is excellent.
> >
> > 4. My thought is, stick together, embrace change, welcome new comers
> > and be very proud to be building the new Mahout.
> >
> >
> >
> > On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
> > > For reasons of transparency in this discussion, I should add that I am
> a
> > > committer on the upcoming Stratosphere ASF podling, co-worker of the
> > > main developers and have contributed to it as part of my PhD.
> > >
> > > On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
> > >> Anand,
> > >>
> > >> I'm trying to answer some of your questions, and my answers highlight
> > >> the points that I would like to see clarified about h20.
> > >>
> > >> On 04/28/2014 11:13 PM, Anand Avati wrote:
> > >>
> > >>> 1. Why is the DSL claiming to have (in its vision) logical vs
> physical
> > >>> separation if not for providing multiple compute backends?
> > >>
> > >> This is not a claim or a vision, the DSL already has this separation.
> > >> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, thats the logical
> > >> operator for executing a Transpose-Times-Self matrix multiplication.
> In
> > >> o.a.m.sparkbindings.blas.AtA you will find two physical operator
> > >> implementations for that. The choice which one to use depends on
> whether
> > >> there is enough memory to hold certain intermediary results in memory.
> > >>
> > >> The primary intention of a separation into logical and physical
> > >> operators is to allow for a declarative programming style on the users
> > >> side and for an optimizer on the system side which automatically
> chooses
> > >> the optimal physical operator for the execution of a specific program.
> > >>
> > >> This choice of the physical operator might depend on the shape and
> > >> amount of the data processed as well on the underlying available
> > >> resources. *The separation into logical and physical operators clearly
> > >> doesn't imply to have multiple backends*. It only makes it very easy
> to
> > >> support them.
> > >>
> > >>>
> > >>> 2. Does the proposal of having a new DSL backend in the future (for
> e.g
> > >>> stratosphere as suggested elsewhere) make you:
> > >>
> > >>> -- worry that stratosphere would be a dependency to Mahout?
> > >>
> > >> Stratosphere has been accepted as a incubator project in the ASF
> > >> recently, so the worry about such a dependency is naturally less than
> > >> about an externally managed project like h20.
> > >>
> > >>> -- worry that as a user/commiter/contributor you have to worry about
> a
> > >>> new
> > >>> framework?
> > >>
> > >> In my eyes, there is a big difference between Spark/Stratosphere and
> > >> h20. Spark and Stratosphere have a clearly defined programming and
> > >> execution model. They execute programs that are composed of a DAG of
> > >> operators. The set of operators has clearly defined semantics and
> > >> parallelization strategies. If you compare their operators, you will
> > >> find that they offer pretty much the same in lightly different
> flavors.
> > >> For both, there are scientific papers that in detail explain all these
> > >> things.
> > >>
> > >> I have asked about a detailed description of h20's programming model
> and
> > >> execution model and I searched the documentation, but I haven't been
> > >> able to find something that clearly describes how things are done. I
> > >> would love to read up on this, but until I'm presented with this, I
> have
> > >> to assume that such a principled foundation is missing.
> > >>
> > >>
> > >> --sebastian
> > >>
> > >
> > >
> >
> >
>

Re: Straw poll re: H2O ?

Posted by Pat Ferrel <pa...@gmail.com>.

On Apr 30, 2014, at 7:06 AM, Ted Dunning <te...@gmail.com> wrote:

The motivation to contribute comes from h2o.

My motivation to accept comes from the fact that they have machine learning
codes that are as fast as what Google has internally.  They completely
crush all of the Spark efforts on speed.

The sample bit was more to experiment with ways to bring the Mahout frame
of mind together with the h2o implementation capabilities.  The real
contribution will be linked to the DSL work, not the old API.




On Wed, Apr 30, 2014 at 3:52 PM, Pat Ferrel <pa...@gmail.com> wrote:

> Where is the motivation to integrate now coming from? A sample
> bit—Kmeans—was integrated with Mahout-ish input? How did that stack up to
> say MLlib?
> 
> 
> On Apr 30, 2014, at 2:36 AM, Ellen Friedman <b....@gmail.com>
> wrote:
> 
> I am weighing in here on issues of great concern but non-technical.
> 
> 1. One of the great things about Mahout is the community – not an easy
> thing to have achieved given that people are dispersed geographically
> and there is no single focus or company backing the project. In short,
> the people who make Mahout are doing something cool.
> 
> Suggestions to try to break it into different groups, Mahout-Spark and
> Mahout2o, run counter to this success. Why fragment it at exactly the
> moment when new contributors (from 0xdata) are coming forward ?  The
> spirit of this project has been inclusive. Let's not  change that now.
> 
> 2. Sebastian pointed out:
> 
> "We agreed to give the h2O guys a shot for exploration of a possible
> integration into Mahout. We should be grateful that they are investing
> a lot of time into this, and should help whereever we can. Once they
> come up with a concrete proposal or patch, we will have a look at it,
> have a deep, technical and polite discussion, and make a decision
> afterwards."
> 
> +1
> 
> We agreed to explore the h2o option. Why use of lots of time and
> energy in re-visiting and second guessing that decision? Let it go
> forward, likely some great things will emerge for Mahout, and if not,
> then we say "thank you" to h2o contributors for giving it a try.
> 
> As the guys from h2o are adding new resources to do this development,
> it is not really detracting anything from Mahout's resources except
> when someone opens one of these discussions that lead to fragmentation
> and distraction. I'm not a coder and not as technical as any of you,
> but from my view It seems to be the talk and not the development that
> is distracting.
> 
> 3. Over the last year, there has been growing and widespread interest
> in Mahout from the outside world, and now, with the new changes to
> support Scala, Spark and h2o (possibly Stratosphere later) the growing
> interest has turned into excitement. This is a great time for the
> project – tons of effort but moving toward a big result.
> 
> Users will have some excellent new choices, all parts of Mahout will
> benefit. And if in the future it is seen that some of the new features
> are not being widely or successfully used, they will be deprecated, as
> was done during the big clean-up of the 0.8 release. New choices, new
> ways to use Mahout, new people getting involved – this is excellent.
> 
> 4. My thought is, stick together, embrace change, welcome new comers
> and be very proud to be building the new Mahout.
> 
> 
> 
> On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
>> For reasons of transparency in this discussion, I should add that I am a
>> committer on the upcoming Stratosphere ASF podling, co-worker of the
>> main developers and have contributed to it as part of my PhD.
>> 
>> On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
>>> Anand,
>>> 
>>> I'm trying to answer some of your questions, and my answers highlight
>>> the points that I would like to see clarified about h20.
>>> 
>>> On 04/28/2014 11:13 PM, Anand Avati wrote:
>>> 
>>>> 1. Why is the DSL claiming to have (in its vision) logical vs physical
>>>> separation if not for providing multiple compute backends?
>>> 
>>> This is not a claim or a vision, the DSL already has this separation.
>>> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, thats the logical
>>> operator for executing a Transpose-Times-Self matrix multiplication. In
>>> o.a.m.sparkbindings.blas.AtA you will find two physical operator
>>> implementations for that. The choice which one to use depends on whether
>>> there is enough memory to hold certain intermediary results in memory.
>>> 
>>> The primary intention of a separation into logical and physical
>>> operators is to allow for a declarative programming style on the users
>>> side and for an optimizer on the system side which automatically chooses
>>> the optimal physical operator for the execution of a specific program.
>>> 
>>> This choice of the physical operator might depend on the shape and
>>> amount of the data processed as well on the underlying available
>>> resources. *The separation into logical and physical operators clearly
>>> doesn't imply to have multiple backends*. It only makes it very easy to
>>> support them.
>>> 
>>>> 
>>>> 2. Does the proposal of having a new DSL backend in the future (for e.g
>>>> stratosphere as suggested elsewhere) make you:
>>> 
>>>> -- worry that stratosphere would be a dependency to Mahout?
>>> 
>>> Stratosphere has been accepted as a incubator project in the ASF
>>> recently, so the worry about such a dependency is naturally less than
>>> about an externally managed project like h20.
>>> 
>>>> -- worry that as a user/commiter/contributor you have to worry about a
>>>> new
>>>> framework?
>>> 
>>> In my eyes, there is a big difference between Spark/Stratosphere and
>>> h20. Spark and Stratosphere have a clearly defined programming and
>>> execution model. They execute programs that are composed of a DAG of
>>> operators. The set of operators has clearly defined semantics and
>>> parallelization strategies. If you compare their operators, you will
>>> find that they offer pretty much the same in lightly different flavors.
>>> For both, there are scientific papers that in detail explain all these
>>> things.
>>> 
>>> I have asked about a detailed description of h20's programming model and
>>> execution model and I searched the documentation, but I haven't been
>>> able to find something that clearly describes how things are done. I
>>> would love to read up on this, but until I'm presented with this, I have
>>> to assume that such a principled foundation is missing.
>>> 
>>> 
>>> --sebastian
>>> 
>> 
>> 
> 
>

Re: Straw poll re: H2O ?

Posted by Ted Dunning <te...@gmail.com>.

The motivation to contribute comes from h2o.

My motivation to accept comes from the fact that they have machine learning
codes that are as fast as what Google has internally.  They completely
crush all of the Spark efforts on speed.

The sample bit was more to experiment with ways to bring the Mahout frame
of mind together with the h2o implementation capabilities.  The real
contribution will be linked to the DSL work, not the old API.




On Wed, Apr 30, 2014 at 3:52 PM, Pat Ferrel <pa...@gmail.com> wrote:

> Where is the motivation to integrate now coming from? A sample
> bit—Kmeans—was integrated with Mahout-ish input? How did that stack up to
> say MLlib?
>
>
> On Apr 30, 2014, at 2:36 AM, Ellen Friedman <b....@gmail.com>
> wrote:
>
> I am weighing in here on issues of great concern but non-technical.
>
> 1. One of the great things about Mahout is the community – not an easy
> thing to have achieved given that people are dispersed geographically
> and there is no single focus or company backing the project. In short,
> the people who make Mahout are doing something cool.
>
> Suggestions to try to break it into different groups, Mahout-Spark and
> Mahout2o, run counter to this success. Why fragment it at exactly the
> moment when new contributors (from 0xdata) are coming forward ?  The
> spirit of this project has been inclusive. Let's not  change that now.
>
> 2. Sebastian pointed out:
>
> "We agreed to give the h2O guys a shot for exploration of a possible
> integration into Mahout. We should be grateful that they are investing
> a lot of time into this, and should help whereever we can. Once they
> come up with a concrete proposal or patch, we will have a look at it,
> have a deep, technical and polite discussion, and make a decision
> afterwards."
>
> +1
>
> We agreed to explore the h2o option. Why use of lots of time and
> energy in re-visiting and second guessing that decision? Let it go
> forward, likely some great things will emerge for Mahout, and if not,
> then we say "thank you" to h2o contributors for giving it a try.
>
> As the guys from h2o are adding new resources to do this development,
> it is not really detracting anything from Mahout's resources except
> when someone opens one of these discussions that lead to fragmentation
> and distraction. I'm not a coder and not as technical as any of you,
> but from my view It seems to be the talk and not the development that
> is distracting.
>
> 3. Over the last year, there has been growing and widespread interest
> in Mahout from the outside world, and now, with the new changes to
> support Scala, Spark and h2o (possibly Stratosphere later) the growing
> interest has turned into excitement. This is a great time for the
> project – tons of effort but moving toward a big result.
>
> Users will have some excellent new choices, all parts of Mahout will
> benefit. And if in the future it is seen that some of the new features
> are not being widely or successfully used, they will be deprecated, as
> was done during the big clean-up of the 0.8 release. New choices, new
> ways to use Mahout, new people getting involved – this is excellent.
>
> 4. My thought is, stick together, embrace change, welcome new comers
> and be very proud to be building the new Mahout.
>
>
>
> On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
> > For reasons of transparency in this discussion, I should add that I am a
> > committer on the upcoming Stratosphere ASF podling, co-worker of the
> > main developers and have contributed to it as part of my PhD.
> >
> > On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
> >> Anand,
> >>
> >> I'm trying to answer some of your questions, and my answers highlight
> >> the points that I would like to see clarified about h20.
> >>
> >> On 04/28/2014 11:13 PM, Anand Avati wrote:
> >>
> >>> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> >>> separation if not for providing multiple compute backends?
> >>
> >> This is not a claim or a vision, the DSL already has this separation.
> >> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, thats the logical
> >> operator for executing a Transpose-Times-Self matrix multiplication. In
> >> o.a.m.sparkbindings.blas.AtA you will find two physical operator
> >> implementations for that. The choice which one to use depends on whether
> >> there is enough memory to hold certain intermediary results in memory.
> >>
> >> The primary intention of a separation into logical and physical
> >> operators is to allow for a declarative programming style on the users
> >> side and for an optimizer on the system side which automatically chooses
> >> the optimal physical operator for the execution of a specific program.
> >>
> >> This choice of the physical operator might depend on the shape and
> >> amount of the data processed as well on the underlying available
> >> resources. *The separation into logical and physical operators clearly
> >> doesn't imply to have multiple backends*. It only makes it very easy to
> >> support them.
> >>
> >>>
> >>> 2. Does the proposal of having a new DSL backend in the future (for e.g
> >>> stratosphere as suggested elsewhere) make you:
> >>
> >>> -- worry that stratosphere would be a dependency to Mahout?
> >>
> >> Stratosphere has been accepted as a incubator project in the ASF
> >> recently, so the worry about such a dependency is naturally less than
> >> about an externally managed project like h20.
> >>
> >>> -- worry that as a user/commiter/contributor you have to worry about a
> >>> new
> >>> framework?
> >>
> >> In my eyes, there is a big difference between Spark/Stratosphere and
> >> h20. Spark and Stratosphere have a clearly defined programming and
> >> execution model. They execute programs that are composed of a DAG of
> >> operators. The set of operators has clearly defined semantics and
> >> parallelization strategies. If you compare their operators, you will
> >> find that they offer pretty much the same in lightly different flavors.
> >> For both, there are scientific papers that in detail explain all these
> >> things.
> >>
> >> I have asked about a detailed description of h20's programming model and
> >> execution model and I searched the documentation, but I haven't been
> >> able to find something that clearly describes how things are done. I
> >> would love to read up on this, but until I'm presented with this, I have
> >> to assume that such a principled foundation is missing.
> >>
> >>
> >> --sebastian
> >>
> >
> >
>
>

Re: Straw poll re: H2O ?

Posted by Pat Ferrel <pa...@gmail.com>.

Where is the motivation to integrate now coming from? A sample bit (k-means) was integrated with Mahout-ish input? How did that stack up against, say, MLlib?

On Apr 30, 2014, at 2:36 AM, Ellen Friedman <b....@gmail.com> wrote:

I am weighing in here on issues of great concern but non-technical.

1. One of the great things about Mahout is the community – not an easy
thing to have achieved given that people are dispersed geographically
and there is no single focus or company backing the project. In short,
the people who make Mahout are doing something cool.

Suggestions to try to break it into different groups, Mahout-Spark and
Mahout2o, run counter to this success. Why fragment it at exactly the
moment when new contributors (from 0xdata) are coming forward?  The
spirit of this project has been inclusive. Let's not change that now.

2. Sebastian pointed out:

"We agreed to give the h2O guys a shot for exploration of a possible
integration into Mahout. We should be grateful that they are investing
a lot of time into this, and should help wherever we can. Once they
come up with a concrete proposal or patch, we will have a look at it,
have a deep, technical and polite discussion, and make a decision
afterwards."

+1

We agreed to explore the h2o option. Why use lots of time and
energy in re-visiting and second-guessing that decision? Let it go
forward, likely some great things will emerge for Mahout, and if not,
then we say "thank you" to h2o contributors for giving it a try.

As the guys from h2o are adding new resources to do this development,
it is not really detracting anything from Mahout's resources except
when someone opens one of these discussions that lead to fragmentation
and distraction. I'm not a coder and not as technical as any of you,
but from my view it seems to be the talk and not the development that
is distracting.

3. Over the last year, there has been growing and widespread interest
in Mahout from the outside world, and now, with the new changes to
support Scala, Spark and h2o (possibly Stratosphere later) the growing
interest has turned into excitement. This is a great time for the
project – tons of effort but moving toward a big result.

Users will have some excellent new choices, all parts of Mahout will
benefit. And if in the future it is seen that some of the new features
are not being widely or successfully used, they will be deprecated, as
was done during the big clean-up of the 0.8 release. New choices, new
ways to use Mahout, new people getting involved – this is excellent.

4. My thought is, stick together, embrace change, welcome newcomers
and be very proud to be building the new Mahout.

On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
> For reasons of transparency in this discussion, I should add that I am a
> committer on the upcoming Stratosphere ASF podling, co-worker of the
> main developers and have contributed to it as part of my PhD.
> 
> On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
>> Anand,
>> 
>> I'm trying to answer some of your questions, and my answers highlight
>> the points that I would like to see clarified about h20.
>> 
>> On 04/28/2014 11:13 PM, Anand Avati wrote:
>> 
>>> 1. Why is the DSL claiming to have (in its vision) logical vs physical
>>> separation if not for providing multiple compute backends?
>> 
>> This is not a claim or a vision, the DSL already has this separation.
>> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, thats the logical
>> operator for executing a Transpose-Times-Self matrix multiplication. In
>> o.a.m.sparkbindings.blas.AtA you will find two physical operator
>> implementations for that. The choice which one to use depends on whether
>> there is enough memory to hold certain intermediary results in memory.
>> 
>> The primary intention of a separation into logical and physical
>> operators is to allow for a declarative programming style on the users
>> side and for an optimizer on the system side which automatically chooses
>> the optimal physical operator for the execution of a specific program.
>> 
>> This choice of the physical operator might depend on the shape and
>> amount of the data processed as well on the underlying available
>> resources. *The separation into logical and physical operators clearly
>> doesn't imply to have multiple backends*. It only makes it very easy to
>> support them.
>> 
>>> 
>>> 2. Does the proposal of having a new DSL backend in the future (for e.g
>>> stratosphere as suggested elsewhere) make you:
>> 
>>> -- worry that stratosphere would be a dependency to Mahout?
>> 
>> Stratosphere has been accepted as a incubator project in the ASF
>> recently, so the worry about such a dependency is naturally less than
>> about an externally managed project like h20.
>> 
>>> -- worry that as a user/commiter/contributor you have to worry about a
>>> new
>>> framework?
>> 
>> In my eyes, there is a big difference between Spark/Stratosphere and
>> h20. Spark and Stratosphere have a clearly defined programming and
>> execution model. They execute programs that are composed of a DAG of
>> operators. The set of operators has clearly defined semantics and
>> parallelization strategies. If you compare their operators, you will
>> find that they offer pretty much the same in lightly different flavors.
>> For both, there are scientific papers that in detail explain all these
>> things.
>> 
>> I have asked about a detailed description of h20's programming model and
>> execution model and I searched the documentation, but I haven't been
>> able to find something that clearly describes how things are done. I
>> would love to read up on this, but until I'm presented with this, I have
>> to assume that such a principled foundation is missing.
>> 
>> 
>> --sebastian
>> 
> 
>

Re: Straw poll re: H2O ?

Posted by Ellen Friedman <b....@gmail.com>.

I am weighing in here on issues that are of great concern but non-technical.

1. One of the great things about Mahout is the community – not an easy
thing to have achieved given that people are dispersed geographically
and there is no single focus or company backing the project. In short,
the people who make Mahout are doing something cool.

Suggestions to try to break it into different groups, Mahout-Spark and
Mahout2o, run counter to this success. Why fragment it at exactly the
moment when new contributors (from 0xdata) are coming forward? The
spirit of this project has been inclusive. Let's not change that now.

2. Sebastian pointed out:

"We agreed to give the h2O guys a shot for exploration of a possible
integration into Mahout. We should be grateful that they are investing
a lot of time into this, and should help whereever we can. Once they
come up with a concrete proposal or patch, we will have a look at it,
have a deep, technical and polite discussion, and make a decision
afterwards."

+1

We agreed to explore the h2o option. Why use lots of time and
energy revisiting and second-guessing that decision? Let it go
forward; likely some great things will emerge for Mahout, and if not,
then we say "thank you" to the h2o contributors for giving it a try.

As the guys from h2o are adding new resources to do this development,
it is not really detracting anything from Mahout's resources except
when someone opens one of these discussions that lead to fragmentation
and distraction. I'm not a coder and not as technical as any of you,
but from my view it seems to be the talk and not the development that
is distracting.

3. Over the last year, there has been growing and widespread interest
in Mahout from the outside world, and now, with the new changes to
support Scala, Spark and h2o (possibly Stratosphere later) the growing
interest has turned into excitement. This is a great time for the
project – tons of effort but moving toward a big result.

Users will have some excellent new choices, and all parts of Mahout will
benefit. And if in the future it is seen that some of the new features
are not being widely or successfully used, they will be deprecated, as
was done during the big clean-up of the 0.8 release. New choices, new
ways to use Mahout, new people getting involved – this is excellent.

4. My thought is: stick together, embrace change, welcome newcomers
and be very proud to be building the new Mahout.

On 4/29/14, Sebastian Schelter <ss...@apache.org> wrote:
> For reasons of transparency in this discussion, I should add that I am a
> committer on the upcoming Stratosphere ASF podling, co-worker of the
> main developers and have contributed to it as part of my PhD.
>
> On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
>> Anand,
>>
>> I'm trying to answer some of your questions, and my answers highlight
>> the points that I would like to see clarified about h20.
>>
>> On 04/28/2014 11:13 PM, Anand Avati wrote:
>>
>>> 1. Why is the DSL claiming to have (in its vision) logical vs physical
>>> separation if not for providing multiple compute backends?
>>
>> This is not a claim or a vision, the DSL already has this separation.
>> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, thats the logical
>> operator for executing a Transpose-Times-Self matrix multiplication. In
>> o.a.m.sparkbindings.blas.AtA you will find two physical operator
>> implementations for that. The choice which one to use depends on whether
>> there is enough memory to hold certain intermediary results in memory.
>>
>> The primary intention of a separation into logical and physical
>> operators is to allow for a declarative programming style on the users
>> side and for an optimizer on the system side which automatically chooses
>> the optimal physical operator for the execution of a specific program.
>>
>> This choice of the physical operator might depend on the shape and
>> amount of the data processed as well on the underlying available
>> resources. *The separation into logical and physical operators clearly
>> doesn't imply to have multiple backends*. It only makes it very easy to
>> support them.
>>
>>>
>>> 2. Does the proposal of having a new DSL backend in the future (for e.g
>>> stratosphere as suggested elsewhere) make you:
>>
>>> -- worry that stratosphere would be a dependency to Mahout?
>>
>> Stratosphere has been accepted as a incubator project in the ASF
>> recently, so the worry about such a dependency is naturally less than
>> about an externally managed project like h20.
>>
>>> -- worry that as a user/commiter/contributor you have to worry about a
>>> new
>>> framework?
>>
>> In my eyes, there is a big difference between Spark/Stratosphere and
>> h20. Spark and Stratosphere have a clearly defined programming and
>> execution model. They execute programs that are composed of a DAG of
>> operators. The set of operators has clearly defined semantics and
>> parallelization strategies. If you compare their operators, you will
>> find that they offer pretty much the same in lightly different flavors.
>> For both, there are scientific papers that in detail explain all these
>> things.
>>
>> I have asked about a detailed description of h20's programming model and
>> execution model and I searched the documentation, but I haven't been
>> able to find something that clearly describes how things are done. I
>> would love to read up on this, but until I'm presented with this, I have
>> to assume that such a principled foundation is missing.
>>
>>
>> --sebastian
>>
>
>

Re: Straw poll re: H2O ?

Posted by Sebastian Schelter <ss...@apache.org>.

For reasons of transparency in this discussion, I should add that I am a
committer on the upcoming Stratosphere ASF podling, a co-worker of its
main developers, and have contributed to it as part of my PhD.

On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
> Anand,
>
> I'm trying to answer some of your questions, and my answers highlight
> the points that I would like to see clarified about h20.
>
> On 04/28/2014 11:13 PM, Anand Avati wrote:
>
>> 1. Why is the DSL claiming to have (in its vision) logical vs physical
>> separation if not for providing multiple compute backends?
>
> This is not a claim or a vision, the DSL already has this separation.
> Take for example o.a.m.sparkbindings.drm.plan.OpAtA, thats the logical
> operator for executing a Transpose-Times-Self matrix multiplication. In
> o.a.m.sparkbindings.blas.AtA you will find two physical operator
> implementations for that. The choice which one to use depends on whether
> there is enough memory to hold certain intermediary results in memory.
>
> The primary intention of a separation into logical and physical
> operators is to allow for a declarative programming style on the users
> side and for an optimizer on the system side which automatically chooses
> the optimal physical operator for the execution of a specific program.
>
> This choice of the physical operator might depend on the shape and
> amount of the data processed as well on the underlying available
> resources. *The separation into logical and physical operators clearly
> doesn't imply to have multiple backends*. It only makes it very easy to
> support them.
>
>>
>> 2. Does the proposal of having a new DSL backend in the future (for e.g
>> stratosphere as suggested elsewhere) make you:
>
>> -- worry that stratosphere would be a dependency to Mahout?
>
> Stratosphere has been accepted as a incubator project in the ASF
> recently, so the worry about such a dependency is naturally less than
> about an externally managed project like h20.
>
>> -- worry that as a user/commiter/contributor you have to worry about a
>> new
>> framework?
>
> In my eyes, there is a big difference between Spark/Stratosphere and
> h20. Spark and Stratosphere have a clearly defined programming and
> execution model. They execute programs that are composed of a DAG of
> operators. The set of operators has clearly defined semantics and
> parallelization strategies. If you compare their operators, you will
> find that they offer pretty much the same in lightly different flavors.
> For both, there are scientific papers that in detail explain all these
> things.
>
> I have asked about a detailed description of h20's programming model and
> execution model and I searched the documentation, but I haven't been
> able to find something that clearly describes how things are done. I
> would love to read up on this, but until I'm presented with this, I have
> to assume that such a principled foundation is missing.
>
>
> --sebastian
>

Re: Straw poll re: H2O ?

Posted by Sebastian Schelter <ss...@apache.org>.

Anand,

I'm trying to answer some of your questions, and my answers highlight
the points that I would like to see clarified about h2o.

On 04/28/2014 11:13 PM, Anand Avati wrote:

> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> separation if not for providing multiple compute backends?

This is not a claim or a vision; the DSL already has this separation.
Take for example o.a.m.sparkbindings.drm.plan.OpAtA: that's the logical
operator for executing a Transpose-Times-Self matrix multiplication. In
o.a.m.sparkbindings.blas.AtA you will find two physical operator
implementations for that. The choice of which one to use depends on whether
there is enough memory to hold certain intermediate results.

The primary intention of a separation into logical and physical
operators is to allow for a declarative programming style on the user's
side and for an optimizer on the system side which automatically chooses
the optimal physical operator for the execution of a specific program.

This choice of the physical operator might depend on the shape and
amount of the data processed as well as on the underlying available
resources. *The separation into logical and physical operators clearly
doesn't imply having multiple backends*. It only makes it very easy to
support them.
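
To make the distinction concrete, here is a minimal sketch in plain Python
of one logical operator with two interchangeable physical implementations
and a toy optimizer that picks between them. It is not Mahout's code: only
the name OpAtA comes from the DSL, while ata_in_memory, ata_row_at_a_time,
choose_physical and the memory_budget_cells parameter are hypothetical names
invented for illustration, and the selection rule is an assumed stand-in for
the "is there enough memory" check described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpAtA:
    """Logical operator: describes A' * A for a matrix of the given shape."""
    n_rows: int
    n_cols: int


def ata_in_memory(a):
    """Physical operator 1 (hypothetical): keep the full n x n accumulator
    in memory and add the outer product of every input row to it."""
    n = len(a[0])
    result = [[0.0] * n for _ in range(n)]
    for row in a:
        for i in range(n):
            for j in range(n):
                result[i][j] += row[i] * row[j]
    return result


def ata_row_at_a_time(a):
    """Physical operator 2 (hypothetical): produce one row of the result at
    a time, re-scanning the input instead of holding a full accumulator."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]


def choose_physical(op, memory_budget_cells):
    """Toy optimizer: use the in-memory variant only if the n_cols x n_cols
    intermediate result fits into the (assumed) memory budget."""
    if op.n_cols * op.n_cols <= memory_budget_cells:
        return ata_in_memory
    return ata_row_at_a_time


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    op = OpAtA(n_rows=2, n_cols=2)
    physical = choose_physical(op, memory_budget_cells=4)
    print(physical.__name__, physical(a))
```

The caller only ever states the logical operation; which physical operator
runs is decided by the system. Nothing in this sketch needs a second backend,
but a backend could plug in at the same point by supplying its own physical
operators.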

>
> 2. Does the proposal of having a new DSL backend in the future (for e.g
> stratosphere as suggested elsewhere) make you:

> -- worry that stratosphere would be a dependency to Mahout?

Stratosphere has been accepted as an incubator project in the ASF
recently, so the worry about such a dependency is naturally less than
about an externally managed project like h2o.

> -- worry that as a user/commiter/contributor you have to worry about a new
> framework?

In my eyes, there is a big difference between Spark/Stratosphere and
h2o. Spark and Stratosphere have a clearly defined programming and
execution model. They execute programs that are composed of a DAG of
operators. The set of operators has clearly defined semantics and
parallelization strategies. If you compare their operators, you will
find that they offer pretty much the same in slightly different flavors.
For both, there are scientific papers that explain all these things in
detail.

I have asked for a detailed description of h2o's programming model and
execution model, and I searched the documentation, but I haven't been
able to find anything that clearly describes how things are done. I
would love to read up on this, but until I'm presented with it, I have
to assume that such a principled foundation is missing.

--sebastian

RE: Straw poll re: H2O ?

Posted by Saikat Kanjilal <sx...@hotmail.com>.

My main question at this point would be (and I realize this discussion has been had in multiple places, in parts, but not in enough detail): given that the user is staying in DSL land, when would I use one backend versus another, and how easy is it to get up and running with one backend versus the other? Keenly following all discussions in the meantime.

> Date: Mon, 28 Apr 2014 14:13:35 -0700
> Subject: Re: Straw poll re: H2O ?
> From: avati@gluster.org
> To: dev@mahout.apache.org
> CC: ssc@apache.org
> 
> Saikat, Pat,
> 
> For background, please refer to the "Mahout DSL vs Spark" discussion the
> for the general direction in which the integration is being explored. With
> that background, I would like to present some counter questions:
> 
> 1. Why is the DSL claiming to have (in its vision) logical vs physical
> separation if not for providing multiple compute backends?
> 
> 2. Does the proposal of having a new DSL backend in the future (for e.g
> stratosphere as suggested elsewhere) make you:
> -- propose mahout-stratosphere as a different top level project?
> -- worry that stratosphere would be a dependency to Mahout?
> -- worry that you won't be able to say "Future of Mahout is Spark .. but it
> also supports stratosphere"?
> -- worry that as a user/commiter/contributor you have to worry about a new
> framework?
> -- resist having a DSL backend for stratosphere because Hadoop vendors may
> not support it?
> 
> Obviously no, since they are all just different DSL backends.
> 
> Have you guys embraced the idea that the DSL allows for multiple backends
> (Spark being the first to get implemented)? or Not? Hence I do not
> understand the "problem" here.
> 
> Thanks
> 
> On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <sx...@hotmail.com>wrote:
> 
> > I would echo Pat's sentiments spot on related to the goal of supporting
> > both spark and H2O confusing folks that are interested in using, committing
> > to and trying to understand where Mahout is headed small to medium term.
> > I hate to throw this out but given the amount of "sometimes not so nice
> > back and forths I've seen on issue 1500" I really wonder whether we should
> > have mahout-spark and mahout-h2o as two different top level projects
> > potentially supporting a different set of algorithms underneath, yes I know
> > tieing mahout to a particular technology goes against the initial vision
> > but given the churn I'm seeing I'm not sure I understand what the current
> > vision even is :)
> >
> > > Subject: Re: Straw poll re: H2O ?
> > > From: pat@occamsmachete.com
> > > Date: Mon, 28 Apr 2014 13:17:03 -0700
> > > CC: ssc@apache.org
> > > To: dev@mahout.apache.org
> > >
> > > I haven’t heard a good explanation of what this project is. There should
> > be some small step like implementing an algo on h2o to takes the same input
> > as a current Hadoop Mahout job and produce the same result or do one not
> > already in Mahout. At least it will answer some technical questions and
> > shouldn’t take a lot of support from current committers to produce.
> > >
> > > I’m still not convinced that this is the primary thing that should drive
> > making it a Mahout dependency.
> > >
> > > I’m highly dubious of actively supporting and working on Mahout for
> > Spark and h2o. Not for technical reasons but because rebooting Mahout on
> > two platforms seems a non-starter. No project manager in the commercial
> > world would allow that sort of thing. And rightly so, it confuses users,
> > committers, contributors. You shouldn’t have a great deal of redundancy or
> > competing efforts _inside_ a project even an open source one. That’s for
> > separate projects and the incubator, right? There are plenty of examples of
> > going that route, Spark itself is redundant with Hadoop in many ways. Would
> > Apache accept h2o as a parallel project to Spark, if so why not do that?
> > >
> > > Question: Where do we (Mahout user, committer, contributor) invest
> > extremely precious time learning new languages, frameworks, architecture,
> > configurations, optimizations?
> > >
> > > Answer: Many will simply not choose but wait and see, or go elsewhere.
> > >
> > > Why? Because we fail to communicate “the future of Mahout is Spark
> > first—period” It keeps coming out "Spark and, well, h2o too”
> > >
> > > That is a momentum killer.  If we’re agreed on “Spark first” then
> > there’s no need to incubate Mahout 2, Spark and Mahout have already gone
> > through that and though Dmitriy’s DSL and Scala shell work is entirely new,
> > to the end user the jobs, input and output, and functionality will look
> > like a v2. People dealing with internals will see a different world but
> > they should be a minority of users and will hopefully like what they see.
> > >
> > >
> > > Somewhat off subject notes on external politics:
> > >
> > > We really need to make sure Mahout stays in all the big distros. That
> > means Sebastian’s comments are spot on: "The best way to help Mahout is to
> > pick up some of the work that needs to be done with regards to
> > documentation, examples, Hadoop 2 compatibility and designing the future,
> > especially with regards to dataframes”  All the distros are hadoop 2.
> > >
> > > Incubating Mahout 2 as another project is surely a way out of the
> > distros, another momentum killer.
> > >
> > > Another political question is whether an h2o dependency would be an
> > issue to the distros. If we are going to put big efforts into h2o let’s see
> > how that plays out first. Spark is already supported by them, even
> > Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
> > distro will be asked to support it, that’s what they are paid for. Do they
> > want to support h2o? I have no idea how they would react to that but it
> > affects Mahout.
> > >
> > >
> > > For all these reasons I’d be -1 to any big-bang integration.
> > >
> > >
> > > > On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> > > >
> > > > +1. I don't think anyone said anything, privately or publicly, about
> > h20
> > > > integration being a bad idea. It's just there's more than one way to
> > do it,
> > > > so debate is focusing on exploration of pluses and minuses of each
> > > > individual proposal (as they come to light). Part of difficulty here
> > was
> > > > that the expertise intersection of all parts being connected and
> > integrated
> > > > has been pretty poor on individual basis. So we have to go by scenarios
> > > > where a group of specialized experts tries to figure out the solution.
> > > >
> > > > w.r.t to incubation proposals, it seems dubious for a number reasons.
> > > >
> > > > Reason 1 is that these projects are the primary factor moving Mahout
> > > > anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
> > > > frankly not much left in Mahout, so it is reflection of more or less
> > common
> > > > opinion that the project would just spiral down on its own if the
> > things
> > > > stay status-quo.
> > > >
> > > > Reason 2 is that there are good (not irreplaceable, but good)
> > components in
> > > > Mahout that these efforts depend on. Therefore, incubation would be
> > faced
> > > > with a perspective of having dependencies on project that on its own is
> > > > winding down. Not good for incubation side.
> > > >
> > > > Reason 3 is that current effort is (IMO) minimalistic enough not to
> > warrant
> > > > a new project. It simply doesn't, and can't have the scale of things
> > like
> > > > Spark or Hadoop eco. There would be just not enough substance for a new
> > > > project at this point. I don't feel very strong about this point
> > though.
> > > >
> > > >
> > > > On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org>
> > wrote:
> > > >
> > > >> We all should calm down here and remind ourselves why we are doing
> > this
> > > >> whole thing: Because we love open source and want to have a vibrant
> > > >> community and a great piece of software.
> > > >>
> > > >> Mahout has come a long way and is at a crossroads right now, so its
> > only
> > > >> natural that there are heated discussions. But, we should immediately
> > stop
> > > >> the fingerpointing and related stuff, we have managed to avoid this
> > since
> > > >> Mahout's inception and we should continue to do so.
> > > >>
> > > >> The best way to help Mahout is to pick up some of the work that needs
> > to
> > > >> be done with regards to documentation, examples, Hadoop 2
> > compatibility and
> > > >> designing the future, especially with regards to dataframes e.g.
> > > >>
> > > >> We agreed to give the h2O guys a shot for exploration of a possible
> > > >> integration into Mahout. We should be grateful that they are
> > investing a
> > > >> lot of time into this, and should help whereever we can. Once they
> > come up
> > > >> with a concrete proposal or patch, we will have a look at it, have a
> > deep,
> > > >> technical and polite discussion, and make a decision afterwards.
> > > >>
> > > >> --sebastian
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On 04/28/2014 07:42 PM, Anand Avati wrote:
> > > >>
> > > >>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
> > > >>>
> > > >>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
> > > >>>> <ji...@apache.org> wrote:
> > > >>>>
> > > >>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly
> > what
> > > >>>>>
> > > >>>> is encouraging the h2o work to be done a bit apart. It simply isn't
> > > >>>> efficient to have to answer so many off-topic points whenever any
> > reports
> > > >>>> on work in progress are given.
> > > >>>>
> > > >>>>>
> > > >>>>> I think this has been the off-topic here.
> > > >>>>>
> > > >>>>> Calling my comments "emotional" or "non-technical", or _loosely_
> > > >>>>>
> > > >>>> paraphrasing me.
> > > >>>>
> > > >>>> Yes, the personal finger-pointing parts don't belong and don't
> > > >>>> convince anyone, let's skip those.
> > > >>>>
> > > >>>>
> > > >>> +1. Let's skip those.
> > > >>>
> > > >>>
> > > >>> From the sidelines, I see a bunch of work intended for Mahout
> > > >>>
> > > >>>> proceeding outside the community such as it is, and even Apache. Of
> > > >>>> course, contributions are always prepped externally to some degree.
> > I
> > > >>>> create, debug, change patches before posting them, maybe checking in
> > > >>>> early on choices that others may want input on.
> > > >>>>
> > > >>>> This is a large-ish change being proposed, IIUC. I can see one
> > person
> > > >>>> who publicly, and at least two who privately, have clear
> > reservations
> > > >>>> about this direction.
> > > >>>>
> > > >>>
> > > >>>
> > > >>> It will probably be a large-ish change, indeed. But my personal take
> > is
> > > >>> that, non-technical aspects of the debate is unfortunately taking
> > > >>> precedence over real technical parts. Please refer to email thread
> > "Mahout
> > > >>> DSL vs Spark".
> > > >>>
> > > >>>
> > > >>>
> > > >>> It certainly appears funny vis-a-vis the "Apache
> > > >>>> way" to work on a contribution *because* one (or more) other
> > > >>>> committers aren't convinced.
> > > >>>>
> > > >>>>
> > > >>> As mentioned in the referred email thread, a lot of the technical
> > issues
> > > >>> which got addressed in the work which was carried out outside of
> > Apache,
> > > >>> was really sorting out and highlighting build and classloader related
> > > >>> challenges on the H2O side. There was little motivation to carry out
> > those
> > > >>> discussions on the Mahout lists as it was really ~99% H2O specific
> > > >>> discussions and noise/spam to the Mahout community.
> > > >>>
> > > >>> I don't think that's important to dither about. What is, is this: if
> > a
> > > >>>
> > > >>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
> > > >>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
> > > >>>> yet, but it seems like a quite possible outcome.
> > > >>>>
> > > >>>>
> > > >>> As an outsider, my opinion is that the proposed need for a VOTE is a
> > > >>> largely masqueraded problem built around the perception of
> > disagreement
> > > >>> over something vague, abstract and inaccurate. And therefore
> > premature.
> > > >>> That being said the PMC may vote on any issues/non-issues it may
> > please.
> > > >>>
> > > >>> Would be a shame to do a lot of work, intending it for a commit, and
> > > >>>
> > > >>>> then find there is not consensus.
> > > >>>>
> > > >>>>
> > > >>> Exactly the kind of inaccurate perception I meant. While we are (at
> > least
> > > >>> I
> > > >>> am) exploring the best fit model for integration, and exploration by
> > > >>> definition involves taking potentially wrong steps and backtracking
> > if
> > > >>> necessary, the perception unfortunately seems to be that the proposed
> > > >>> intermediate (potentially wrong) steps are some kind of pre-decided
> > plan
> > > >>> of
> > > >>> action. So, no, there WOULDN'T be a lot of work intended for a commit
> > > >>> against consensus.
> > > >>>
> > > >>> So is it better to figure out earlier than later whether these 2+
> > > >>>
> > > >>>> parallel tracks have enough commonality to coexist?
> > > >>>>
> > > >>>
> > > >>>
> > > >>> Whether two parallel tracks (I assume the spark track and the H2O
> > track?)
> > > >>> have enough commonality to exist - one way you surely cannot get the
> > right
> > > >>> answer for this (except by co-incidence) is by taking a vote from a
> > group
> > > >>> who are experts in only either one of those tracks. From what I see,
> > most
> > > >>> of the opposition has been due to a combination of lack of
> > understanding
> > > >>> of
> > > >>> H2O and (welcome) skepticism. If, as a contributor, I find there is
> > no
> > > >>> natural or beneficial way to co-exist with Spark, I wouldn't waste
> > my time
> > > >>> writing code, and for sure am not dependent on another group's vote
> > to
> > > >>> make
> > > >>> that decision for me.
> > > >>>
> > > >>> Avati
> > > >>>
> > > >>>
> > > >>
> > > >
> >
> >

Re: Straw poll re: H2O ?

Posted by Anand Avati <av...@gluster.org>.

Saikat, Pat,

For background, please refer to the "Mahout DSL vs Spark" discussion
for the general direction in which the integration is being explored. With
that background, I would like to present some counter questions:

1. Why is the DSL claiming to have (in its vision) logical vs physical
separation if not for providing multiple compute backends?

2. Does the proposal of having a new DSL backend in the future (e.g.
stratosphere, as suggested elsewhere) make you:
-- propose mahout-stratosphere as a different top level project?
-- worry that stratosphere would be a dependency to Mahout?
-- worry that you won't be able to say "Future of Mahout is Spark .. but it
also supports stratosphere"?
-- worry that as a user/committer/contributor you have to worry about a new
framework?
-- resist having a DSL backend for stratosphere because Hadoop vendors may
not support it?

Obviously no, since they are all just different DSL backends.

Have you guys embraced the idea that the DSL allows for multiple backends
(Spark being the first to get implemented)? Or not? If you have, I do not
understand the "problem" here.

Thanks

On Mon, Apr 28, 2014 at 1:29 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:

> I would echo Pat's sentiments spot on related to the goal of supporting
> both spark and H2O confusing folks that are interested in using, committing
> to and trying to understand where Mahout is headed small to medium term.
> I hate to throw this out but given the amount of "sometimes not so nice
> back and forths I've seen on issue 1500" I really wonder whether we should
> have mahout-spark and mahout-h2o as two different top level projects
> potentially supporting a different set of algorithms underneath, yes I know
> tieing mahout to a particular technology goes against the initial vision
> but given the churn I'm seeing I'm not sure I understand what the current
> vision even is :)
>
> > Subject: Re: Straw poll re: H2O ?
> > From: pat@occamsmachete.com
> > Date: Mon, 28 Apr 2014 13:17:03 -0700
> > CC: ssc@apache.org
> > To: dev@mahout.apache.org
> >
> > I haven’t heard a good explanation of what this project is. There should
> be some small step like implementing an algo on h2o to takes the same input
> as a current Hadoop Mahout job and produce the same result or do one not
> already in Mahout. At least it will answer some technical questions and
> shouldn’t take a lot of support from current committers to produce.
> >
> > I’m still not convinced that this is the primary thing that should drive
> making it a Mahout dependency.
> >
> > I’m highly dubious of actively supporting and working on Mahout for
> Spark and h2o. Not for technical reasons but because rebooting Mahout on
> two platforms seems a non-starter. No project manager in the commercial
> world would allow that sort of thing. And rightly so, it confuses users,
> committers, contributors. You shouldn’t have a great deal of redundancy or
> competing efforts _inside_ a project even an open source one. That’s for
> separate projects and the incubator, right? There are plenty of examples of
> going that route, Spark itself is redundant with Hadoop in many ways. Would
> Apache accept h2o as a parallel project to Spark, if so why not do that?
> >
> > Question: Where do we (Mahout user, committer, contributor) invest
> extremely precious time learning new languages, frameworks, architecture,
> configurations, optimizations?
> >
> > Answer: Many will simply not choose but wait and see, or go elsewhere.
> >
> > Why? Because we fail to communicate “the future of Mahout is Spark
> first—period” It keeps coming out "Spark and, well, h2o too”
> >
> > That is a momentum killer.  If we’re agreed on “Spark first” then
> there’s no need to incubate Mahout 2, Spark and Mahout have already gone
> through that and though Dmitriy’s DSL and Scala shell work is entirely new,
> to the end user the jobs, input and output, and functionality will look
> like a v2. People dealing with internals will see a different world but
> they should be a minority of users and will hopefully like what they see.
> >
> >
> > Somewhat off subject notes on external politics:
> >
> > We really need to make sure Mahout stays in all the big distros. That
> means Sebastian’s comments are spot on: "The best way to help Mahout is to
> pick up some of the work that needs to be done with regards to
> documentation, examples, Hadoop 2 compatibility and designing the future,
> especially with regards to dataframes”  All the distros are hadoop 2.
> >
> > Incubating Mahout 2 as another project is surely a way out of the
> distros, another momentum killer.
> >
> > Another political question is whether an h2o dependency would be an
> issue to the distros. If we are going to put big efforts into h2o let’s see
> how that plays out first. Spark is already supported by them, even
> Hortonworks has taken a first step with 2.1. If Mahout is in a distro the
> distro will be asked to support it, that’s what they are paid for. Do they
> want to support h2o? I have no idea how they would react to that but it
> affects Mahout.
> >
> >
> > For all these reasons I’d be -1 to any big-bang integration.
> >
> >
> > > On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > >
> > > +1. I don't think anyone said anything, privately or publicly, about
> h20
> > > integration being a bad idea. It's just there's more than one way to
> do it,
> > > so debate is focusing on exploration of pluses and minuses of each
> > > individual proposal (as they come to light). Part of difficulty here
> was
> > > that the expertise intersection of all parts being connected and
> integrated
> > > has been pretty poor on individual basis. So we have to go by scenarios
> > > where a group of specialized experts tries to figure out the solution.
> > >
> > > w.r.t to incubation proposals, it seems dubious for a number reasons.
> > >
> > > Reason 1 is that these projects are the primary factor moving Mahout
> > > anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
> > > frankly not much left in Mahout, so it is reflection of more or less
> common
> > > opinion that the project would just spiral down on its own if the
> things
> > > stay status-quo.
> > >
> > > Reason 2 is that there are good (not irreplaceable, but good)
> components in
> > > Mahout that these efforts depend on. Therefore, incubation would be
> faced
> > > with a perspective of having dependencies on project that on its own is
> > > winding down. Not good for incubation side.
> > >
> > > Reason 3 is that current effort is (IMO) minimalistic enough not to
> warrant
> > > a new project. It simply doesn't, and can't have the scale of things
> like
> > > Spark or Hadoop eco. There would be just not enough substance for a new
> > > project at this point. I don't feel very strong about this point
> though.
> > >
> > >
> > > On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org>
> wrote:
> > >
> > >> We all should calm down here and remind ourselves why we are doing
> this
> > >> whole thing: Because we love open source and want to have a vibrant
> > >> community and a great piece of software.
> > >>
> > >> Mahout has come a long way and is at a crossroads right now, so its
> only
> > >> natural that there are heated discussions. But, we should immediately
> stop
> > >> the fingerpointing and related stuff, we have managed to avoid this
> since
> > >> Mahout's inception and we should continue to do so.
> > >>
> > >> The best way to help Mahout is to pick up some of the work that needs
> to
> > >> be done with regards to documentation, examples, Hadoop 2
> compatibility and
> > >> designing the future, especially with regards to dataframes e.g.
> > >>
> > >> We agreed to give the h2O guys a shot for exploration of a possible
> > >> integration into Mahout. We should be grateful that they are
> investing a
> > >> lot of time into this, and should help whereever we can. Once they
> come up
> > >> with a concrete proposal or patch, we will have a look at it, have a
> deep,
> > >> technical and polite discussion, and make a decision afterwards.
> > >>
> > >> --sebastian
> > >>
> > >>
> > >>
> > >>
> > >> On 04/28/2014 07:42 PM, Anand Avati wrote:
> > >>
> > >>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
> > >>>
> > >>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
> > >>>> <ji...@apache.org> wrote:
> > >>>>
> > >>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly
> what
> > >>>>>
> > >>>> is encouraging the h2o work to be done a bit apart. It simply isn't
> > >>>> efficient to have to answer so many off-topic points whenever any
> reports
> > >>>> on work in progress are given.
> > >>>>
> > >>>>>
> > >>>>> I think this has been the off-topic here.
> > >>>>>
> > >>>>> Calling my comments "emotional" or "non-technical", or _loosely_
> > >>>>>
> > >>>> paraphrasing me.
> > >>>>
> > >>>> Yes, the personal finger-pointing parts don't belong and don't
> > >>>> convince anyone, let's skip those.
> > >>>>
> > >>>>
> > >>> +1. Let's skip those.
> > >>>
> > >>>
> > >>> From the sidelines, I see a bunch of work intended for Mahout
> > >>>
> > >>>> proceeding outside the community such as it is, and even Apache. Of
> > >>>> course, contributions are always prepped externally to some degree.
> I
> > >>>> create, debug, change patches before posting them, maybe checking in
> > >>>> early on choices that others may want input on.
> > >>>>
> > >>>> This is a large-ish change being proposed, IIUC. I can see one
> person
> > >>>> who publicly, and at least two who privately, have clear
> reservations
> > >>>> about this direction.
> > >>>>
> > >>>
> > >>>
> > >>> It will probably be a large-ish change, indeed. But my personal take
> is
> > >>> that, non-technical aspects of the debate is unfortunately taking
> > >>> precedence over real technical parts. Please refer to email thread
> "Mahout
> > >>> DSL vs Spark".
> > >>>
> > >>>
> > >>>
> > >>> It certainly appears funny vis-a-vis the "Apache
> > >>>> way" to work on a contribution *because* one (or more) other
> > >>>> committers aren't convinced.
> > >>>>
> > >>>>
> > >>> As mentioned in the referred email thread, a lot of the technical
> issues
> > >>> which got addressed in the work which was carried out outside of
> Apache,
> > >>> was really sorting out and highlighting build and classloader related
> > >>> challenges on the H2O side. There was little motivation to carry out
> those
> > >>> discussions on the Mahout lists as it was really ~99% H2O specific
> > >>> discussions and noise/spam to the Mahout community.
> > >>>
> > >>> I don't think that's important to dither about. What is, is this: if
> a
> > >>>
> > >>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
> > >>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
> > >>>> yet, but it seems like a quite possible outcome.
> > >>>>
> > >>>>
> > >>> As an outsider, my opinion is that the proposed need for a VOTE is a
> > >>> largely masqueraded problem built around the perception of
> disagreement
> > >>> over something vague, abstract and inaccurate. And therefore
> premature.
> > >>> That being said the PMC may vote on any issues/non-issues it may
> please.
> > >>>
> > >>> Would be a shame to do a lot of work, intending it for a commit, and
> > >>>
> > >>>> then find there is not consensus.
> > >>>>
> > >>>>
> > >>> Exactly the kind of inaccurate perception I meant. While we are (at
> least
> > >>> I
> > >>> am) exploring the best fit model for integration, and exploration by
> > >>> definition involves taking potentially wrong steps and backtracking
> if
> > >>> necessary, the perception unfortunately seems to be that the proposed
> > >>> intermediate (potentially wrong) steps are some kind of pre-decided
> plan
> > >>> of
> > >>> action. So, no, there WOULDN'T be a lot of work intended for a commit
> > >>> against consensus.
> > >>>
> > >>> So is it better to figure out earlier than later whether these 2+
> > >>>
> > >>>> parallel tracks have enough commonality to coexist?
> > >>>>
> > >>>
> > >>>
> > >>> Whether two parallel tracks (I assume the spark track and the H2O
> track?)
> > >>> have enough commonality to exist - one way you surely cannot get the
> right
> > >>> answer for this (except by co-incidence) is by taking a vote from a
> group
> > >>> who are experts in only either one of those tracks. From what I see,
> most
> > >>> of the opposition has been due to a combination of lack of
> understanding
> > >>> of
> > >>> H2O and (welcome) skepticism. If, as a contributor, I find there is
> no
> > >>> natural or beneficial way to co-exist with Spark, I wouldn't waste
> my time
> > >>> writing code, and for sure am not dependent on another group's vote
> to
> > >>> make
> > >>> that decision for me.
> > >>>
> > >>> Avati
> > >>>
> > >>>
> > >>
> > >
>
>

RE: Straw poll re: H2O ?

Posted by Saikat Kanjilal <sx...@hotmail.com>.

I would echo Pat's sentiments: the goal of supporting both Spark and H2O confuses folks who are interested in using, committing to, and trying to understand where Mahout is headed in the small to medium term. I hate to throw this out, but given the amount of sometimes not-so-nice back-and-forth I've seen on issue 1500, I really wonder whether we should have mahout-spark and mahout-h2o as two different top-level projects, potentially supporting a different set of algorithms underneath. Yes, I know tying Mahout to a particular technology goes against the initial vision, but given the churn I'm seeing I'm not sure I understand what the current vision even is :)

> Subject: Re: Straw poll re: H2O ?
> From: pat@occamsmachete.com
> Date: Mon, 28 Apr 2014 13:17:03 -0700
> CC: ssc@apache.org
> To: dev@mahout.apache.org
> 
> I haven’t heard a good explanation of what this project is. There should be some small step like implementing an algo on h2o to takes the same input as a current Hadoop Mahout job and produce the same result or do one not already in Mahout. At least it will answer some technical questions and shouldn’t take a lot of support from current committers to produce.
> 
> I’m still not convinced that this is the primary thing that should drive making it a Mahout dependency.
> 
> I’m highly dubious of actively supporting and working on Mahout for Spark and h2o. Not for technical reasons but because rebooting Mahout on two platforms seems a non-starter. No project manager in the commercial world would allow that sort of thing. And rightly so, it confuses users, committers, contributors. You shouldn’t have a great deal of redundancy or competing efforts _inside_ a project even an open source one. That’s for separate projects and the incubator, right? There are plenty of examples of going that route, Spark itself is redundant with Hadoop in many ways. Would Apache accept h2o as a parallel project to Spark, if so why not do that?
> 
> Question: Where do we (Mahout user, committer, contributor) invest extremely precious time learning new languages, frameworks, architecture, configurations, optimizations?
> 
> Answer: Many will simply not choose but wait and see, or go elsewhere.
> 
> Why? Because we fail to communicate “the future of Mahout is Spark first—period” It keeps coming out "Spark and, well, h2o too”
> 
> That is a momentum killer.  If we’re agreed on “Spark first” then there’s no need to incubate Mahout 2, Spark and Mahout have already gone through that and though Dmitriy’s DSL and Scala shell work is entirely new, to the end user the jobs, input and output, and functionality will look like a v2. People dealing with internals will see a different world but they should be a minority of users and will hopefully like what they see.
> 
> 
> Somewhat off subject notes on external politics:
> 
> We really need to make sure Mahout stays in all the big distros. That means Sebastian’s comments are spot on: "The best way to help Mahout is to pick up some of the work that needs to be done with regards to documentation, examples, Hadoop 2 compatibility and designing the future, especially with regards to dataframes”  All the distros are hadoop 2.
> 
> Incubating Mahout 2 as another project is surely a way out of the distros, another momentum killer.
> 
> Another political question is whether an h2o dependency would be an issue to the distros. If we are going to put big efforts into h2o let’s see how that plays out first. Spark is already supported by them, even Hortonworks has taken a first step with 2.1. If Mahout is in a distro the distro will be asked to support it, that’s what they are paid for. Do they want to support h2o? I have no idea how they would react to that but it affects Mahout.
> 
> 
> For all these reasons I’d be -1 to any big-bang integration.
> 
> 
> > On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> > 
> > +1. I don't think anyone said anything, privately or publicly, about h20
> > integration being a bad idea. It's just there's more than one way to do it,
> > so debate is focusing on exploration of pluses and minuses of each
> > individual proposal (as they come to light). Part of difficulty here was
> > that the expertise intersection of all parts being connected and integrated
> > has been pretty poor on individual basis. So we have to go by scenarios
> > where a group of specialized experts tries to figure out the solution.
> > 
> > w.r.t to incubation proposals, it seems dubious for a number reasons.
> > 
> > Reason 1 is that these projects are the primary factor moving Mahout
> > anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
> > frankly not much left in Mahout, so it is reflection of more or less common
> > opinion that the project would just spiral down on its own if the things
> > stay status-quo.
> > 
> > Reason 2 is that there are good (not irreplaceable, but good) components in
> > Mahout that these efforts depend on. Therefore, incubation would be faced
> > with a perspective of having dependencies on project that on its own is
> > winding down. Not good for incubation side.
> > 
> > Reason 3 is that current effort is (IMO) minimalistic enough not to warrant
> > a new project. It simply doesn't, and can't have the scale of things like
> > Spark or Hadoop eco. There would be just not enough substance for a new
> > project at this point. I don't feel very strong about this point though.
> > 
> > 
> > On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org> wrote:
> > 
> >> We all should calm down here and remind ourselves why we are doing this
> >> whole thing: Because we love open source and want to have a vibrant
> >> community and a great piece of software.
> >> 
> >> Mahout has come a long way and is at a crossroads right now, so its only
> >> natural that there are heated discussions. But, we should immediately stop
> >> the fingerpointing and related stuff, we have managed to avoid this since
> >> Mahout's inception and we should continue to do so.
> >> 
> >> The best way to help Mahout is to pick up some of the work that needs to
> >> be done with regards to documentation, examples, Hadoop 2 compatibility and
> >> designing the future, especially with regards to dataframes e.g.
> >> 
> >> We agreed to give the h2O guys a shot for exploration of a possible
> >> integration into Mahout. We should be grateful that they are investing a
> >> lot of time into this, and should help whereever we can. Once they come up
> >> with a concrete proposal or patch, we will have a look at it, have a deep,
> >> technical and polite discussion, and make a decision afterwards.
> >> 
> >> --sebastian
> >> 
> >> 
> >> 
> >> 
> >> On 04/28/2014 07:42 PM, Anand Avati wrote:
> >> 
> >>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
> >>> 
> >>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
> >>>> <ji...@apache.org> wrote:
> >>>> 
> >>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what
> >>>>> 
> >>>> is encouraging the h2o work to be done a bit apart. It simply isn't
> >>>> efficient to have to answer so many off-topic points whenever any reports
> >>>> on work in progress are given.
> >>>> 
> >>>>> 
> >>>>> I think this has been the off-topic here.
> >>>>> 
> >>>>> Calling my comments "emotional" or "non-technical", or _loosely_
> >>>>> 
> >>>> paraphrasing me.
> >>>> 
> >>>> Yes, the personal finger-pointing parts don't belong and don't
> >>>> convince anyone, let's skip those.
> >>>> 
> >>>> 
> >>> +1. Let's skip those.
> >>> 
> >>> 
> >>> From the sidelines, I see a bunch of work intended for Mahout
> >>> 
> >>>> proceeding outside the community such as it is, and even Apache. Of
> >>>> course, contributions are always prepped externally to some degree. I
> >>>> create, debug, change patches before posting them, maybe checking in
> >>>> early on choices that others may want input on.
> >>>> 
> >>>> This is a large-ish change being proposed, IIUC. I can see one person
> >>>> who publicly, and at least two who privately, have clear reservations
> >>>> about this direction.
> >>>> 
> >>> 
> >>> 
> >>> It will probably be a large-ish change, indeed. But my personal take is
> >>> that, non-technical aspects of the debate is unfortunately taking
> >>> precedence over real technical parts. Please refer to email thread "Mahout
> >>> DSL vs Spark".
> >>> 
> >>> 
> >>> 
> >>> It certainly appears funny vis-a-vis the "Apache
> >>>> way" to work on a contribution *because* one (or more) other
> >>>> committers aren't convinced.
> >>>> 
> >>>> 
> >>> As mentioned in the referred email thread, a lot of the technical issues
> >>> which got addressed in the work which was carried out outside of Apache,
> >>> was really sorting out and highlighting build and classloader related
> >>> challenges on the H2O side. There was little motivation to carry out those
> >>> discussions on the Mahout lists as it was really ~99% H2O specific
> >>> discussions and noise/spam to the Mahout community.
> >>> 
> >>> I don't think that's important to dither about. What is, is this: if a
> >>> 
> >>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
> >>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
> >>>> yet, but it seems like a quite possible outcome.
> >>>> 
> >>>> 
> >>> As an outsider, my opinion is that the proposed need for a VOTE is a
> >>> largely masqueraded problem built around the perception of disagreement
> >>> over something vague, abstract and inaccurate. And therefore premature.
> >>> That being said the PMC may vote on any issues/non-issues it may please.
> >>> 
> >>> Would be a shame to do a lot of work, intending it for a commit, and
> >>> 
> >>>> then find there is not consensus.
> >>>> 
> >>>> 
> >>> Exactly the kind of inaccurate perception I meant. While we are (at least
> >>> I
> >>> am) exploring the best fit model for integration, and exploration by
> >>> definition involves taking potentially wrong steps and backtracking if
> >>> necessary, the perception unfortunately seems to be that the proposed
> >>> intermediate (potentially wrong) steps are some kind of pre-decided plan
> >>> of
> >>> action. So, no, there WOULDN'T be a lot of work intended for a commit
> >>> against consensus.
> >>> 
> >>> So is it better to figure out earlier than later whether these 2+
> >>> 
> >>>> parallel tracks have enough commonality to coexist?
> >>>> 
> >>> 
> >>> 
> >>> Whether two parallel tracks (I assume the spark track and the H2O track?)
> >>> have enough commonality to exist - one way you surely cannot get the right
> >>> answer for this (except by co-incidence) is by taking a vote from a group
> >>> who are experts in only either one of those tracks. From what I see, most
> >>> of the opposition has been due to a combination of lack of understanding
> >>> of
> >>> H2O and (welcome) skepticism. If, as a contributor, I find there is no
> >>> natural or beneficial way to co-exist with Spark, I wouldn't waste my time
> >>> writing code, and for sure am not dependent on another group's vote to
> >>> make
> >>> that decision for me.
> >>> 
> >>> Avati
> >>> 
> >>> 
> >> 
> >

Re: Straw poll re: H2O ?

Posted by Pat Ferrel <pa...@occamsmachete.com>.

I haven’t heard a good explanation of what this project is. There should be some small first step, like implementing an algo on h2o that takes the same input as a current Hadoop Mahout job and produces the same result, or implementing one not already in Mahout. At least it would answer some technical questions, and it shouldn’t take a lot of support from current committers to produce.

I’m still not convinced that this is the primary thing that should drive making it a Mahout dependency.

I’m highly dubious of actively supporting and working on Mahout for both Spark and h2o. Not for technical reasons, but because rebooting Mahout on two platforms seems a non-starter. No project manager in the commercial world would allow that sort of thing. And rightly so: it confuses users, committers, and contributors. You shouldn’t have a great deal of redundancy or competing efforts _inside_ a project, even an open source one. That’s for separate projects and the incubator, right? There are plenty of examples of going that route; Spark itself is redundant with Hadoop in many ways. Would Apache accept h2o as a parallel project to Spark? If so, why not do that?

Question: Where do we (Mahout user, committer, contributor) invest extremely precious time learning new languages, frameworks, architecture, configurations, optimizations?

Answer: Many will simply not choose but wait and see, or go elsewhere.

Why? Because we fail to communicate “the future of Mahout is Spark first, period.” It keeps coming out as “Spark and, well, h2o too.”

That is a momentum killer. If we’re agreed on “Spark first” then there’s no need to incubate Mahout 2; Spark and Mahout have already gone through that. Though Dmitriy’s DSL and Scala shell work is entirely new, to the end user the jobs, input and output, and functionality will look like a v2. People dealing with internals will see a different world, but they should be a minority of users and will hopefully like what they see.


Somewhat off subject notes on external politics:

We really need to make sure Mahout stays in all the big distros. That means Sebastian’s comments are spot on: “The best way to help Mahout is to pick up some of the work that needs to be done with regards to documentation, examples, Hadoop 2 compatibility and designing the future, especially with regards to dataframes.” All the distros are on Hadoop 2.

Incubating Mahout 2 as another project is surely a way out of the distros, another momentum killer.

Another political question is whether an h2o dependency would be an issue for the distros. If we are going to put big efforts into h2o, let’s see how that plays out first. Spark is already supported by them; even Hortonworks has taken a first step with 2.1. If Mahout is in a distro, the distro will be asked to support it; that’s what they are paid for. Do they want to support h2o? I have no idea how they would react to that, but it affects Mahout.


For all these reasons I’d be -1 to any big-bang integration.


> On Apr 28, 2014, at 11:50 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> +1. I don't think anyone said anything, privately or publicly, about h20
> integration being a bad idea. It's just there's more than one way to do it,
> so debate is focusing on exploration of pluses and minuses of each
> individual proposal (as they come to light). Part of difficulty here was
> that the expertise intersection of all parts being connected and integrated
> has been pretty poor on individual basis. So we have to go by scenarios
> where a group of specialized experts tries to figure out the solution.
> 
> w.r.t to incubation proposals, it seems dubious for a number reasons.
> 
> Reason 1 is that these projects are the primary factor moving Mahout
> anywhere forward. Without them, given "bye-bye mapreduce" jira, there's
> frankly not much left in Mahout, so it is reflection of more or less common
> opinion that the project would just spiral down on its own if the things
> stay status-quo.
> 
> Reason 2 is that there are good (not irreplaceable, but good) components in
> Mahout that these efforts depend on. Therefore, incubation would be faced
> with a perspective of having dependencies on project that on its own is
> winding down. Not good for incubation side.
> 
> Reason 3 is that current effort is (IMO) minimalistic enough not to warrant
> a new project. It simply doesn't, and can't have the scale of things like
> Spark or Hadoop eco. There would be just not enough substance for a new
> project at this point. I don't feel very strong about this point though.
> 
> 
> On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> We all should calm down here and remind ourselves why we are doing this
>> whole thing: Because we love open source and want to have a vibrant
>> community and a great piece of software.
>> 
>> Mahout has come a long way and is at a crossroads right now, so its only
>> natural that there are heated discussions. But, we should immediately stop
>> the fingerpointing and related stuff, we have managed to avoid this since
>> Mahout's inception and we should continue to do so.
>> 
>> The best way to help Mahout is to pick up some of the work that needs to
>> be done with regards to documentation, examples, Hadoop 2 compatibility and
>> designing the future, especially with regards to dataframes e.g.
>> 
>> We agreed to give the h2O guys a shot for exploration of a possible
>> integration into Mahout. We should be grateful that they are investing a
>> lot of time into this, and should help whereever we can. Once they come up
>> with a concrete proposal or patch, we will have a look at it, have a deep,
>> technical and polite discussion, and make a decision afterwards.
>> 
>> --sebastian
>> 
>> 
>> 
>> 
>> On 04/28/2014 07:42 PM, Anand Avati wrote:
>> 
>>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
>>> 
>>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
>>>> <ji...@apache.org> wrote:
>>>> 
>>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what
>>>>> 
>>>> is encouraging the h2o work to be done a bit apart. It simply isn't
>>>> efficient to have to answer so many off-topic points whenever any reports
>>>> on work in progress are given.
>>>> 
>>>>> 
>>>>> I think this has been the off-topic here.
>>>>> 
>>>>> Calling my comments "emotional" or "non-technical", or _loosely_
>>>>> 
>>>> paraphrasing me.
>>>> 
>>>> Yes, the personal finger-pointing parts don't belong and don't
>>>> convince anyone, let's skip those.
>>>> 
>>>> 
>>> +1. Let's skip those.
>>> 
>>> 
>>> From the sidelines, I see a bunch of work intended for Mahout
>>> 
>>>> proceeding outside the community such as it is, and even Apache. Of
>>>> course, contributions are always prepped externally to some degree. I
>>>> create, debug, change patches before posting them, maybe checking in
>>>> early on choices that others may want input on.
>>>> 
>>>> This is a large-ish change being proposed, IIUC. I can see one person
>>>> who publicly, and at least two who privately, have clear reservations
>>>> about this direction.
>>>> 
>>> 
>>> 
>>> It will probably be a large-ish change, indeed. But my personal take is
>>> that, non-technical aspects of the debate is unfortunately taking
>>> precedence over real technical parts. Please refer to email thread "Mahout
>>> DSL vs Spark".
>>> 
>>> 
>>> 
>>> It certainly appears funny vis-a-vis the "Apache
>>>> way" to work on a contribution *because* one (or more) other
>>>> committers aren't convinced.
>>>> 
>>>> 
>>> As mentioned in the referred email thread, a lot of the technical issues
>>> which got addressed in the work which was carried out outside of Apache,
>>> was really sorting out and highlighting build and classloader related
>>> challenges on the H2O side. There was little motivation to carry out those
>>> discussions on the Mahout lists as it was really ~99% H2O specific
>>> discussions and noise/spam to the Mahout community.
>>> 
>>> I don't think that's important to dither about. What is, is this: if a
>>> 
>>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
>>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
>>>> yet, but it seems like a quite possible outcome.
>>>> 
>>>> 
>>> As an outsider, my opinion is that the proposed need for a VOTE is a
>>> largely masqueraded problem built around the perception of disagreement
>>> over something vague, abstract and inaccurate. And therefore premature.
>>> That being said the PMC may vote on any issues/non-issues it may please.
>>> 
>>> Would be a shame to do a lot of work, intending it for a commit, and
>>> 
>>>> then find there is not consensus.
>>>> 
>>>> 
>>> Exactly the kind of inaccurate perception I meant. While we are (at least
>>> I
>>> am) exploring the best fit model for integration, and exploration by
>>> definition involves taking potentially wrong steps and backtracking if
>>> necessary, the perception unfortunately seems to be that the proposed
>>> intermediate (potentially wrong) steps are some kind of pre-decided plan
>>> of
>>> action. So, no, there WOULDN'T be a lot of work intended for a commit
>>> against consensus.
>>> 
>>> So is it better to figure out earlier than later whether these 2+
>>> 
>>>> parallel tracks have enough commonality to coexist?
>>>> 
>>> 
>>> 
>>> Whether two parallel tracks (I assume the spark track and the H2O track?)
>>> have enough commonality to exist - one way you surely cannot get the right
>>> answer for this (except by co-incidence) is by taking a vote from a group
>>> who are experts in only either one of those tracks. From what I see, most
>>> of the opposition has been due to a combination of lack of understanding
>>> of
>>> H2O and (welcome) skepticism. If, as a contributor, I find there is no
>>> natural or beneficial way to co-exist with Spark, I wouldn't waste my time
>>> writing code, and for sure am not dependent on another group's vote to
>>> make
>>> that decision for me.
>>> 
>>> Avati
>>> 
>>> 
>> 
>

Re: Straw poll re: H2O ?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

+1. I don't think anyone said anything, privately or publicly, about H2O
integration being a bad idea. It's just that there's more than one way to do it,
so the debate is focusing on exploring the pluses and minuses of each
individual proposal (as they come to light). Part of the difficulty here was
that few individuals have expertise across all the parts being connected and
integrated. So we have to go by scenarios
where a group of specialized experts tries to figure out the solution.

W.r.t. the incubation proposals, they seem dubious for a number of reasons.

Reason 1 is that these projects are the primary factor moving Mahout
forward at all. Without them, given the "bye-bye mapreduce" jira, there's
frankly not much left in Mahout, so it is a reflection of a more or less common
opinion that the project would just spiral down on its own if things
stay status quo.

Reason 2 is that there are good (not irreplaceable, but good) components in
Mahout that these efforts depend on. Therefore, an incubating project would be faced
with the prospect of depending on a project that is itself
winding down. Not good for the incubation side.

Reason 3 is that the current effort is (IMO) minimalistic enough not to warrant
a new project. It simply doesn't, and can't, have the scale of things like
Spark or the Hadoop ecosystem. There would just not be enough substance for a new
project at this point. I don't feel very strongly about this point though.


On Mon, Apr 28, 2014 at 11:09 AM, Sebastian Schelter <ss...@apache.org> wrote:

> We all should calm down here and remind ourselves why we are doing this
> whole thing: Because we love open source and want to have a vibrant
> community and a great piece of software.
>
> Mahout has come a long way and is at a crossroads right now, so it's only
> natural that there are heated discussions. But we should immediately stop
> the finger-pointing and related stuff; we have managed to avoid this since
> Mahout's inception and we should continue to do so.
>
> The best way to help Mahout is to pick up some of the work that needs to
> be done with regards to documentation, examples, Hadoop 2 compatibility and
> designing the future, especially with regards to dataframes e.g.
>
> We agreed to give the H2O guys a shot at exploring a possible
> integration into Mahout. We should be grateful that they are investing a
> lot of time into this, and should help wherever we can. Once they come up
> with a concrete proposal or patch, we will have a look at it, have a deep,
> technical and polite discussion, and make a decision afterwards.
>
> --sebastian
>
>
>
>
> On 04/28/2014 07:42 PM, Anand Avati wrote:
>
>> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
>>
>>  On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
>>> <ji...@apache.org> wrote:
>>>
>>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what
>>>>
>>> is encouraging the h2o work to be done a bit apart. It simply isn't
>>> efficient to have to answer so many off-topic points whenever any reports
>>> on work in progress are given.
>>>
>>>>
>>>> I think this has been the off-topic here.
>>>>
>>>> Calling my comments "emotional" or "non-technical", or _loosely_
>>>>
>>> paraphrasing me.
>>>
>>> Yes, the personal finger-pointing parts don't belong and don't
>>> convince anyone, let's skip those.
>>>
>>>
>> +1. Let's skip those.
>>
>>
>>  From the sidelines, I see a bunch of work intended for Mahout
>>
>>> proceeding outside the community such as it is, and even Apache. Of
>>> course, contributions are always prepped externally to some degree. I
>>> create, debug, change patches before posting them, maybe checking in
>>> early on choices that others may want input on.
>>>
>>> This is a large-ish change being proposed, IIUC. I can see one person
>>> who publicly, and at least two who privately, have clear reservations
>>> about this direction.
>>>
>>
>>
>> It will probably be a large-ish change, indeed. But my personal take is
>> that, non-technical aspects of the debate is unfortunately taking
>> precedence over real technical parts. Please refer to email thread "Mahout
>> DSL vs Spark".
>>
>>
>>
>>  It certainly appears funny vis-a-vis the "Apache
>>> way" to work on a contribution *because* one (or more) other
>>> committers aren't convinced.
>>>
>>>
>> As mentioned in the referred email thread, a lot of the technical issues
>> which got addressed in the work which was carried out outside of Apache,
>> was really sorting out and highlighting build and classloader related
>> challenges on the H2O side. There was little motivation to carry out those
>> discussions on the Mahout lists as it was really ~99% H2O specific
>> discussions and noise/spam to the Mahout community.
>>
>> I don't think that's important to dither about. What is, is this: if a
>>
>>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
>>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
>>> yet, but it seems like a quite possible outcome.
>>>
>>>
>> As an outsider, my opinion is that the proposed need for a VOTE is a
>> largely masqueraded problem built around the perception of disagreement
>> over something vague, abstract and inaccurate. And therefore premature.
>> That being said the PMC may vote on any issues/non-issues it may please.
>>
>> Would be a shame to do a lot of work, intending it for a commit, and
>>
>>> then find there is not consensus.
>>>
>>>
>> Exactly the kind of inaccurate perception I meant. While we are (at least
>> I
>> am) exploring the best fit model for integration, and exploration by
>> definition involves taking potentially wrong steps and backtracking if
>> necessary, the perception unfortunately seems to be that the proposed
>> intermediate (potentially wrong) steps are some kind of pre-decided plan
>> of
>> action. So, no, there WOULDN'T be a lot of work intended for a commit
>> against consensus.
>>
>> So is it better to figure out earlier than later whether these 2+
>>
>>> parallel tracks have enough commonality to coexist?
>>>
>>
>>
>> Whether two parallel tracks (I assume the spark track and the H2O track?)
>> have enough commonality to exist - one way you surely cannot get the right
>> answer for this (except by co-incidence) is by taking a vote from a group
>> who are experts in only either one of those tracks. From what I see, most
>> of the opposition has been due to a combination of lack of understanding
>> of
>> H2O and (welcome) skepticism. If, as a contributor, I find there is no
>> natural or beneficial way to co-exist with Spark, I wouldn't waste my time
>> writing code, and for sure am not dependent on another group's vote to
>> make
>> that decision for me.
>>
>> Avati
>>
>>
>

Re: Straw poll re: H2O ?

Posted by Sebastian Schelter <ss...@apache.org>.

We all should calm down here and remind ourselves why we are doing this 
whole thing: Because we love open source and want to have a vibrant 
community and a great piece of software.

Mahout has come a long way and is at a crossroads right now, so it's only
natural that there are heated discussions. But we should immediately
stop the finger-pointing and related stuff; we have managed to avoid this
since Mahout's inception and we should continue to do so.

The best way to help Mahout is to pick up some of the work that needs to
be done with regard to documentation, examples, Hadoop 2 compatibility,
and designing the future, for example with regard to dataframes.

We agreed to give the H2O guys a shot at exploring a possible
integration into Mahout. We should be grateful that they are investing a
lot of time in this, and should help wherever we can. Once they come
up with a concrete proposal or patch, we will have a look at it, have a
deep, technical and polite discussion, and make a decision afterwards.

--sebastian



On 04/28/2014 07:42 PM, Anand Avati wrote:
> On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
>> <ji...@apache.org> wrote:
>>> bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what
>>> is encouraging the h2o work to be done a bit apart. It simply isn't
>>> efficient to have to answer so many off-topic points whenever any reports
>>> on work in progress are given.
>>>
>>> I think this has been the off-topic here.
>>>
>>> Calling my comments "emotional" or "non-technical", or _loosely_
>>> paraphrasing me.
>>
>> Yes, the personal finger-pointing parts don't belong and don't
>> convince anyone, let's skip those.
>>
>
> +1. Let's skip those.
>
>
>> From the sidelines, I see a bunch of work intended for Mahout
>> proceeding outside the community such as it is, and even Apache. Of
>> course, contributions are always prepped externally to some degree. I
>> create, debug, change patches before posting them, maybe checking in
>> early on choices that others may want input on.
>>
>> This is a large-ish change being proposed, IIUC. I can see one person
>> who publicly, and at least two who privately, have clear reservations
>> about this direction.
>
>
> It will probably be a large-ish change, indeed. But my personal take is
> that, non-technical aspects of the debate is unfortunately taking
> precedence over real technical parts. Please refer to email thread "Mahout
> DSL vs Spark".
>
>
>
>> It certainly appears funny vis-a-vis the "Apache
>> way" to work on a contribution *because* one (or more) other
>> committers aren't convinced.
>>
>
> As mentioned in the referred email thread, a lot of the technical issues
> which got addressed in the work which was carried out outside of Apache,
> was really sorting out and highlighting build and classloader related
> challenges on the H2O side. There was little motivation to carry out those
> discussions on the Mahout lists as it was really ~99% H2O specific
> discussions and noise/spam to the Mahout community.
>
>> I don't think that's important to dither about. What is, is this: if a
>> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
>> Nobody can pre-judge his/her opinion on a proposal that's not tabled
>> yet, but it seems like a quite possible outcome.
>>
>
> As an outsider, my opinion is that the proposed need for a VOTE is a
> largely masqueraded problem built around the perception of disagreement
> over something vague, abstract and inaccurate. And therefore premature.
> That being said the PMC may vote on any issues/non-issues it may please.
>
>> Would be a shame to do a lot of work, intending it for a commit, and
>> then find there is not consensus.
>>
>
> Exactly the kind of inaccurate perception I meant. While we are (at least I
> am) exploring the best fit model for integration, and exploration by
> definition involves taking potentially wrong steps and backtracking if
> necessary, the perception unfortunately seems to be that the proposed
> intermediate (potentially wrong) steps are some kind of pre-decided plan of
> action. So, no, there WOULDN'T be a lot of work intended for a commit
> against consensus.
>
>> So is it better to figure out earlier than later whether these 2+
>> parallel tracks have enough commonality to coexist?
>
>
> Whether two parallel tracks (I assume the spark track and the H2O track?)
> have enough commonality to exist - one way you surely cannot get the right
> answer for this (except by co-incidence) is by taking a vote from a group
> who are experts in only either one of those tracks. From what I see, most
> of the opposition has been due to a combination of lack of understanding of
> H2O and (welcome) skepticism. If, as a contributor, I find there is no
> natural or beneficial way to co-exist with Spark, I wouldn't waste my time
> writing code, and for sure am not dependent on another group's vote to make
> that decision for me.
>
> Avati
>

Re: Straw poll re: H2O ?

Posted by Anand Avati <av...@gluster.org>.

On Mon, Apr 28, 2014 at 2:18 AM, Sean Owen <sr...@gmail.com> wrote:

> On Mon, Apr 28, 2014 at 3:39 AM, Dmitriy Lyubimov (JIRA)
> <ji...@apache.org> wrote:
> > bq. The emotional tenor of Dmitriy Lyubimov's comments are exactly what
> > is encouraging the h2o work to be done a bit apart. It simply isn't
> > efficient to have to answer so many off-topic points whenever any reports
> > on work in progress are given.
> >
> > I think this has been the off-topic here.
> >
> > Calling my comments "emotional" or "non-technical", or _loosely_
> > paraphrasing me.
>
> Yes, the personal finger-pointing parts don't belong and don't
> convince anyone, let's skip those.
>

+1. Let's skip those.

> From the sidelines, I see a bunch of work intended for Mahout
> proceeding outside the community such as it is, and even Apache. Of
> course, contributions are always prepped externally to some degree. I
> create, debug, change patches before posting them, maybe checking in
> early on choices that others may want input on.
>
> This is a large-ish change being proposed, IIUC. I can see one person
> who publicly, and at least two who privately, have clear reservations
> about this direction.

It will probably be a large-ish change, indeed. But my personal take is
that non-technical aspects of the debate are unfortunately taking
precedence over the real technical parts. Please refer to the email thread
"Mahout DSL vs Spark".

> It certainly appears funny vis-a-vis the "Apache
> way" to work on a contribution *because* one (or more) other
> committers aren't convinced.
>

As mentioned in the referenced email thread, much of the technical work
that was carried out outside of Apache was really about sorting out and
highlighting build- and classloader-related challenges on the H2O side.
There was little motivation to carry out those discussions on the Mahout
lists, as they were ~99% H2O-specific and would have been noise/spam to
the Mahout community.

> I don't think that's important to dither about. What is, is this: if a
> big-bang patch landed tomorrow, I wonder if it would pass a VOTE?
> Nobody can pre-judge his/her opinion on a proposal that's not tabled
> yet, but it seems like a quite possible outcome.
>

As an outsider, my opinion is that the proposed need for a VOTE is a
largely masqueraded problem built around the perception of disagreement
over something vague, abstract and inaccurate. And therefore premature.
That being said, the PMC may vote on any issues/non-issues it pleases.

> Would be a shame to do a lot of work, intending it for a commit, and
> then find there is not consensus.
>

Exactly the kind of inaccurate perception I meant. We are (at least I
am) exploring the best-fit model for integration, and exploration by
definition involves taking potentially wrong steps and backtracking if
necessary. Yet the perception unfortunately seems to be that the proposed
intermediate (potentially wrong) steps are some kind of pre-decided plan of
action. So, no, there WOULDN'T be a lot of work intended for a commit
against consensus.

> So is it better to figure out earlier than later whether these 2+
> parallel tracks have enough commonality to coexist?

Whether two parallel tracks (I assume the Spark track and the H2O track?)
have enough commonality to coexist: one way you surely cannot get the right
answer to this (except by coincidence) is by taking a vote from a group
who are experts in only one of those tracks. From what I see, most
of the opposition has been due to a combination of lack of understanding of
H2O and (welcome) skepticism. If, as a contributor, I find there is no
natural or beneficial way to coexist with Spark, I wouldn't waste my time
writing code, and I am for sure not dependent on another group's vote to
make that decision for me.

Avati