Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/07/12 19:57:58 UTC

Call for vote on integrating h2o

Why not put this argument to bed with a vote? Straw poll or not, it will make the consensus visible so we can get on with things. I know that many are on vacation now, but please take time to vote; we really need a large sample of active committers. Feel free to give a short defense of your position too. I further propose we keep this at the 1000-meter level and not start quoting code: let’s look at the forest instead of the trees.

The choice as far as I can tell is:

1) merge the h2o implementation of the math-scala and h2o modules into mainstream Mahout. I suppose this implies accepting h2o-specific code too, though someone can contradict me here.
2) support h2o in integrating math and math-scala with their engine project (even as an artifact) and be welcoming and responsive with this support.
3) break the DSL into its own project, give it a name like Mahout-core, make all tests engine independent or live in other project code (like h2o or Flink). Then all the rest implements on Spark (the rest of Mahout), h2o, or Flink. This is the Linux kernel approach: many distros but one kernel.

I support #2. The reasons:

1) engine-specific work should be done by the experts, and work done on one engine should never affect work done on another.
2) math-scala is the closest thing to engine independence we have, but it is not complete. Changes to it will need to be negotiated and cannot be forced into a single commit, as they would be if breakage in h2o also broke the build.
3) No committer should have to understand all engines. Current work, whether inside the DSL or not, often requires additions to the DSL and often requires the committer to pick an engine or design a new abstraction. This work of finding abstractions should not be forced into a single commit.
4) Mahout gets no known advantage by merging this PR. The alternative is that h2o merges it with their project. We still get the benefit of being (at least at the algebra / R-like API / DSL level) a multi-engine project. In other words, we will have proven our stated desire to support other engines.
5) Be welcoming. Providing a key component with the optimizer and DSL (along with all future improvements) to any and all engines, and agreeing to support it and jointly work to keep it core, seems very supportive of the open source community and mentality. There are many ways to work together, and some bad ones.
6) Keeping the engine work separated by project boundaries but supported by mutual PRs will be a much more maintainable and productive way to cooperate. This is the model of choice for most modern OSS projects, especially on GitHub. Git was made for this.
7) When Flink (Stratosphere) looks at cooperating with Mahout, as they have already indicated, isn’t option #2 a much better way to deal with them too? Again, the burden of integration should be with the engine, not Mahout. By merging h2o we would be committing to merging every other viable engine. It’s a slippery slope that the DSL alone may be able to pull off, but not a core team supporting every engine.

I don’t favor #3 because the DSL is not complete, and Mahout on Spark, as its reference implementation, should have the easiest path to modify it. Maybe some day this will be the better alternative.

A word about bona fides. I’m one of a very small number of people to push Scala or Spark code. I’m working on ItemSimilarity and a framework for readers/writers for tuples and DRMs (text-delimited is the first) as well as the core cooccurrence, whose primary author was Sebastian. Plans include a revamp of the item-based recommenders based on earlier hadoop+mahout+solr work. My work is generally outside the DSL but has required several changes or additions to it.

Re: Call for vote on integrating h2o

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jul 14, 2014 at 9:36 AM, Pat Ferrel <pa...@gmail.com> wrote:

> 1) every change to the DSL should be implemented either in core-math or in
> _both_ engines, right? So every committer will have to be willing to take
> this on when changing the DSL, right? We don’t want divergence in DSL
> implementation.
>

Well, I think that every committer should sign up to build a bit of a
consortium of engine-oriented committers to handle the change.  There will
be some specialization before long.


> 2) are we going to allow the build to be broken for extended periods
> (hopefully only a day or two) until one or the other expert gets time to
> help with a DSL implementation?
>

No.  I think that the original committer should insert a stub
implementation that throws an exception and file a JIRA.  The unit test for
the capability may have to be limited temporarily, but the build should not
break.  The engine-doesn't-do-this JIRA should be a release stopper.
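
[Editor's sketch] The stub-plus-JIRA idea could look something like this in Scala. All names here are illustrative, not actual Mahout code; the JIRA key is a placeholder:

```scala
// Hypothetical engine trait; names are illustrative, not actual Mahout APIs.
trait DistributedEngine {
  // A DRM is modeled here as a plain in-memory matrix for illustration.
  def colSums(drm: Array[Array[Double]]): Array[Double]
}

// Spark side: the operation is implemented, so its tests run normally.
object SparkEngine extends DistributedEngine {
  def colSums(drm: Array[Array[Double]]): Array[Double] =
    drm.transpose.map(_.sum)
}

// h2o side: a stub that keeps the build compiling. A JIRA tracking the
// gap would be filed and marked as a release blocker, per the proposal.
object H2OEngine extends DistributedEngine {
  def colSums(drm: Array[Array[Double]]): Array[Double] =
    throw new UnsupportedOperationException(
      "colSums not yet implemented for h2o -- see MAHOUT-NNNN (placeholder)")
}
```

The shared unit test for the capability would then be restricted to engines that implement it, while the build itself stays green.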



> This is for cases where #1 is not possible. This will happen with both
> tests and abstract defs in core-math that are carried through other engine
> specific classes. The way to get things to compile may not be immediately
> obvious so to keep things going a profile or target for each engine might
> help.
>

Profile is an interesting idea.
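
[Editor's sketch] One reading of the profile idea, as a hedged Maven fragment; the module names are illustrative of the proposed layout, not the actual pom:

```xml
<!-- Hypothetical pom.xml fragment: one profile per engine, so a
     contributor can build and test only the engine they know. -->
<profiles>
  <profile>
    <id>spark</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <modules>
      <module>math-scala</module>
      <module>spark</module>
    </modules>
  </profile>
  <profile>
    <id>h2o</id>
    <modules>
      <module>math-scala</module>
      <module>h2o</module>
    </modules>
  </profile>
</profiles>
```

With something like this, `mvn test -Ph2o` would exercise only the shared math-scala module plus the h2o bindings.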


> 3) This will create an instant split in what algos are implemented on h2o
> and spark. We should clearly mark these and ideally minimize them.
>

Agree.


>  4) Users are going to be confused. Do they need to install Spark or not,
> what runs on what, what are the differences? The ideal is to say it all
> runs on both so all users have to do is choose their engine. But that may
> never happen. How do we handle this? There is coming confusion over Hadoop
> mr vs Spark, we don’t want to add to this.
>

Fair point.  Just like the confusion between XFS and EXT3 and EXT4 and ZFS. Needs documentation.


>  5) Can we agree on file level formats and/or other ways to pass a
> parallelized drm from one engine to the other? This will allow us to create
> hybrid pipelines, potentially easing user confusion.
>

I want to avoid file level data communication as much as possible.

Will it be possible to make the file handling generic?  I can see how it
might be and how it might not be possible.  Can we push the file handling
back on the user?  Can we only support a few persistence technologies (say,
local file, hdfs and URL)?
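
[Editor's sketch] Limiting persistence to a few schemes might look like the following; all names are hypothetical, not actual Mahout APIs:

```scala
// Hypothetical engine-neutral persistence seam. A DRM row is modeled
// as a (row key, dense row) pair purely for illustration.
trait DrmReaderWriter {
  def read(uri: String): Seq[(Int, Array[Double])]
  def write(uri: String, drm: Seq[(Int, Array[Double])]): Unit
}

// A dispatcher that supports only the short whitelist suggested above
// (local file, hdfs, and generic URLs) and pushes everything else back
// on the user.
def readerFor(uri: String): String = uri.takeWhile(_ != ':') match {
  case "file"           => "local"
  case "hdfs"           => "hdfs"
  case "http" | "https" => "url"
  case other            => sys.error(s"unsupported scheme: $other")
}
```

Each engine would supply its own `DrmReaderWriter` behind this seam, so a hybrid pipeline could hand a DRM from one engine to another through any whitelisted scheme.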

Re: Call for vote on integrating h2o

Posted by Pat Ferrel <pa...@gmail.com>.
Agree with Suneel and Ted; Let’s rise!

Hmm, but that doesn’t seem to be the question. Still, I agree.

OK, I’ll read between the lines and assume you are voting for #1. Can we see if there is further agreement on ground rules? It’ll be hard to rise if we create a mess.

1) every change to the DSL should be implemented either in core-math or in _both_ engines, right? So every committer will have to be willing to take this on when changing the DSL, right? We don’t want divergence in DSL implementation. 
2) are we going to allow the build to be broken for extended periods (hopefully only a day or two) until one or the other expert gets time to help with a DSL implementation? This is for cases where #1 is not possible. This will happen with both tests and abstract defs in core-math that are carried through other engine specific classes. The way to get things to compile may not be immediately obvious so to keep things going a profile or target for each engine might help.
3) This will create an instant split in what algos are implemented on h2o and spark. We should clearly mark these and ideally minimize them.
4) Users are going to be confused. Do they need to install Spark or not, what runs on what, what are the differences? The ideal is to say it all runs on both so all users have to do is choose their engine. But that may never happen. How do we handle this? There is coming confusion over Hadoop mr vs Spark, we don’t want to add to this.
5) Can we agree on file level formats and/or other ways to pass a parallelized drm from one engine to the other? This will allow us to create hybrid pipelines, potentially easing user confusion.


On Jul 13, 2014, at 3:18 PM, Suneel Marthi <sm...@apache.org> wrote:

Agree with Ted. Let's rise.


On Sun, Jul 13, 2014 at 6:10 PM, Ted Dunning <te...@gmail.com> wrote:

> On Sun, Jul 13, 2014 at 1:42 PM, Anand Avati <av...@gluster.org> wrote:
> 
>> I will accept the verdict of the vote no matter what it is. It is your
>> project after all.
>> 
> 
> I think we need to take Anand's words to heart here.
> 
> We will rise together or fall apart.
> 
> Let's rise.
> 


Re: Call for vote on integrating h2o

Posted by Suneel Marthi <sm...@apache.org>.
Agree with Ted. Let's rise.


On Sun, Jul 13, 2014 at 6:10 PM, Ted Dunning <te...@gmail.com> wrote:

> On Sun, Jul 13, 2014 at 1:42 PM, Anand Avati <av...@gluster.org> wrote:
>
> > I will accept the verdict of the vote no matter what it is. It is your
> > project after all.
> >
>
> I think we need to take Anand's words to heart here.
>
> We will rise together or fall apart.
>
> Let's rise.
>

Re: Call for vote on integrating h2o

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Jul 13, 2014 at 1:42 PM, Anand Avati <av...@gluster.org> wrote:

> I will accept the verdict of the vote no matter what it is. It is your
> project after all.
>

I think we need to take Anand's words to heart here.

We will rise together or fall apart.

Let's rise.

Re: Call for vote on integrating h2o

Posted by Anand Avati <av...@gluster.org>.
On Saturday, July 12, 2014, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Why not put this argument to bed with a vote? Straw poll or not, it will
> make the consensus visible so we can get on with things. I know that many
> are on vacation now, but please take time to vote; we really need a large
> sample of active committers. Feel free to give a short defense of your
> position too. I further propose we keep this at the 1000-meter level and
> not start quoting code: let’s look at the forest instead of the trees.
>
> The choice as far as I can tell is:
>
> 1) merge the h2o implementation of math-scala and h2o modules into
> mainstream Mahout. I suppose this implies accepting h2o specific code too,
> though someone can contradict me here.
> 2) support h2o in integrating math and math-scala with their engine
> project (even as an artifact) and be welcoming and responsive with this
> support.
> 3) break the DSL into its own project, give it a name like Mahout-core,
> make all tests engine independent or live in other project code (like h2o
> or Flink). Then all the rest implements on Spark (the rest of Mahout), h2o,
> or Flink. This is the Linux kernel approach: many distros but one kernel.
>

If one wants to really draw parallels to the Linux world, the Linux kernel itself
is a far better example, especially the VFS and filesystems, with which I am
intimately familiar. You have a generic VFS which implements filesystem
semantics (a logical layer), and a bunch of filesystems which translate
them into on-disk or over-network operations, each in their own way
(physical layers). The kernel team has no different problems than the type
you mention. There are experts in just their fs who don't care how another
fs works, and VFS experts who don't care how any fs is implemented internally.
Yet they all work off the same linux.git. Sometimes changes done in the VFS
result in changes to all fses whose internals the author knows nothing
about. Yet it all works. Very similar story with device drivers.

How does it work there? Because all developers and maintainers work
together in the spirit of collaboration. Of course device drivers are
based on proprietary hardware whose internals are not well known or
published. Of course those device drivers and fses could reside as separate
projects, as the kernel allows loadable modules. But why exist together?
Because it forces all components to stay in sync and not drift apart as API
changes are made. This in turn makes the life of the consumers of the
project much easier, and that is the most important goal for a project.
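
[Editor's sketch] The VFS analogy can be put in a few lines of Scala (names purely illustrative): a logical layer written once against a trait, with each "filesystem" supplying its own physical implementation:

```scala
// VFS-style sketch with hypothetical names: the logical layer knows the
// semantics; the physical backends translate them, each in its own way.
trait Backend { def readBlock(path: String): String }

object Ext4Like extends Backend { def readBlock(p: String) = s"ext4:$p" }
object NfsLike  extends Backend { def readBlock(p: String) = s"nfs:$p" }

// The logical layer is written once against the trait; adding a new
// backend (or engine) means adding one Backend, not touching callers.
class Vfs(backends: Map[String, Backend]) {
  def open(mount: String, path: String): String =
    backends(mount).readBlock(path)
}
```

In the analogy, math-scala plays the VFS role and each engine module plays the role of a filesystem, all living in one tree so API changes cannot drift apart.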

You can argue and provide more reasons than what you have already given
against the merge, and vice versa. It finally boils down to the attitudes
of the project maintainers towards open source collaboration.

I will accept the verdict of the vote no matter what it is. It is your
project after all.

Thanks


I support #2. The reasons:
>
> 1) engine-specific work should be done by the experts, and work done on one
> engine should never affect work done on another.
> 2) math-scala is the closest thing to engine independence we have, but
> it is not complete. Changes to it will need to be negotiated and cannot be
> forced into a single commit, as they would be if breakage in h2o also broke
> the build.
> 3) No committer should have to understand all engines. Current work,
> whether inside the DSL or not, often requires additions to the DSL and
> often requires the committer to pick an engine or design a new abstraction.
> This work of finding abstractions should not be forced into a single commit.
> 4) Mahout gets no known advantage by merging this PR. The alternative is
> that h2o merges it with their project. We still get the benefit of being (at
> least at the algebra / R-like API / DSL level) a multi-engine project. In
> other words, we will have proven our stated desire to support other engines.
> 5) Be welcoming. Providing a key component with the optimizer and DSL
> (along with all future improvements) to any and all engines, and agreeing to
> support it and jointly work to keep it core, seems very supportive of the
> open source community and mentality. There are many ways to work together,
> and some bad ones.
> 6) Keeping the engine work separated by project boundaries but supported
> by mutual PRs will be a much more maintainable and productive way to
> cooperate. This is the model of choice for most modern OSS projects,
> especially on GitHub. Git was made for this.
> 7) When Flink (Stratosphere) looks at cooperating with Mahout, as they have
> already indicated, isn’t option #2 a much better way to deal with them too?
> Again, the burden of integration should be with the engine, not Mahout. By
> merging h2o we would be committing to merging every other viable engine.
> It’s a slippery slope that the DSL alone may be able to pull off, but not a
> core team supporting every engine.
>
> I don’t favor #3 because the DSL is not complete, and Mahout on Spark, as
> its reference implementation, should have the easiest path to modify it.
> Maybe some day this will be the better alternative.
>
> A word about bona fides. I’m one of a very small number of people to push
> Scala or Spark code. I’m working on ItemSimilarity and a framework for
> readers/writers for tuples and DRMs (text-delimited is the first) as well
> as the core cooccurrence, whose primary author was Sebastian. Plans include
> a revamp of the item-based recommenders based on earlier hadoop+mahout+solr
> work. My work is generally outside the DSL but has required several changes
> or additions to it.