You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/06/18 22:29:54 UTC

Engine specific algos

Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random forests

> Also, we don't have any mappings for Spark Streaming -- so if your
> implementation heavily relies on Spark streaming, i think Spark itself is
> the right place for it to be a part of.

We are discouraging engine specific work? Even dismissing Spark Streaming as a whole?

> As it stands we don't have purely (c) methods and indeed i believe these
> methods may be totally engine-specific in which case mllib is one of
> possibly good homes for them. 

Adherence to a specific incarnation of an engine-neutral DSL has become a requirement for inclusion in Mahout? The current DSL cannot be extended? Or it can’t be extended with engine specific ways? Or it can’t be extended with Spark Streaming? I would have thought all of these things desirable otherwise we are limiting ourselves to a subset of what an engine can do or a subset of problems that the current DSL supports. 

I hope I’m misreading this but it looks like we just discourage a contributor from adding post hadoop code in an interesting area to Mahout?

Re: Engine specific algos

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Jun 18, 2014 at 5:29 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> > Finally, I don't view our mission as limited to the DSL work.  We should
> > also accept/generate useful applications of the DSL.
> >
> >
> I thought that's what i said as well.

Cool.  I had thought that this was consensus, but got a different
impression from you last comments.

All's well that ends with consensus to misquote the Bard.

> I'd be happy if we could demonstrate
> a custom e2e application using mahout components for feature extraction,
> vectorization, solution and postprocessing in a few month. In that sense
> the woefully missing stuff here is feature extraction and frames.
>

Indeed.  I am seriously hoping that my day job drops below 60 hours soon so
I can spend some time on the feature extraction part.

> real time streaming such as Spark streaming would requite a bit more
> thinking, it is not clear to me  how we could abstract such capabilities in
> engine-independent way at this point, and even if there's much merit in
> doing that rather than writing algorithms directly for streaming, if that's
> the idea here. But streaming algorithms still can mix in other mahout
> components, which would make them quasi-mahout algorithms of second kind i
> mentioned.

Yeah... I think that we have agreement here as well ... the Spark Streaming
application writers are more likely to be consumers of Mahout than
suppliers of additional algorithms.

Re: Engine specific algos

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Wed, Jun 18, 2014 at 4:58 PM, Ted Dunning <te...@gmail.com> wrote:

> My own take is quite similar but a little different.
>
> ...
>

> Finally, I don't view our mission as limited to the DSL work.  We should
> also accept/generate useful applications of the DSL.
>
>
I thought that's what i said as well. I'd be happy if we could demonstrate
a custom e2e application using mahout components for feature extraction,
vectorization, solution and postprocessing in a few month. In that sense
the woefully missing stuff here is feature extraction and frames.

real time streaming such as Spark streaming would requite a bit more
thinking, it is not clear to me  how we could abstract such capabilities in
engine-independent way at this point, and even if there's much merit in
doing that rather than writing algorithms directly for streaming, if that's
the idea here. But streaming algorithms still can mix in other mahout
components, which would make them quasi-mahout algorithms of second kind i
mentioned.

Re: Engine specific algos

Posted by Ted Dunning <te...@gmail.com>.

My own take is quite similar but a little different.

It would be great of Mahout components are *usable* from spark streaming.
 For instance, model evaluation or on-line clustering both might well fit
here.  Likewise all of our sequential math stuff could be useful.

Highly Spark specific stuff really does belong in MLib, however.  Even they
are liable to reject stuff that depends on spark streaming.  Not my
decision, however.

Finally, I don't view our mission as limited to the DSL work.  We should
also accept/generate useful applications of the DSL.




On Wed, Jun 18, 2014 at 2:09 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Let me try to re-word a little.
>
> Contributions we are accepting should have common parts with Mahout (let's
> not focus on whether they use or do not use Spark streaming).
>
> Does this sound more acceptable?
>
>
> On Wed, Jun 18, 2014 at 2:04 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> > OK, fair enough, for that matter no activity was grounds enough. But that
> > wasn’t really the question I asked and your answer below was not given in
> > the Jira, so...
> >
> > Are you suggesting that my questions can be read as statements, in the
> > name of “narrow(ing) our focus”?
> >
> >
> >
> > On Jun 18, 2014, at 1:37 PM, Sebastian Schelter <ss...@apache.org> wrote:
> >
> > I think rejecting that contribution is the right thing to do. I think its
> > very important to narrow our focus. Let us put our efforts into finishing
> > and polishing what we are working on right now.
> >
> > A big problem of the "old" mahout was that we set the barrier for
> > contributions too low and ended up with lots of non-integrated,
> hard-to-use
> > algorithms of varying quality.
> >
> > What is the problem with not accepting a contribution? We agreed with
> Andy
> > that this might be better suited for inclusion in Spark's codebase and I
> > think that was the right decision.
> >
> > -s
> >
> > On 06/18/2014 10:29 PM, Pat Ferrel wrote:
> > > Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming
> > random forests
> > >
> > >> Also, we don't have any mappings for Spark Streaming -- so if your
> > >> implementation heavily relies on Spark streaming, i think Spark itself
> > is
> > >> the right place for it to be a part of.
> > >
> > > We are discouraging engine specific work? Even dismissing Spark
> > Streaming as a whole?
> > >
> > >> As it stands we don't have purely (c) methods and indeed i believe
> these
> > >> methods may be totally engine-specific in which case mllib is one of
> > >> possibly good homes for them.
> > >
> > > Adherence to a specific incarnation of an engine-neutral DSL has become
> > a requirement for inclusion in Mahout? The current DSL cannot be
> extended?
> > Or it can’t be extended with engine specific ways? Or it can’t be
> extended
> > with Spark Streaming? I would have thought all of these things desirable
> > otherwise we are limiting ourselves to a subset of what an engine can do
> or
> > a subset of problems that the current DSL supports.
> > >
> > > I hope I’m misreading this but it looks like we just discourage a
> > contributor from adding post hadoop code in an interesting area to
> Mahout?
> > >
> >
> >
> >
>

Re: Engine specific algos

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Let me try to re-word a little.

Contributions we are accepting should have common parts with Mahout (let's
not focus on whether they use or do not use Spark streaming).

Does this sound more acceptable?


On Wed, Jun 18, 2014 at 2:04 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> OK, fair enough, for that matter no activity was grounds enough. But that
> wasn’t really the question I asked and your answer below was not given in
> the Jira, so...
>
> Are you suggesting that my questions can be read as statements, in the
> name of “narrow(ing) our focus”?
>
>
>
> On Jun 18, 2014, at 1:37 PM, Sebastian Schelter <ss...@apache.org> wrote:
>
> I think rejecting that contribution is the right thing to do. I think its
> very important to narrow our focus. Let us put our efforts into finishing
> and polishing what we are working on right now.
>
> A big problem of the "old" mahout was that we set the barrier for
> contributions too low and ended up with lots of non-integrated, hard-to-use
> algorithms of varying quality.
>
> What is the problem with not accepting a contribution? We agreed with Andy
> that this might be better suited for inclusion in Spark's codebase and I
> think that was the right decision.
>
> -s
>
> On 06/18/2014 10:29 PM, Pat Ferrel wrote:
> > Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming
> random forests
> >
> >> Also, we don't have any mappings for Spark Streaming -- so if your
> >> implementation heavily relies on Spark streaming, i think Spark itself
> is
> >> the right place for it to be a part of.
> >
> > We are discouraging engine specific work? Even dismissing Spark
> Streaming as a whole?
> >
> >> As it stands we don't have purely (c) methods and indeed i believe these
> >> methods may be totally engine-specific in which case mllib is one of
> >> possibly good homes for them.
> >
> > Adherence to a specific incarnation of an engine-neutral DSL has become
> a requirement for inclusion in Mahout? The current DSL cannot be extended?
> Or it can’t be extended with engine specific ways? Or it can’t be extended
> with Spark Streaming? I would have thought all of these things desirable
> otherwise we are limiting ourselves to a subset of what an engine can do or
> a subset of problems that the current DSL supports.
> >
> > I hope I’m misreading this but it looks like we just discourage a
> contributor from adding post hadoop code in an interesting area to Mahout?
> >
>
>
>

Re: Engine specific algos

Posted by Pat Ferrel <pa...@occamsmachete.com>.

OK, fair enough, for that matter no activity was grounds enough. But that wasn’t really the question I asked and your answer below was not given in the Jira, so...

Are you suggesting that my questions can be read as statements, in the name of “narrow(ing) our focus”?

On Jun 18, 2014, at 1:37 PM, Sebastian Schelter <ss...@apache.org> wrote:

I think rejecting that contribution is the right thing to do. I think its very important to narrow our focus. Let us put our efforts into finishing and polishing what we are working on right now.

A big problem of the "old" mahout was that we set the barrier for contributions too low and ended up with lots of non-integrated, hard-to-use algorithms of varying quality.

What is the problem with not accepting a contribution? We agreed with Andy that this might be better suited for inclusion in Spark's codebase and I think that was the right decision.

-s

On 06/18/2014 10:29 PM, Pat Ferrel wrote:
> Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random forests
> 
>> Also, we don't have any mappings for Spark Streaming -- so if your
>> implementation heavily relies on Spark streaming, i think Spark itself is
>> the right place for it to be a part of.
> 
> We are discouraging engine specific work? Even dismissing Spark Streaming as a whole?
> 
>> As it stands we don't have purely (c) methods and indeed i believe these
>> methods may be totally engine-specific in which case mllib is one of
>> possibly good homes for them.
> 
> Adherence to a specific incarnation of an engine-neutral DSL has become a requirement for inclusion in Mahout? The current DSL cannot be extended? Or it can’t be extended with engine specific ways? Or it can’t be extended with Spark Streaming? I would have thought all of these things desirable otherwise we are limiting ourselves to a subset of what an engine can do or a subset of problems that the current DSL supports.
> 
> I hope I’m misreading this but it looks like we just discourage a contributor from adding post hadoop code in an interesting area to Mahout?
>

Re: Engine specific algos

Posted by Sebastian Schelter <ss...@apache.org>.

I think rejecting that contribution is the right thing to do. I think 
its very important to narrow our focus. Let us put our efforts into 
finishing and polishing what we are working on right now.

A big problem of the "old" mahout was that we set the barrier for 
contributions too low and ended up with lots of non-integrated, 
hard-to-use algorithms of varying quality.

What is the problem with not accepting a contribution? We agreed with 
Andy that this might be better suited for inclusion in Spark's codebase 
and I think that was the right decision.

-s

On 06/18/2014 10:29 PM, Pat Ferrel wrote:
> Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random forests
>
>> Also, we don't have any mappings for Spark Streaming -- so if your
>> implementation heavily relies on Spark streaming, i think Spark itself is
>> the right place for it to be a part of.
>
> We are discouraging engine specific work? Even dismissing Spark Streaming as a whole?
>
>> As it stands we don't have purely (c) methods and indeed i believe these
>> methods may be totally engine-specific in which case mllib is one of
>> possibly good homes for them.
>
> Adherence to a specific incarnation of an engine-neutral DSL has become a requirement for inclusion in Mahout? The current DSL cannot be extended? Or it can’t be extended with engine specific ways? Or it can’t be extended with Spark Streaming? I would have thought all of these things desirable otherwise we are limiting ourselves to a subset of what an engine can do or a subset of problems that the current DSL supports.
>
> I hope I’m misreading this but it looks like we just discourage a contributor from adding post hadoop code in an interesting area to Mahout?
>

Re: Engine specific algos

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

I think contributions of (a) and (b) type are expected by contributions of
type (c) (i.e. nothing common parts with Mahout) has to be considered
specifically. Some contributions may not currently have common parts with
Mahout but may become having ones in the near future, i'd say we try to
accomodate it. However, in this case as it stands this algorithm seem to
have no common parts with what we do.

what we do right now is linear algebra abstractions; maybe some frames
later and i have also some ideas of probabilistic inference abstractions in
mind for the future.

However we don't plan to have any streaming abstractions and likely will
not have. I don't see anything wrong with it going to MLLib, this will make
Spark-based ML even more attractive. As long as Andy makes his work
available to the common good, I can't care less which Spark-friendly
package he places it into.

I also don't see the task of algorithm richness as the goal in itself. The
primary goal of our work is to make quick prototyping possible in algebraic
and probabilistic fitting, and possibly feature preps. This IMO is more
important than being a rigid collection of things. Again, i can only send
one to Julia's blog as the manifesto of this philosophy.

Yes, we want to be practically useful with some examples of end2end
pipelines, and for that we probably must package some common approaches;
but in the end, in production i am likely end up not using their exact
version but rather a customized one in some way, just like i don't end up
using exact mllib versions of algorithms.

Speaking of something tangible, i'd rather see our feature prep pipeline
standardized and abstracted from engines rather than acquire more methods
right now; that would validate a lot of what we do. If in a few months we
were able to put end2end demo starting with feature encoding, that'd be a
big deal to me.

On Wed, Jun 18, 2014 at 1:29 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random
> forests
>
> > Also, we don't have any mappings for Spark Streaming -- so if your
> > implementation heavily relies on Spark streaming, i think Spark itself is
> > the right place for it to be a part of.
>
> We are discouraging engine specific work? Even dismissing Spark Streaming
> as a whole?
>
> > As it stands we don't have purely (c) methods and indeed i believe these
> > methods may be totally engine-specific in which case mllib is one of
> > possibly good homes for them.
>
> Adherence to a specific incarnation of an engine-neutral DSL has become a
> requirement for inclusion in Mahout? The current DSL cannot be extended? Or
> it can’t be extended with engine specific ways? Or it can’t be extended
> with Spark Streaming? I would have thought all of these things desirable
> otherwise we are limiting ourselves to a subset of what an engine can do or
> a subset of problems that the current DSL supports.
>
> I hope I’m misreading this but it looks like we just discourage a
> contributor from adding post hadoop code in an interesting area to Mahout?
>
>