Posted to user@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2017/02/01 00:06:36 UTC

Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

On Tue, Jan 31, 2017 at 3:01 AM, Isabel Drost-Fromm <is...@apache.org>
wrote:

>
> Hi,
>
>
> To give some advice to downstream users in the field - what would be your
> advice
> for people tasked with concrete use cases (stuff like fraud detection,
> anomaly
> detection, learning search ranking functions, building a recommender
> system)?


If you are an off-the-shelf practitioner (as at most smaller startup companies
without a chief scientist), with very few exceptions you might want to look
for an off-the-shelf solution where it exists, and most likely it does not
exist on Samsara in the open domain. Except for several applied off-the-shelf
tools, Mahout has not (hopefully just yet) developed a comprehensive set of
things to use.

The off-the-shelf tools currently are cross-occurrence recommendations (which
still require a real-time serving component taken from elsewhere), SVD/PCA,
some algebra, and naive/complement Bayes at scale.
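To make the cross-occurrence item concrete, here is a deliberately tiny, single-machine sketch of plain co-occurrence scoring in Python. This is illustrative only: Mahout's actual implementation is distributed, written in Scala, and (as I understand it) filters co-occurrences with a log-likelihood-ratio test, none of which this toy does.

```python
from collections import defaultdict

def cooccurrence(interactions):
    """Count, for each item pair, how many users interacted with both.
    interactions: iterable of (user, item) pairs."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)
    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a in items:
            for b in items:
                if a != b:
                    counts[a][b] += 1
    return counts

def recommend(counts, history, k=3):
    """Score unseen items by summed co-occurrence with the user's history."""
    scores = defaultdict(int)
    for seen in history:
        for other, c in counts[seen].items():
            if other not in history:
                scores[other] += c
    return sorted(scores, key=scores.get, reverse=True)[:k]

pairs = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
         ("u2", "c"), ("u3", "b"), ("u3", "c")]
counts = cooccurrence(pairs)
```

As noted above, a real deployment would still need a separate real-time serving component to answer queries.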

Most of the bigger companies I worked for never deploy completely off-the-shelf
open source solutions. Solving their problem always requires more understanding
of it. (E.g., much as the COO recommender is wonderful, I don't think Netflix
would entertain running Mahout's COO on its data verbatim.)

It is quite common that companies invest in their own specific understanding of
their problem and requirements, and in a specific solution to their problem,
through iterative experimentation with different methodologies, most of which
are either new-ish enough or proprietary enough that no public solution exists.

That latter case was pretty much the motivation for Samsara. If you are a
practitioner solving numerical problems through an experimentation cycle,
Mahout is much more useful than any of the off-the-shelf collections.

So the idea, first, is to get an R-like platform out for the practitioners, and
grow packages on top of it (just like with R). The platform obviously needs
work, which unfortunately is not sufficiently sponsored at the moment, IMO, by
industry or academia compared to other projects.

> Is there even interest from users in such a use case based

> perspective? If so, would there be interest among the Mahout committers to
> help
> users publicly create docs/examples/modules to support these use cases?
>

yes


>
>
> Isabel
>
>

Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by "isabel@apache.org" <is...@apache.org>.
Hi,

On Wed, Feb 01, 2017 at 08:29:49PM +0000, Andrew Palumbo wrote:
> I think that https://issues.apache.org/jira/browse/MAHOUT-1856, a solid
> framework for new algorithms, will go a long way toward helping new users
> understand how easy it is to add algorithms.  There has been significant work
> on this issue already merged to master, with a fine OLS example including
> statistical tests for autocorrelation and heteroskedasticity.  Trevor G. has
> been heading up the framework effort, which is still in development and will
> continue to be throughout the 0.13.x releases (and hopefully be added to in
> 0.14.x as well).

No need to be sorry here; from a first glance at the issue, it does look useful.


> 
> I believe that having the framework in place will both make Mahout more
> intuitive for new users and developers writing algorithms and pipelines,
> and provide a set of canned algorithms to those who are looking for
> something off-the-shelf.

Are there things within or on top of that framework that interested users could
help out with?


> Just wanted to get that into the conversation.

Thank you for that.


Isabel


Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by Andrew Palumbo <ap...@outlook.com>.


________________________________
From: Isabel Drost <is...@apache.org>
Sent: Wednesday, February 1, 2017 4:55 AM
To: Dmitriy Lyubimov
Cc: user@mahout.apache.org
Subject: Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration



On Tue, Jan 31, 2017 at 04:06:36PM -0800, Dmitriy Lyubimov wrote:
> Except for several applied
> off-the-shelf tools, Mahout has not (hopefully just yet) developed a
> comprehensive set of things to use.

Do you think there would be value in having that? Funding aside, would now be a
good time to develop that or do you think Samsara needs more work before
starting to work on that?

If there's value/good timing: Do you think it would be possible to mentor
downstream users to help get this done? And a question to those still reading
this list: Would you be interested and able (time-wise) to help out here?


I'm sorry to cut in on the conversation here, but I wanted people to be aware of the algorithm framework effort that is currently underway.

I think that https://issues.apache.org/jira/browse/MAHOUT-1856, a solid framework for new algorithms, will go a long way toward helping new users understand how easy it is to add algorithms.  There has been significant work on this issue already merged to master, with a fine OLS example including statistical tests for autocorrelation and heteroskedasticity.  Trevor G. has been heading up the framework effort, which is still in development and will continue to be throughout the 0.13.x releases (and hopefully be added to in 0.14.x as well).
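For readers curious what "OLS with a statistical test for autocorrelation" looks like in miniature, here is a toy single-machine sketch in Python. The actual MAHOUT-1856 code is Scala and runs on distributed matrices; the function names here are hypothetical, and only a Durbin-Watson autocorrelation check is shown, not a heteroskedasticity test.

```python
def ols_fit(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def durbin_watson(residuals):
    """Durbin-Watson statistic over ordered residuals; always in [0, 4]."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    return num / sum(e * e for e in residuals)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_fit(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
dw = durbin_watson(resid)
```

A Durbin-Watson statistic near 2 suggests no first-order autocorrelation in the residuals; values near 0 or 4 suggest positive or negative autocorrelation, respectively.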

I believe that having the framework in place will both make Mahout more intuitive for new users and developers writing algorithms and pipelines, and provide a set of canned algorithms to those who are looking for something off-the-shelf.

Just wanted to get that into the conversation.


> The off-the-shelf tools currently are cross-occurrence recommendations (which
> still require a real-time serving component taken from elsewhere), SVD/PCA,
> some algebra, and naive/complement Bayes at scale.
>
> Most of the bigger companies I worked for never deploy completely
> off-the-shelf open source solutions. Solving their problem always requires
> more understanding of it. (E.g., much as the COO recommender is wonderful, I
> don't think Netflix would entertain running Mahout's COO on its data verbatim.)

Makes total sense to me. Would it be possible to build a base system that
performs OK and can be extended such that it performs fantastically with a bit
of extra secret sauce?



> It is quite common that companies invest in their own specific understanding
> of their problem and requirements, and in a specific solution to their
> problem, through iterative experimentation with different methodologies, most
> of which are either new-ish enough or proprietary enough that no public
> solution exists.

While that does make a lot of sense, what I'm asking myself over and over is
this: Back when I was more active on this list, there was a pattern in the
questions being asked. Often people were looking for recommenders, fraud
detection, event detection. Is there still such a pattern? If so, it would be
interesting to think about which of those problems are widespread enough that
offering a standard package, integrated from data ingestion to prediction,
would make sense.


> That latter case was pretty much the motivation for Samsara. If you are a
> practitioner solving numerical problems through an experimentation cycle,
> Mahout is much more useful than any of the off-the-shelf collections.

+1 This is also why I think focusing on Samsara and focusing on making that
stable and scalable makes a lot of sense.

The reason why I dug out this old thread comes from a slightly different angle:
We seem to have a solid base. But it's only really useful for a limited set of
experts. It will be hard to draw new contributors and committers from that set
of users (it will IMHO even be hard to find many users who are that skilled).
What I'm asking myself is whether we should, and can, do something to make
Mahout useful for those who don't have that background.


> > perspective? If so, would there be interest among the Mahout committers to
> > help
> > users publicly create docs/examples/modules to support these use cases?
> >
>
> yes

Where do we start? ;)


Isabel



Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by Trevor Grant <tr...@gmail.com>.
Answers inline below.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Tue, Feb 7, 2017 at 2:31 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:

> @Trevor Grant
>
> The landscape in machine learning is getting more and more diluted with
> lots of tools. Here's a question: given that some folks are taking R and
> connecting it to Spark and MapReduce to make the R algorithms work at
> scale (https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler), what
> would be the additional value added in porting the R code using the
> algorithms/Samsara framework? To me the MRS efforts and the approach you
> are proposing are two parallel tracks.


Correct: one is a commercial product by Microsoft; the other is a
business-friendly, open source Apache Software Foundation project.


> As far as the barriers to entry to contributing, I think it's largely due to
> the complexity of the codebase and the lack of familiarity with Samsara.

This is what we hope to overcome with the algorithms framework and perhaps
more documentation.

> I'd love to help create some good docs/tutorials on both the algorithms
> framework and Samsara when and where it makes sense.

Would love the help; it will be easier once we get migrated to Jekyll. (More
motivation to do this.)


> However, I feel like it'd be useful to really identify the use cases where
> using the algorithms/Samsara approach has clear wins versus MRS

When you don't want to pay Microsoft to use your work in production.


> with spark or spark by itself or python/scikit-learn,

Out of scope for the Mahout project, but I do have a talk forthcoming that
will address this - stay tuned.


> I've found that in general people don't really need custom algorithms in
> data science; they typically are answering some very basic classification
> or clustering question and can use linear/logistic regression or a variant
> of k-means.

That has not been my experience.  In fact, quite the opposite: most people
need more depth to their algorithms, and many other big-data ML packages
imply they have more depth than basic linear/logistic regression plus k-means
when in fact that is all there is.  Not to say one is right or wrong; the
data scientists who are happy with simple tools can find them in
SparkML/FlinkML, while those who need more advanced tools may turn to Mahout.


> I'd also like to help dig into some use cases with Samsara and put those
> use cases maybe in the examples section.
>
 Tutorials would be great; q.e.d., more documentation would be helpful.


>
> Thoughts?
>
>
>
>
>
> ________________________________
> From: Trevor Grant <tr...@gmail.com>
> Sent: Tuesday, February 7, 2017 8:47 AM
> To: user@mahout.apache.org; isabel@apache.org
> Subject: Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration
>
> The idea that Andy briefly touched on is that the Algorithm Framework
> (hopefully) paves the way for R/CRAN-like user contribution.
>
> Increased contribution was a goal I had certainly hoped for.  I have begun
> promoting the idea at Meetups.  There hasn't been a concerted effort to
> push the idea, however it is a tagline / call to action I am planning on
> pushing at talks and conferences this spring. Thank you for raising the
> issue on the mailing list as well.
>
> Using the Samsara framework and "Algorithms" framework, it is hoped that the
> barrier to entry for new contributors will be very low, and that they can
> introduce new algorithms or port them from R. Other 'Big Data' machine
> learning frameworks suffer because they are not easily extensible.
>
> The algorithms framework makes it (more) clear where a new algorithm would
> go, and in general how it should behave (e.g., this is a regressor, so it
> probably goes in the regressor package; it needs a fit method that takes a
> DrmX and a DrmY, and a predict method that takes a DrmX and returns a
> DrmY_hat).  The algorithms framework also provides a consistent interface
> across algorithms and puts up "guard rails" to ensure common things are
> done in an efficient manner (e.g., serializing just the model, not the
> fitter and additional unneeded things; thank you, Dmitriy).  The Samsara
> framework makes it easy to 'read' what the person is doing. This makes it
> easier to review PRs and encourages community review, and if (hopefully not,
> but in case it does happen) someone makes a so-called 'drive-by commit',
> that is, commits an algorithm and is never heard from again, others can
> easily understand and maintain the algorithm in the person's absence.
>
> There are a number of issues labeled as beginner in JIRA now, especially
> with respect to the Algorithms package.
>
> It would probably be good to include a lot of this information in a web
> page either here https://mahout.apache.org/developers/how-to-contribute.html
>
> or on a page that is linked to by that.
>
> Which leads me into the last 'piece of the puzzle' I would like to have in
> place before aggressively advertising this as a "new-contributor friendly"
> project: migrating CMS to Jekyll
> https://issues.apache.org/jira/browse/MAHOUT-1933
>
> The rationale for that is that when new algorithms are submitted, the PR will
> include relevant documentation (as a convention), and that documentation can
> be corrected/expanded as needed in a more non-committer-friendly manner.
>
>
>
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Tue, Feb 7, 2017 at 4:30 AM, Isabel Drost <is...@apache.org> wrote:
>
> > On Wed, Feb 01, 2017 at 03:32:24PM -0800, Dmitriy Lyubimov wrote:
> > > Isabel, if I understand it correctly, you are asking whether it makes
> > > sense to add end-to-end scenarios based on Samsara to the current
> > > codebase?
> >
> > Sorry for being fuzzy. The meta question that I'm trying to find an answer
> > for is whether there's something that can or should be done to increase the
> > number of people who could potentially be assimilated and turned into
> > committers one day. One specific idea I had in mind was to make the project
> > easier to use for beginners, and one way to accomplish that was to focus on
> > end-to-end implementations of popular use cases. (Sorry, fairly meta...)
> >
> >
> > > The answer is: absolutely. Yes, it does, both for rather isolated issues
> > > (like computing clusters) and for end-to-end scenarios.
> > >
> > > The only problem with end-to-end scenarios is that they are often
> > > difficult to demonstrate with a batch-oriented computational system only.
> > > That's what prediction.io kind of picked up on with COO: they included
> > > all of data ingestion, computation, and real-time scoring queries.
> > >
> > > But yes, there is absolutely tons of value in that. Not everything fits
> > > quite nicely, and not everything fits end-to-end (just like with R), but
> > > some fairly significant pieces do fit to be written on top.
> >
> > Makes sense.
> >
> >
> > > > Where do we start? ;)
> > > >
> > >
> > > I would start with figuring out a problem I want to solve AND I have a
> > > budget to do it AND I can legally contribute on behalf of the IP owner.
> >
> > I guess, given the meta explanation above: if an increase in contributions
> > were a goal, one could also think about making potential areas of
> > contribution explicit and highlighting the value the project brings
> > compared to other systems, with a specific focus on Samsara. That's
> > another angle of me asking weird questions here.
> >
> >
> > > Then we can think of whether it is a good fit (Samsara is mostly limited
> > > to tensor-based data only, just like the MapReduce DRM was/is). Some
> > > things may not have a convenient algebraic formulation.
> >
> > +1
> >
> > Isabel
> >
> > --
> > Sorry for any typos: Mail was typed in vim, written in mutt, via ssh
> (most
> > likely involving some kind of mobile connection only.)
> >
>

Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by Saikat Kanjilal <sx...@hotmail.com>.
@Trevor Grant

The landscape in machine learning is getting more and more diluted with lots of tools. Here's a question: given that some folks are taking R and connecting it to Spark and MapReduce to make the R algorithms work at scale (https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler), what would be the additional value added in porting the R code using the algorithms/Samsara framework? To me the MRS efforts and the approach you are proposing are two parallel tracks.

As far as the barriers to entry to contributing, I think it's largely due to the complexity of the codebase and the lack of familiarity with Samsara. I'd love to help create some good docs/tutorials on both the algorithms framework and Samsara when and where it makes sense. However, I feel like it'd be useful to really identify the use cases where using the algorithms/Samsara approach has clear wins versus MRS with Spark, or Spark by itself, or Python/scikit-learn. I've found that in general people don't really need custom algorithms in data science; they typically are answering some very basic classification or clustering question and can use linear/logistic regression or a variant of k-means. I'd also like to help dig into some use cases with Samsara and put those use cases maybe in the examples section.


Thoughts?





________________________________
From: Trevor Grant <tr...@gmail.com>
Sent: Tuesday, February 7, 2017 8:47 AM
To: user@mahout.apache.org; isabel@apache.org
Subject: Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

The idea that Andy briefly touched on is that the Algorithm Framework
(hopefully) paves the way for R/CRAN-like user contribution.

Increased contribution was a goal I had certainly hoped for.  I have begun
promoting the idea at Meetups.  There hasn't been a concerted effort to
push the idea, however it is a tagline / call to action I am planning on
pushing at talks and conferences this spring. Thank you for raising the
issue on the mailing list as well.

Using the Samsara framework and "Algorithms" framework, it is hoped that the
barrier to entry for new contributors will be very low, and that they can
introduce new algorithms or port them from R. Other 'Big Data' machine
learning frameworks suffer because they are not easily extensible.

The algorithms framework makes it (more) clear where a new algorithm would
go, and in general how it should behave (e.g., this is a regressor, so it
probably goes in the regressor package; it needs a fit method that takes a
DrmX and a DrmY, and a predict method that takes a DrmX and returns a
DrmY_hat).  The algorithms framework also provides a consistent interface
across algorithms and puts up "guard rails" to ensure common things are
done in an efficient manner (e.g., serializing just the model, not the
fitter and additional unneeded things; thank you, Dmitriy).  The Samsara
framework makes it easy to 'read' what the person is doing. This makes it
easier to review PRs and encourages community review, and if (hopefully not,
but in case it does happen) someone makes a so-called 'drive-by commit',
that is, commits an algorithm and is never heard from again, others can
easily understand and maintain the algorithm in the person's absence.

There are a number of issues labeled as beginner in JIRA now, especially
with respect to the Algorithms package.

It would probably be good to include a lot of this information in a web
page either here https://mahout.apache.org/developers/how-to-contribute.html



or on a page that is linked to by that.

Which leads me into the last 'piece of the puzzle' I would like to have in
place before aggressively advertising this as a "new-contributor friendly"
project: migrating CMS to Jekyll
https://issues.apache.org/jira/browse/MAHOUT-1933

The rationale for that is that when new algorithms are submitted, the PR will
include relevant documentation (as a convention), and that documentation can
be corrected/expanded as needed in a more non-committer-friendly manner.






Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org




*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Tue, Feb 7, 2017 at 4:30 AM, Isabel Drost <is...@apache.org> wrote:

> On Wed, Feb 01, 2017 at 03:32:24PM -0800, Dmitriy Lyubimov wrote:
> > Isabel, if I understand it correctly, you are asking whether it makes
> > sense to add end-to-end scenarios based on Samsara to the current codebase?
>
> Sorry for being fuzzy. The meta question that I'm trying to find an answer
> for is whether there's something that can or should be done to increase the
> number of people who could potentially be assimilated and turned into
> committers one day. One specific idea I had in mind was to make the project
> easier to use for beginners, and one way to accomplish that was to focus on
> end-to-end implementations of popular use cases. (Sorry, fairly meta...)
>
>
> > The answer is: absolutely. Yes, it does, both for rather isolated issues
> > (like computing clusters) and for end-to-end scenarios.
> >
> > The only problem with end-to-end scenarios is that they are often difficult
> > to demonstrate with a batch-oriented computational system only. That's what
> > prediction.io kind of picked up on with COO: they included all of data
> > ingestion, computation, and real-time scoring queries.
> >
> > But yes, there is absolutely tons of value in that. Not everything fits
> > quite nicely, and not everything fits end-to-end (just like with R), but
> > some fairly significant pieces do fit to be written on top.
>
> Makes sense.
>
>
> > > Where do we start? ;)
> > >
> >
> > I would start with figuring out a problem I want to solve AND I have a
> > budget to do it AND I can legally contribute on behalf of the IP owner.
>
> I guess, given the meta explanation above: if an increase in contributions
> were a goal, one could also think about making potential areas of
> contribution explicit and highlighting the value the project brings compared
> to other systems, with a specific focus on Samsara. That's another angle of
> me asking weird questions here.
>
>
> > Then we can think of whether it is a good fit (Samsara is mostly limited
> > to tensor-based data only, just like the MapReduce DRM was/is). Some things
> > may not have a convenient algebraic formulation.
>
> +1
>
> Isabel
>
> --
> Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most
> likely involving some kind of mobile connection only.)
>

Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by Trevor Grant <tr...@gmail.com>.
The idea that Andy briefly touched on is that the Algorithm Framework
(hopefully) paves the way for R/CRAN-like user contribution.

Increased contribution was a goal I had certainly hoped for.  I have begun
promoting the idea at Meetups.  There hasn't been a concerted effort to
push the idea, however it is a tagline / call to action I am planning on
pushing at talks and conferences this spring. Thank you for raising the
issue on the mailing list as well.

Using the Samsara framework and "Algorithms" framework, it is hoped that the
barrier to entry for new contributors will be very low, and that they can
introduce new algorithms or port them from R. Other 'Big Data' machine
learning frameworks suffer because they are not easily extensible.

The algorithms framework makes it (more) clear where a new algorithm would
go, and in general how it should behave (e.g., this is a regressor, so it
probably goes in the regressor package; it needs a fit method that takes a
DrmX and a DrmY, and a predict method that takes a DrmX and returns a
DrmY_hat).  The algorithms framework also provides a consistent interface
across algorithms and puts up "guard rails" to ensure common things are
done in an efficient manner (e.g., serializing just the model, not the
fitter and additional unneeded things; thank you, Dmitriy).  The Samsara
framework makes it easy to 'read' what the person is doing. This makes it
easier to review PRs and encourages community review, and if (hopefully not,
but in case it does happen) someone makes a so-called 'drive-by commit',
that is, commits an algorithm and is never heard from again, others can
easily understand and maintain the algorithm in the person's absence.
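The fit/predict contract described above can be sketched in a few lines. This is a transliteration to single-machine Python, not the real framework (which is Scala and operates on distributed row matrices); the class and method names here are hypothetical.

```python
class Regressor:
    """Base contract: fit(X, y) learns parameters; predict(X) returns y_hat."""
    def fit(self, X, y):
        raise NotImplementedError
    def predict(self, X):
        raise NotImplementedError

class MeanRegressor(Regressor):
    """Trivial example: predicts the training mean for every input row."""
    def fit(self, X, y):
        # The fitted model keeps only its learned parameter, mirroring the
        # guard-rail of serializing just the model rather than the fitter.
        self.mean_ = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean_ for _ in X]

model = MeanRegressor().fit([[1], [2], [3]], [10.0, 20.0, 30.0])
y_hat = model.predict([[4], [5]])
```

The serialization point is visible in MeanRegressor: after fit, the model carries only mean_, so persisting the model does not drag the fitter's working state along with it.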

There are a number of issues labeled as beginner in JIRA now, especially
with respect to the Algorithms package.

It would probably be good to include a lot of this information in a web
page either here https://mahout.apache.org/developers/how-to-contribute.html
or on a page that is linked to by that.

Which leads me into the last 'piece of the puzzle' I would like to have in
place before aggressively advertising this as a "new-contributor friendly"
project: migrating CMS to Jekyll
https://issues.apache.org/jira/browse/MAHOUT-1933

The rationale for that is that when new algorithms are submitted, the PR will
include relevant documentation (as a convention), and that documentation can
be corrected/expanded as needed in a more non-committer-friendly manner.






Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Tue, Feb 7, 2017 at 4:30 AM, Isabel Drost <is...@apache.org> wrote:

> On Wed, Feb 01, 2017 at 03:32:24PM -0800, Dmitriy Lyubimov wrote:
> > Isabel, if I understand it correctly, you are asking whether it makes
> > sense to add end-to-end scenarios based on Samsara to the current codebase?
>
> Sorry for being fuzzy. The meta question that I'm trying to find an answer
> for is whether there's something that can or should be done to increase the
> number of people who could potentially be assimilated and turned into
> committers one day. One specific idea I had in mind was to make the project
> easier to use for beginners, and one way to accomplish that was to focus on
> end-to-end implementations of popular use cases. (Sorry, fairly meta...)
>
>
> > The answer is: absolutely. Yes, it does, both for rather isolated issues
> > (like computing clusters) and for end-to-end scenarios.
> >
> > The only problem with end-to-end scenarios is that they are often difficult
> > to demonstrate with a batch-oriented computational system only. That's what
> > prediction.io kind of picked up on with COO: they included all of data
> > ingestion, computation, and real-time scoring queries.
> >
> > But yes, there is absolutely tons of value in that. Not everything fits
> > quite nicely, and not everything fits end-to-end (just like with R), but
> > some fairly significant pieces do fit to be written on top.
>
> Makes sense.
>
>
> > > Where do we start? ;)
> > >
> >
> > I would start with figuring out a problem I want to solve AND I have a
> > budget to do it AND I can legally contribute on behalf of the IP owner.
>
> I guess, given the meta explanation above: if an increase in contributions
> were a goal, one could also think about making potential areas of
> contribution explicit and highlighting the value the project brings compared
> to other systems, with a specific focus on Samsara. That's another angle of
> me asking weird questions here.
>
>
> > Then we can think of whether it is a good fit (Samsara is mostly limited
> > to tensor-based data only, just like the MapReduce DRM was/is). Some things
> > may not have a convenient algebraic formulation.
>
> +1
>
> Isabel
>
> --
> Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most
> likely involving some kind of mobile connection only.)
>

Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by Isabel Drost <is...@apache.org>.
On Wed, Feb 01, 2017 at 03:32:24PM -0800, Dmitriy Lyubimov wrote:
> Isabel, if I understand it correctly, you are asking whether it makes sense
> to add end-to-end scenarios based on Samsara to the current codebase?

Sorry for being fuzzy. The meta question that I'm trying to find an answer for
is whether there's something that can or should be done to increase the number
of people who could potentially be assimilated and turned into committers one
day. One specific idea I had in mind was to make the project easier to use for
beginners, and one way to accomplish that was to focus on end-to-end
implementations of popular use cases. (Sorry, fairly meta...)


> The answer is: absolutely. Yes, it does, both for rather isolated issues
> (like computing clusters) and for end-to-end scenarios.
> 
> The only problem with end-to-end scenarios is that they are often difficult
> to demonstrate with a batch-oriented computational system only. That's what
> prediction.io kind of picked up on with COO: they included all of data
> ingestion, computation, and real-time scoring queries.
> 
> But yes, there is absolutely tons of value in that. Not everything fits
> quite nicely, and not everything fits end-to-end (just like with R), but
> some fairly significant pieces do fit to be written on top.

Makes sense.


> > Where do we start? ;)
> >
> 
> I would start with figuring out a problem I want to solve AND I have a budget
> to do it AND I can legally contribute on behalf of the IP owner.

I guess, given the meta explanation above: if an increase in contributions were a
goal, one could also think about making potential areas of contribution explicit
and highlighting the value the project brings compared to other systems, with a
specific focus on Samsara. That's another angle of me asking weird questions
here.


> Then we can think of whether it is a good fit (Samsara is mostly limited to
> tensor-based data only, just like the MapReduce DRM was/is). Some things may
> not have a convenient algebraic formulation.

+1

Isabel

-- 
Sorry for any typos: Mail was typed in vim, written in mutt, via ssh (most likely involving some kind of mobile connection only.)

Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Isabel, if I understand it correctly, you are asking whether it makes sense
to add end-to-end scenarios based on Samsara to the current codebase?

The answer is: absolutely. Yes, it does, both for rather isolated issues
(like computing clusters) and for end-to-end scenarios.

The only problem with end-to-end scenarios is that they are often difficult
to demonstrate with a batch-oriented computational system only. That's what
prediction.io kind of picked up on with COO; they included all of data
ingestion, computation, and real-time scoring queries.

But yes, there's, absolutely, tons of value in that. Not everything fits
quite nicely, and not everything fits end-2-end (just like with R), but
some fairly significant pieces do fit to be written on top.


>
> > > perspective? If so, would there be interest among the Mahout
> committers to
> > > help
> > > users publicly create docs/examples/modules to support these use cases?
> > >
> >
> > yes
>
> Where do we start? ;)
>

I would start with figuring out a problem I want to solve AND I have a budget
to do it AND I can legally contribute on behalf of the IP owner.

Then we can think of whether it is a good fit (Samsara is mostly limited to
tensor-based data only, just like the MapReduce DRM was/is). Some things may
not have a convenient algebraic formulation.
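To make concrete what a "convenient algebraic formulation" looks like, here is a minimal sketch of ridge regression via the normal equations. This is plain NumPy for illustration, not Mahout code; on Samsara, the two big products X'X and X'y would be distributed DRM expressions and only the small k-by-k solve would run in core:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Solve (X'X + lam*I) beta = X'y -- the normal equations for ridge."""
    # The two products below are the expensive, data-parallel part; in an
    # algebraic DSL these are the pieces the engine distributes for you.
    XtX = X.T @ X   # k x k
    Xty = X.T @ y   # k
    # The solve itself is tiny (k x k) and runs in core.
    return np.linalg.solve(XtX + lam * np.eye(X.shape[1]), Xty)
```

A problem fits the platform when, as here, it reduces to a handful of matrix expressions; when it does not, an algebraic DSL has little leverage.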

-d

Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

Posted by Isabel Drost <is...@apache.org>.
On Tue, Jan 31, 2017 at 04:06:36PM -0800, Dmitriy Lyubimov wrote:
> Except for several applied
> off-the-shelf pieces, Mahout has not (hopefully just yet) developed a
> comprehensive set of things to use.

Do you think there would be value in having that? Funding aside, would now be a
good time to develop that or do you think Samsara needs more work before
starting to work on that?

If there's value/good timing: Do you think it would be possible to mentor
downstream users to help get this done? And a question to those still reading
this list: Would you be interested and able (time-wise) to help out here?


> The off-the-shelves currently are cross-occurrence recommendations (which
> still require real time serving component taken from elsewhere), svd-pca,
> some algebra, and Naive/complement Bayes at scale.
> 
> Most of the bigger companies i worked for never deal with completely the
> off-the-shelf open source solutions. It always requires more understanding
> of their problem. (E.g., much as COO recommender is wonderful, i don't
> think Netflix would entertain taking Mahout's COO run on it verbatim).

Makes total sense to me. Would it be possible to build a base system that performs
OK and can be extended such that it performs fantastically with a bit of extra
secret sauce?
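For reference, the scoring at the heart of the cross-occurrence recommender mentioned above is Dunning's log-likelihood ratio over a 2x2 co-occurrence contingency table. A standalone Python sketch of that formula (mirroring the entropy-based formulation used in Mahout's LogLikelihood class, but not the actual Mahout code):

```python
from math import log

def _x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def _entropy(*counts):
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = both events co-occur, k12/k21 = one event only, k22 = neither."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    if row + col < mat:  # guard against tiny negative floating-point error
        return 0.0
    return 2.0 * (row + col - mat)
```

Independent co-occurrence (e.g. `llr(5, 5, 5, 5)`) scores near zero, while strongly associated counts score high; thresholding these scores yields the sparse indicator matrix that the real-time serving component then queries.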


> It is quite common that companies invest in their own specific
> understanding of their problem and requirements and a specific solution to
> their problem through iterative experimentation with different
> methodologies, most of which are either new-ish enough or proprietary
> enough that public solution does not exist.

While that does make a lot of sense, what I'm asking myself over and over is
this: Back when I was more active on this list there was a pattern in the
questions being asked. Often people were looking for recommenders, fraud
detection, event detection. Is there still such a pattern? If so, it would be
interesting to think about which of those problems are widespread enough that
offering a standard package, integrated from data ingestion to prediction, would
make sense.


> That latter case was pretty much motivation for Samsara. If you are a
> practitioner solving numerical problems thru experimentation cycle, Mahout
> is much more useful than any of the off-the-shelf collections.

+1 This is also why I think focusing on Samsara, and on making it stable and
scalable, makes a lot of sense.
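As a toy illustration of that experimentation style: even something like the top principal direction reduces to a loop over a couple of algebraic primitives. A NumPy sketch of power iteration on A'A (illustration only; on Samsara, A would be a DRM and the products distributed expressions):

```python
import numpy as np

def top_right_singular_vector(A, iters=200, seed=0):
    """Power iteration on A'A: two mat-vec products and a normalization
    per step -- exactly the primitives an algebraic DSL distributes."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    for _ in range(iters):
        v = A.T @ (A @ v)     # one step of power iteration on A'A
        v /= np.linalg.norm(v)
    return v
```

Swapping in a different stopping rule, a deflation step, or a preprocessing transform is a few-line change, which is what the iterative experimentation cycle described above looks like in practice.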

The reason why I dug out this old thread comes from a slightly different angle:
We seem to have a solid base. But it's only really useful for a limited set of
experts. It will be hard to draw new contributors and committers from that set
of users (it will IMHO even be hard to find many users who are that skilled).
What I'm asking myself is whether we should and can do something to make Mahout
useful for those who don't have that background.



> > perspective? If so, would there be interest among the Mahout committers to
> > help
> > users publicly create docs/examples/modules to support these use cases?
> >
> 
> yes

Where do we start? ;)


Isabel