Posted to dev@spark.apache.org by Joseph Bradley <jo...@databricks.com> on 2017/01/24 01:03:41 UTC

MLlib mission and goals

This thread is split off from the "Feedback on MLlib roadmap process
proposal" thread for discussing the high-level mission and goals for
MLlib.  I hope this thread will collect feedback and ideas, not necessarily
lead to huge decisions.

Copying from the previous thread:

*Seth:*
"""
I would love to hear some discussion on the higher level goal of Spark
MLlib (if this derails the original discussion, please let me know and we
can discuss in another thread). The roadmap does contain specific items
that help to convey some of this (ML parity with MLlib, model persistence,
etc...), but I'm interested in what the "mission" of Spark MLlib is. We
often see PRs for brand new algorithms which are sometimes rejected and
sometimes not. Do we aim to keep implementing more and more algorithms? Or
is our focus really, now that we have a reasonable library of algorithms,
to simply make the existing ones faster/better/more robust? Should we aim
to make interfaces that are easily extended for developers to easily
implement their own custom code (e.g. custom optimization libraries), or do
we want to restrict things to out-of-the-box algorithms? Should we focus on
more flexible, general abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this
discussion may have happened, but I think it would be useful to either
revisit it or restate it here for some of the newer developers.
"""

*Mingjie:*
"""
+1 general abstractions like distributed linear algebra.
"""


I'll add my thoughts, starting with our past *trajectory*:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and
making the library more robust.

I agree with Seth that a few *immediate goals* are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity

*In the future*, it's harder to say, but if I had to pick my top 2 items,
I'd list:

*(1) Making MLlib more extensible*
It will not be feasible to support a huge number of algorithms, so allowing
users to customize their ML on Spark workflows will be critical.  This is
IMO the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and
we will need to make it easier for users to write their own algorithms and
packages to facilitate this.  Part of this could be allowing users to
customize existing algorithms with custom loss functions, etc.
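
To make (1) concrete, here is a minimal sketch of a user-defined Pipeline
stage built on the existing public UnaryTransformer API (the LogScaler name
and the log1p transform are purely illustrative):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, DoubleType}

// A user-defined stage that drops into a Pipeline next to built-in stages.
class LogScaler(override val uid: String)
    extends UnaryTransformer[Double, Double, LogScaler] {
  def this() = this(Identifiable.randomUID("logScaler"))
  // Per-element transformation applied to the input column.
  override protected def createTransformFunc: Double => Double = math.log1p
  override protected def outputDataType: DataType = DoubleType
}

Lowering the friction for writing, testing, and packaging this kind of code is
a big part of what I mean by extensibility.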

*(2) Consistent improvements to core algorithms*
A less exciting but still very important item will be constantly improving
the core set of algorithms in MLlib. This could mean speed, scaling,
robustness, and usability for the few algorithms which cover 90% of use
cases.

There are plenty of other possibilities, and it will be great to hear the
community's thoughts!

Thanks,
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>

Re: MLlib mission and goals

Posted by Seth Hendrickson <se...@gmail.com>.
I agree with what Sean said about not supporting arbitrarily many
algorithms. I think the goal of MLlib should be to support only core
algorithms for machine learning. Ideally Spark ML provides a relatively
small set of algorithms that are heavily optimized, and also provides a
framework that makes it easy for users to extend and build their own
packages and algos when they need to. Spark ML is already quite good for
this. We have of course been doing a lot of work migrating to this new API,
and now that we are approaching full parity, it would be good to shift the
focus to performance as others have noted. Supporting a few algorithms that
perform very well is significantly better than supporting many algorithms
with moderate performance, IMO.

I also think a more complete, optimized distributed linear algebra library
would be a great asset, but it may be a longer-term goal. A performance
framework for regression testing would be great, but keeping it up to date
is difficult.
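
For reference, the mllib package already exposes a piece of this today; a
rough sketch of a block-partitioned distributed multiply, assuming an
existing SparkContext sc:

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Build a small distributed matrix from (row, col, value) entries.
val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 1.0), MatrixEntry(1L, 1L, 2.0), MatrixEntry(2L, 0L, 3.0)))
val a: BlockMatrix = new CoordinateMatrix(entries).toBlockMatrix(2, 2).cache()
// Distributed Gram matrix A^T * A computed via block-wise multiplies.
val gram = a.transpose.multiply(a)

Completing and optimizing that layer is the longer-term part.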

Thanks for kicking this thread off, Joseph!

On Tue, Jan 24, 2017 at 7:30 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> *Re: performance measurement framework*
> We (Databricks) used to use spark-perf
> <https://github.com/databricks/spark-perf>, but that was mainly for the
> RDD-based API.  We've now switched to spark-sql-perf
> <https://github.com/databricks/spark-sql-perf>, which does include some
> ML benchmarks despite the project name.  I'll see about updating the
> project README to document how to run MLlib tests.
>
>
> On Tue, Jan 24, 2017 at 6:02 PM, bradc <br...@oracle.com> wrote:
>
>> I believe one of the higher-level goals of Spark MLlib should be to
>> improve the efficiency of the ML algorithms that already exist. Currently
>> ML has reasonable coverage of the important core algorithms. The work to
>> get to feature parity for the DataFrame-based API and on model persistence
>> is also important.
>>
>> Apache Spark needs to use higher-level BLAS3 and LAPACK routines, instead
>> of BLAS1 & BLAS2. For a long time we've used the concept of compute
>> intensity (compute_intensity = FP_operations/Word) to help look at the
>> performance of the underlying compute kernels (see the papers referenced
>> below). It has been proven in many implementations that performance,
>> scalability, and a huge reduction in memory pressure can be achieved by
>> using higher-level BLAS3 or LAPACK routines in both single-node and
>> distributed computations.
>>
>> I performed a survey of some of Apache Spark's ML algorithms.
>> Unfortunately most of the ML algorithms are implemented with BLAS1 or BLAS2
>> routines which have very low compute intensity. BLAS2 and BLAS1 routines
>> require a lot more memory bandwidth and will not achieve peak performance
>> on x86, GPUs, or any other processor.
>>
>> Apache Spark 2.1.0 ML routines & BLAS Routines
>>
>> ALS (Alternating Least Squares matrix factorization)
>>
>>    - BLAS2: _SPR, _TPSV
>>    - BLAS1: _AXPY, _DOT, _SCAL, _NRM2
>>
>> Logistic regression classification
>>
>>    - BLAS2: _GEMV
>>    - BLAS1: _DOT, _SCAL
>>
>> Generalized linear regression
>>
>>    - BLAS1: _DOT
>>
>> Gradient-boosted tree regression
>>
>>    - BLAS1: _DOT
>>
>> GraphX SVD++
>>
>>    - BLAS1: _AXPY, _DOT, _SCAL
>>
>> Neural Net Multi-layer Perceptron
>>
>>    - BLAS3: _GEMM
>>    - BLAS2: _GEMV
>>
>> Only the Neural Net Multi-layer Perceptron uses BLAS3 matrix multiply
>> (DGEMM). BTW the underscores are replaced by S, D, C, or Z for
>> single-precision real, double-precision real, single-precision complex,
>> and double-precision complex operations, respectively.
>>
>> Refactoring the algorithms to use BLAS3 routines or higher level LAPACK
>> routines will require coding changes to use sub-block algorithms but the
>> performance benefits can be great.
>>
>> More at: https://blogs.oracle.com/BestPerf/entry/improving_algorithms_in_spark_ml
>> Background:
>>
>> Brad Carlile. Parallelism, compute intensity, and data vectorization.
>> SuperComputing'93, November 1993.
>> <https://blogs.oracle.com/BestPerf/resource/Carlile-app_compute-intensity-1993.pdf>
>>
>> John McCalpin. Memory Bandwidth and Machine Balance in Current High
>> Performance Computers. 1995.
>> <https://www.researchgate.net/publication/213876927_Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers>
>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>

Re: MLlib mission and goals

Posted by Joseph Bradley <jo...@databricks.com>.
*Re: performance measurement framework*
We (Databricks) used to use spark-perf
<https://github.com/databricks/spark-perf>, but that was mainly for the
RDD-based API.  We've now switched to spark-sql-perf
<https://github.com/databricks/spark-sql-perf>, which does include some ML
benchmarks despite the project name.  I'll see about updating the project
README to document how to run MLlib tests.


On Tue, Jan 24, 2017 at 6:02 PM, bradc <br...@oracle.com> wrote:

> I believe one of the higher-level goals of Spark MLlib should be to
> improve the efficiency of the ML algorithms that already exist. Currently
> ML has reasonable coverage of the important core algorithms. The work to
> get to feature parity for the DataFrame-based API and on model persistence
> is also important.
>
> Apache Spark needs to use higher-level BLAS3 and LAPACK routines, instead
> of BLAS1 & BLAS2. For a long time we've used the concept of compute
> intensity (compute_intensity = FP_operations/Word) to help look at the
> performance of the underlying compute kernels (see the papers referenced
> below). It has been proven in many implementations that performance,
> scalability, and a huge reduction in memory pressure can be achieved by
> using higher-level BLAS3 or LAPACK routines in both single-node and
> distributed computations.
>
> I performed a survey of some of Apache Spark's ML algorithms.
> Unfortunately most of the ML algorithms are implemented with BLAS1 or BLAS2
> routines which have very low compute intensity. BLAS2 and BLAS1 routines
> require a lot more memory bandwidth and will not achieve peak performance
> on x86, GPUs, or any other processor.
>
> Apache Spark 2.1.0 ML routines & BLAS Routines
>
> ALS (Alternating Least Squares matrix factorization)
>
>    - BLAS2: _SPR, _TPSV
>    - BLAS1: _AXPY, _DOT, _SCAL, _NRM2
>
> Logistic regression classification
>
>    - BLAS2: _GEMV
>    - BLAS1: _DOT, _SCAL
>
> Generalized linear regression
>
>    - BLAS1: _DOT
>
> Gradient-boosted tree regression
>
>    - BLAS1: _DOT
>
> GraphX SVD++
>
>    - BLAS1: _AXPY, _DOT, _SCAL
>
> Neural Net Multi-layer Perceptron
>
>    - BLAS3: _GEMM
>    - BLAS2: _GEMV
>
> Only the Neural Net Multi-layer Perceptron uses BLAS3 matrix multiply
> (DGEMM). BTW the underscores are replaced by S, D, C, or Z for
> single-precision real, double-precision real, single-precision complex,
> and double-precision complex operations, respectively.
>
> Refactoring the algorithms to use BLAS3 routines or higher level LAPACK
> routines will require coding changes to use sub-block algorithms but the
> performance benefits can be great.
>
> More at: https://blogs.oracle.com/BestPerf/entry/improving_algorithms_in_spark_ml
> Background:
>
> Brad Carlile. Parallelism, compute intensity, and data vectorization.
> SuperComputing'93, November 1993.
> <https://blogs.oracle.com/BestPerf/resource/Carlile-app_compute-intensity-1993.pdf>
>
> John McCalpin. Memory Bandwidth and Machine Balance in Current High
> Performance Computers. 1995.
> <https://www.researchgate.net/publication/213876927_Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers>
>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.


Re: MLlib mission and goals

Posted by bradc <br...@oracle.com>.
I believe one of the higher-level goals of Spark MLlib should be to improve
the efficiency of the ML algorithms that already exist.  Currently ML has
reasonable coverage of the important core algorithms.  The work to get to
feature parity for the DataFrame-based API and on model persistence is also
important.
Apache Spark needs to use higher-level BLAS3 and LAPACK routines, instead of
BLAS1 & BLAS2.  For a long time we've used the concept of compute intensity
(compute_intensity = FP_operations/Word) to help look at the performance of
the underlying compute kernels (see the papers referenced below).  It has
been proven in many implementations that performance, scalability, and a huge
reduction in memory pressure can be achieved by using higher-level BLAS3 or
LAPACK routines in both single-node and distributed computations.
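
To make the compute-intensity point concrete, a back-of-the-envelope for
n x n double-precision operands:
   - DDOT (BLAS1): ~2n flops over ~2n words moved, so intensity ~1
   - DGEMV (BLAS2): ~2n^2 flops over ~n^2 words, so intensity ~2
   - DGEMM (BLAS3): ~2n^3 flops over ~3n^2 words, so intensity ~2n/3, which
     grows with n
Only the BLAS3 case gives the kernel enough data reuse per word of memory
traffic to approach peak floating-point rates.
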
I performed a survey of some of Apache Spark's ML algorithms.  Unfortunately
most of the ML algorithms are implemented with BLAS1 or BLAS2 routines which
have very low compute intensity.  BLAS2 and BLAS1 routines require a lot
more memory bandwidth and will not achieve peak performance on x86, GPUs, or
any other processor.  
Apache Spark 2.1.0 ML routines & BLAS Routines
ALS (Alternating Least Squares matrix factorization)
   - BLAS2: _SPR, _TPSV
   - BLAS1: _AXPY, _DOT, _SCAL, _NRM2
Logistic regression classification
   - BLAS2: _GEMV
   - BLAS1: _DOT, _SCAL
Generalized linear regression
   - BLAS1: _DOT
Gradient-boosted tree regression
   - BLAS1: _DOT
GraphX SVD++
   - BLAS1: _AXPY, _DOT, _SCAL
Neural Net Multi-layer Perceptron
   - BLAS3: _GEMM
   - BLAS2: _GEMV
Only the Neural Net Multi-layer Perceptron uses BLAS3 matrix multiply
(DGEMM).  BTW the underscores are replaced by S, D, C, or Z for
single-precision real, double-precision real, single-precision complex, and
double-precision complex operations, respectively.
Refactoring the algorithms to use BLAS3 routines or higher-level LAPACK
routines will require coding changes to use sub-block algorithms, but the
performance benefits can be great.
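
As a rough illustration of what the refactored kernels would bottom out in
(this is just the netlib-java interface that Spark already links against;
the toy sizes are arbitrary):

import com.github.fommil.netlib.BLAS

// C = A * B in a single BLAS3 dgemm call over column-major arrays, instead
// of m*n separate BLAS1 dot products.
val (m, k, n) = (4, 3, 2)
val a = Array.fill(m * k)(1.0)   // m x k
val b = Array.fill(k * n)(2.0)   // k x n
val c = new Array[Double](m * n) // m x n result
BLAS.getInstance().dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)
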
More at:
https://blogs.oracle.com/BestPerf/entry/improving_algorithms_in_spark_ml
Background:
Brad Carlile. Parallelism, compute intensity, and data vectorization.
SuperComputing'93, November 1993.
<https://blogs.oracle.com/BestPerf/resource/Carlile-app_compute-intensity-1993.pdf>  
John McCalpin. Memory Bandwidth and Machine Balance in Current High
Performance Computers. 1995.
<https://www.researchgate.net/publication/213876927_Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers>





Re: MLlib mission and goals

Posted by Saikat Kanjilal <sx...@hotmail.com>.
In reading through this and thinking about usability: is there any interest in building a performance measurement framework around some (or maybe all) of the MLlib algorithms? I envision this as something that runs for each release build for our end users; it may also be useful for ML devs to see what impact each change to their code has on performance. Please pardon me if this already exists; I am new to the codebase and to contributing to Spark.
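
As a strawman, such a harness could start as simple as timing fit() on
synthetic data once per release; the object name, data shape, and algorithm
choice below are all hypothetical:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLPerfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ml-perf-sketch").getOrCreate()
    import spark.implicits._
    // Synthetic dense features; cache and materialize so we time fit() only.
    val df = spark.range(100000).map { i =>
      val rng = new scala.util.Random(i)
      Tuple1(Vectors.dense(Array.fill(20)(rng.nextGaussian())))
    }.toDF("features").cache()
    df.count()
    val t0 = System.nanoTime()
    new KMeans().setK(10).setSeed(1L).fit(df)
    println(s"KMeans.fit took ${(System.nanoTime() - t0) / 1e9} s")
  }
}

Tracking numbers like that per release would catch regressions of the kind
described below.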


________________________________
From: Asher Krim <ak...@hubspot.com>
Sent: Tuesday, January 24, 2017 12:17 PM
To: Miao Wang
Cc: javadba@gmail.com; dev@spark.apache.org; Sean Owen
Subject: Re: MLlib mission and goals

On the topic of usability, I think more effort should be put into large scale testing. We've encountered issues with building large models that are not apparent in small models, and these issues have made productizing ML/MLLIB much more difficult than we first anticipated. Considering that one of the biggest selling points for Spark is ease of scaling to large datasets, I think fleshing out SPARK-15573 and testing large models should be a priority.

On Tue, Jan 24, 2017 at 2:23 PM, Miao Wang <wa...@us.ibm.com> wrote:
I have been working on ML/MLLIB/R since last year. Here are some of my thoughts from a beginner's perspective:

Current ML/MLLIB core algorithms can serve as good implementation examples, which makes adding new algorithms easier. Even a beginner like me can pick it up quickly and learn how to add new algorithms. So, adding new algorithms should not be a barrier for developers who really need specific algorithms, and it should not be the first priority in ML/MLLIB's long-term goals. We should only add highly demanded algorithms. I hope there will be detailed JIRA/email discussions to decide whether we want to accept a new algorithm.

I strongly agree that we should improve ML/MLLIB usability, stability and performance in core algorithms and foundations such as the linear algebra library. This will keep Spark ML/MLLIB competitive among machine learning frameworks. For example, Microsoft just open-sourced a fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms. Its performance and accuracy are much better than XGBoost's. We need to follow up and improve Spark's GBT algorithms in the near future.

Another related area is SparkR. API Parity between SparkR and ML/MLLIB is important. We should also pay attention to R users' habits and experiences when maintaining API parity.

Miao

----- Original message -----
From: Stephen Boesch <ja...@gmail.com>
To: Sean Owen <so...@cloudera.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Subject: Re: MLlib mission and goals
Date: Tue, Jan 24, 2017 4:42 AM

re: spark-packages.org and "Would these really be better in the core project?"  That was not at all the intent of my input; instead, I was asking how and where to structure and place deployment-quality code that is *not* part of the distribution. The Spark Packages site has no curation whatsoever: no minimum standards of code quality or deployment structure, let alone qualitative measures of usefulness.

While Spark Packages would never rival CRAN and friends, there is not even any mechanism in place to get started.  From the CRAN site:

   Even at the current growth rate of several packages a day, all submissions are still rigorously quality-controlled using strong testing features available in the R system.

Maybe give something that has a subset of these processes a try?  Perhaps with different folks than those already over-subscribed in MLlib?

2017-01-24 2:37 GMT-08:00 Sean Owen <so...@cloudera.com>:
My $0.02, which shouldn't be weighted too much.

I believe the mission as of Spark ML has been to provide the framework, and then implementation of 'the basics' only. It should have the tools that cover ~80% of use cases, out of the box, in a pretty well-supported and tested way.

It's not a goal to support an arbitrarily large collection of algorithms because each one adds marginally less value, and IMHO, is proportionally bigger baggage, because the contributors tend to skew academic, produce worse code, and don't stick around to maintain it.

The project is already generally quite overloaded; I don't know if there's bandwidth to even cover the current scope. While 'the basics' is a subjective label, de facto, I think we'd have to define it as essentially "what we already have in place" for the foreseeable future.

That the bits on spark-packages.org aren't so hot is not a problem but a symptom. Would these really be better in the core project?

And, or: I entirely agree with Joseph's take.

On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <jo...@databricks.com> wrote:
This thread is split off from the "Feedback on MLlib roadmap process proposal" thread for discussing the high-level mission and goals for MLlib.  I hope this thread will collect feedback and ideas, not necessarily lead to huge decisions.

Copying from the previous thread:

Seth:
"""
I would love to hear some discussion on the higher level goal of Spark MLlib (if this derails the original discussion, please let me know and we can discuss in another thread). The roadmap does contain specific items that help to convey some of this (ML parity with MLlib, model persistence, etc...), but I'm interested in what the "mission" of Spark MLlib is. We often see PRs for brand new algorithms which are sometimes rejected and sometimes not. Do we aim to keep implementing more and more algorithms? Or is our focus really, now that we have a reasonable library of algorithms, to simply make the existing ones faster/better/more robust? Should we aim to make interfaces that are easily extended for developers to easily implement their own custom code (e.g. custom optimization libraries), or do we want to restrict things to out-of-the box algorithms? Should we focus on more flexible, general abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this discussion may have happened, but I think it would be useful to either revisit it or restate it here for some of the newer developers.
"""

Mingjie:
"""
+1 general abstractions like distributed linear algebra.
"""


I'll add my thoughts, starting with our past trajectory:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and making the library more robust.

I agree with Seth that a few immediate goals are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity

In the future, it's harder to say, but if I had to pick my top 2 items, I'd list:

(1) Making MLlib more extensible
It will not be feasible to support a huge number of algorithms, so allowing users to customize their ML on Spark workflows will be critical.  This is IMO the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and we will need to make it easier for users to write their own algorithms and packages to facilitate this.  Part of this could be allowing users to customize existing algorithms with custom loss functions, etc.

(2) Consistent improvements to core algorithms
A less exciting but still very important item will be constantly improving the core set of algorithms in MLlib. This could mean speed, scaling, robustness, and usability for the few algorithms which cover 90% of use cases.

There are plenty of other possibilities, and it will be great to hear the community's thoughts!

Thanks,
Joseph



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.







--
Asher Krim
Senior Software Engineer

Re: MLlib mission and goals

Posted by Asher Krim <ak...@hubspot.com>.
On the topic of usability, I think more effort should be put into large
scale testing. We've encountered issues with building large models that are
not apparent in small models, and these issues have made productizing
ML/MLLIB much more difficult than we first anticipated. Considering that
one of the biggest selling points for Spark is ease of scaling to large
datasets, I think fleshing out SPARK-15573 and testing large models should
be a priority.

On Tue, Jan 24, 2017 at 2:23 PM, Miao Wang <wa...@us.ibm.com> wrote:

> I have been working on ML/MLLIB/R since last year. Here are some of my
> thoughts from a beginner's perspective:
>
> Current ML/MLLIB core algorithms can serve as good implementation
> examples, which makes adding new algorithms easier. Even a beginner like
> me can pick it up quickly and learn how to add new algorithms. So, adding
> new algorithms should not be a barrier for developers who really need
> specific algorithms, and it should not be the first priority in ML/MLLIB's
> long-term goals. We should only add highly demanded algorithms. I hope
> there will be detailed JIRA/email discussions to decide whether we want to
> accept a new algorithm.
>
> I strongly agree that we should improve ML/MLLIB usability, stability and
> performance in core algorithms and foundations such as the linear algebra
> library. This will keep Spark ML/MLLIB competitive among machine learning
> frameworks. For example, Microsoft just open-sourced a fast, distributed,
> high-performance gradient boosting (GBDT, GBRT, GBM or MART) framework
> based on decision tree algorithms. Its performance and accuracy are much
> better than XGBoost's. We need to follow up and improve Spark's GBT
> algorithms in the near future.
>
> Another related area is SparkR. API Parity between SparkR and ML/MLLIB is
> important. We should also pay attention to R users' habits and experiences
> when maintaining API parity.
>
> Miao
>
>
> ----- Original message -----
> From: Stephen Boesch <ja...@gmail.com>
> To: Sean Owen <so...@cloudera.com>
> Cc: "dev@spark.apache.org" <de...@spark.apache.org>
> Subject: Re: MLlib mission and goals
> Date: Tue, Jan 24, 2017 4:42 AM
>
> re: spark-packages.org and "Would these really be better in the core
> project?"  That was not at all the intent of my input; instead, I was
> asking how and where to structure and place deployment-quality code that
> is *not* part of the distribution. The Spark Packages site has no curation
> whatsoever: no minimum standards of code quality or deployment structure,
> let alone qualitative measures of usefulness.
>
> While Spark Packages would never rival CRAN and friends, there is not even
> any mechanism in place to get started.  From the CRAN site:
>
>    Even at the current growth rate of several packages a day, all
> submissions are still rigorously quality-controlled using strong testing
> features available in the R system.
>
> Maybe give something that has a subset of these processes a try?
> Perhaps with different folks than those already over-subscribed in MLlib?
>
> 2017-01-24 2:37 GMT-08:00 Sean Owen <so...@cloudera.com>:
>
> My $0.02, which shouldn't be weighted too much.
>
> I believe the mission as of Spark ML has been to provide the framework,
> and then implementation of 'the basics' only. It should have the tools that
> cover ~80% of use cases, out of the box, in a pretty well-supported and
> tested way.
>
> It's not a goal to support an arbitrarily large collection of algorithms
> because each one adds marginally less value, and IMHO, is proportionally
> bigger baggage, because the contributors tend to skew academic, produce
> worse code, and don't stick around to maintain it.
>
> The project is already generally quite overloaded; I don't know if there's
> bandwidth to even cover the current scope. While 'the basics' is a
> subjective label, de facto, I think we'd have to define it as essentially
> "what we already have in place" for the foreseeable future.
>
> That the bits on spark-packages.org aren't so hot is not a problem but a
> symptom. Would these really be better in the core project?
>
> And, or: I entirely agree with Joseph's take.
>
> On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <jo...@databricks.com>
> wrote:
>
> This thread is split off from the "Feedback on MLlib roadmap process
> proposal" thread for discussing the high-level mission and goals for
> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
> lead to huge decisions.
>
> Copying from the previous thread:
>
> *Seth:*
> """
> I would love to hear some discussion on the higher level goal of Spark
> MLlib (if this derails the original discussion, please let me know and we
> can discuss in another thread). The roadmap does contain specific items
> that help to convey some of this (ML parity with MLlib, model persistence,
> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
> often see PRs for brand new algorithms which are sometimes rejected and
> sometimes not. Do we aim to keep implementing more and more algorithms? Or
> is our focus really, now that we have a reasonable library of algorithms,
> to simply make the existing ones faster/better/more robust? Should we aim
> to make interfaces that are easily extended for developers to easily
> implement their own custom code (e.g. custom optimization libraries), or do
> we want to restrict things to out-of-the box algorithms? Should we focus on
> more flexible, general abstractions like distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
> """
>
> *Mingjie:*
> """
> +1 general abstractions like distributed linear algebra.
> """
>
>
> I'll add my thoughts, starting with our past *trajectory*:
> * Initially, MLlib was mainly trying to build a set of core algorithms.
> * Two years ago, the big effort was adding Pipelines.
> * In the last year, big efforts have been around completing Pipelines and
> making the library more robust.
>
> I agree with Seth that a few *immediate goals* are very clear:
> * feature parity for DataFrame-based API
> * completing and improving testing for model persistence
> * Python, R parity
>
> *In the future*, it's harder to say, but if I had to pick my top 2 items,
> I'd list:
>
> *(1) Making MLlib more extensible*
> It will not be feasible to support a huge number of algorithms, so
> allowing users to customize their ML on Spark workflows will be critical.
> This is IMO the most important thing we could do for MLlib.
> Part of this could be building a healthy community of Spark Packages, and
> we will need to make it easier for users to write their own algorithms and
> packages to facilitate this.  Part of this could be allowing users to
> customize existing algorithms with custom loss functions, etc.
>
> *(2) Consistent improvements to core algorithms*
> A less exciting but still very important item will be constantly improving
> the core set of algorithms in MLlib. This could mean speed, scaling,
> robustness, and usability for the few algorithms which cover 90% of use
> cases.
>
> There are plenty of other possibilities, and it will be great to hear the
> community's thoughts!
>
> Thanks,
> Joseph
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>
>
>




-- 
Asher Krim
Senior Software Engineer

Re: MLlib mission and goals

Posted by Stephen Boesch <ja...@gmail.com>.
re: spark-packages.org and "Would these really be better in the core
project?"  That was not at all the intent of my input; instead, I was asking
how and where to structure and place deployment-quality code that is *not*
part of the distribution. The Spark Packages site has no curation
whatsoever: no minimum standards of code quality or deployment structure,
let alone qualitative measures of usefulness.

While Spark Packages would never rival CRAN and friends, there is not even
any mechanism in place to get started.  From the CRAN site:

   Even at the current growth rate of several packages a day, all
submissions are still rigorously quality-controlled using strong testing
features available in the R system.

Maybe give something that has a subset of these processes a try?
Perhaps with different folks than those already over-subscribed in MLlib?

2017-01-24 2:37 GMT-08:00 Sean Owen <so...@cloudera.com>:

> My $0.02, which shouldn't be weighted too much.
>
> I believe the mission as of Spark ML has been to provide the framework,
> and then implementation of 'the basics' only. It should have the tools that
> cover ~80% of use cases, out of the box, in a pretty well-supported and
> tested way.
>
> It's not a goal to support an arbitrarily large collection of algorithms
> because each one adds marginally less value, and IMHO, is proportionally
> bigger baggage, because the contributors tend to skew academic, produce
> worse code, and don't stick around to maintain it.
>
> The project is already generally quite overloaded; I don't know if there's
> bandwidth to even cover the current scope. While 'the basics' is a
> subjective label, de facto, I think we'd have to define it as essentially
> "what we already have in place" for the foreseeable future.
>
> That the bits on spark-packages.org aren't so hot is not a problem but a
> symptom. Would these really be better in the core project?
>
> And, or: I entirely agree with Joseph's take.
>
>
> On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> This thread is split off from the "Feedback on MLlib roadmap process
>> proposal" thread for discussing the high-level mission and goals for
>> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
>> lead to huge decisions.
>>
>> Copying from the previous thread:
>>
>> *Seth:*
>> """
>> I would love to hear some discussion on the higher level goal of Spark
>> MLlib (if this derails the original discussion, please let me know and we
>> can discuss in another thread). The roadmap does contain specific items
>> that help to convey some of this (ML parity with MLlib, model persistence,
>> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
>> often see PRs for brand new algorithms which are sometimes rejected and
>> sometimes not. Do we aim to keep implementing more and more algorithms? Or
>> is our focus really, now that we have a reasonable library of algorithms,
>> to simply make the existing ones faster/better/more robust? Should we aim
>> to make interfaces that are easily extended for developers to easily
>> implement their own custom code (e.g. custom optimization libraries), or do
>> we want to restrict things to out-of-the box algorithms? Should we focus on
>> more flexible, general abstractions like distributed linear algebra?
>>
>> I was not involved in the project in the early days of MLlib when this
>> discussion may have happened, but I think it would be useful to either
>> revisit it or restate it here for some of the newer developers.
>> """
>>
>> *Mingjie:*
>> """
>> +1 general abstractions like distributed linear algebra.
>> """
>>
>>
>> I'll add my thoughts, starting with our past *trajectory*:
>> * Initially, MLlib was mainly trying to build a set of core algorithms.
>> * Two years ago, the big effort was adding Pipelines.
>> * In the last year, big efforts have been around completing Pipelines and
>> making the library more robust.
>>
>> I agree with Seth that a few *immediate goals* are very clear:
>> * feature parity for DataFrame-based API
>> * completing and improving testing for model persistence
>> * Python, R parity
>>
>> *In the future*, it's harder to say, but if I had to pick my top 2
>> items, I'd list:
>>
>> *(1) Making MLlib more extensible*
>> It will not be feasible to support a huge number of algorithms, so
>> allowing users to customize their ML on Spark workflows will be critical.
>> This is IMO the most important thing we could do for MLlib.
>> Part of this could be building a healthy community of Spark Packages, and
>> we will need to make it easier for users to write their own algorithms and
>> packages to facilitate this.  Part of this could be allowing users to
>> customize existing algorithms with custom loss functions, etc.
>>
>> *(2) Consistent improvements to core algorithms*
>> A less exciting but still very important item will be constantly
>> improving the core set of algorithms in MLlib. This could mean speed,
>> scaling, robustness, and usability for the few algorithms which cover 90%
>> of use cases.
>>
>> There are plenty of other possibilities, and it will be great to hear the
>> community's thoughts!
>>
>> Thanks,
>> Joseph
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>>
>

Re: MLlib mission and goals

Posted by Jörn Franke <jo...@gmail.com>.
I also agree with Joseph and Sean.
With respect to spark-packages: I think the issue is that you have to add it manually, although it basically fetches the package from Maven Central (or a custom upload).

From an organizational perspective there are other issues. E.g., you have to download it from the internet instead of using an artifact repository within the enterprise. You do not want users to download arbitrary packages from the Internet into a production cluster. You also want to make sure that they do not use outdated or snapshot versions, and that you have control over dependencies, licenses, etc.

Currently I do not see big artifact repository managers supporting Spark packages anytime soon, and I do not see it coming from the big Hadoop distributions either.


> On 24 Jan 2017, at 11:37, Sean Owen <so...@cloudera.com> wrote:
> 
> My $0.02, which shouldn't be weighted too much.
> 
> I believe the mission as of Spark ML has been to provide the framework, and then implementation of 'the basics' only. It should have the tools that cover ~80% of use cases, out of the box, in a pretty well-supported and tested way.
> 
> It's not a goal to support an arbitrarily large collection of algorithms because each one adds marginally less value, and IMHO, is proportionally bigger baggage, because the contributors tend to skew academic, produce worse code, and don't stick around to maintain it. 
> 
> The project is already generally quite overloaded; I don't know if there's bandwidth to even cover the current scope. While 'the basics' is a subjective label, de facto, I think we'd have to define it as essentially "what we already have in place" for the foreseeable future.
> 
> That the bits on spark-packages.org aren't so hot is not a problem but a symptom. Would these really be better in the core project?
> 
> And, or: I entirely agree with Joseph's take.
> 
>> On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <jo...@databricks.com> wrote:
>> This thread is split off from the "Feedback on MLlib roadmap process proposal" thread for discussing the high-level mission and goals for MLlib.  I hope this thread will collect feedback and ideas, not necessarily lead to huge decisions.
>> 
>> Copying from the previous thread:
>> 
>> Seth:
>> """
>> I would love to hear some discussion on the higher level goal of Spark MLlib (if this derails the original discussion, please let me know and we can discuss in another thread). The roadmap does contain specific items that help to convey some of this (ML parity with MLlib, model persistence, etc...), but I'm interested in what the "mission" of Spark MLlib is. We often see PRs for brand new algorithms which are sometimes rejected and sometimes not. Do we aim to keep implementing more and more algorithms? Or is our focus really, now that we have a reasonable library of algorithms, to simply make the existing ones faster/better/more robust? Should we aim to make interfaces that are easily extended for developers to easily implement their own custom code (e.g. custom optimization libraries), or do we want to restrict things to out-of-the box algorithms? Should we focus on more flexible, general abstractions like distributed linear algebra?
>> 
>> I was not involved in the project in the early days of MLlib when this discussion may have happened, but I think it would be useful to either revisit it or restate it here for some of the newer developers.
>> """
>> 
>> Mingjie:
>> """
>> +1 general abstractions like distributed linear algebra.
>> """
>> 
>> 
>> I'll add my thoughts, starting with our past trajectory:
>> * Initially, MLlib was mainly trying to build a set of core algorithms.
>> * Two years ago, the big effort was adding Pipelines.
>> * In the last year, big efforts have been around completing Pipelines and making the library more robust.
>> 
>> I agree with Seth that a few immediate goals are very clear:
>> * feature parity for DataFrame-based API
>> * completing and improving testing for model persistence
>> * Python, R parity
>> 
>> In the future, it's harder to say, but if I had to pick my top 2 items, I'd list:
>> 
>> (1) Making MLlib more extensible
>> It will not be feasible to support a huge number of algorithms, so allowing users to customize their ML on Spark workflows will be critical.  This is IMO the most important thing we could do for MLlib.
>> Part of this could be building a healthy community of Spark Packages, and we will need to make it easier for users to write their own algorithms and packages to facilitate this.  Part of this could be allowing users to customize existing algorithms with custom loss functions, etc.
>> 
>> (2) Consistent improvements to core algorithms
>> A less exciting but still very important item will be constantly improving the core set of algorithms in MLlib. This could mean speed, scaling, robustness, and usability for the few algorithms which cover 90% of use cases.
>> 
>> There are plenty of other possibilities, and it will be great to hear the community's thoughts!
>> 
>> Thanks,
>> Joseph
>> 
>> -- 
>> Joseph Bradley
>> Software Engineer - Machine Learning
>> Databricks, Inc.
>> 

Re: MLlib mission and goals

Posted by Sean Owen <so...@cloudera.com>.
My $0.02, which shouldn't be weighted too much.

I believe the mission as of Spark ML has been to provide the framework, and
then implementation of 'the basics' only. It should have the tools that
cover ~80% of use cases, out of the box, in a pretty well-supported and
tested way.

It's not a goal to support an arbitrarily large collection of algorithms
because each one adds marginally less value, and IMHO, is proportionally
bigger baggage, because the contributors tend to skew academic, produce
worse code, and don't stick around to maintain it.

The project is already generally quite overloaded; I don't know if there's
bandwidth to even cover the current scope. While 'the basics' is a
subjective label, de facto, I think we'd have to define it as essentially
"what we already have in place" for the foreseeable future.

That the bits on spark-packages.org aren't so hot is not a problem but a
symptom. Would these really be better in the core project?

And, or: I entirely agree with Joseph's take.

On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley <jo...@databricks.com>
wrote:

> This thread is split off from the "Feedback on MLlib roadmap process
> proposal" thread for discussing the high-level mission and goals for
> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
> lead to huge decisions.
>
> Copying from the previous thread:
>
> *Seth:*
> """
> I would love to hear some discussion on the higher level goal of Spark
> MLlib (if this derails the original discussion, please let me know and we
> can discuss in another thread). The roadmap does contain specific items
> that help to convey some of this (ML parity with MLlib, model persistence,
> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
> often see PRs for brand new algorithms which are sometimes rejected and
> sometimes not. Do we aim to keep implementing more and more algorithms? Or
> is our focus really, now that we have a reasonable library of algorithms,
> to simply make the existing ones faster/better/more robust? Should we aim
> to make interfaces that are easily extended for developers to easily
> implement their own custom code (e.g. custom optimization libraries), or do
> we want to restrict things to out-of-the box algorithms? Should we focus on
> more flexible, general abstractions like distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
> """
>
> *Mingjie:*
> """
> +1 general abstractions like distributed linear algebra.
> """
>
>
> I'll add my thoughts, starting with our past *trajectory*:
> * Initially, MLlib was mainly trying to build a set of core algorithms.
> * Two years ago, the big effort was adding Pipelines.
> * In the last year, big efforts have been around completing Pipelines and
> making the library more robust.
>
> I agree with Seth that a few *immediate goals* are very clear:
> * feature parity for DataFrame-based API
> * completing and improving testing for model persistence
> * Python, R parity
>
> *In the future*, it's harder to say, but if I had to pick my top 2 items,
> I'd list:
>
> *(1) Making MLlib more extensible*
> It will not be feasible to support a huge number of algorithms, so
> allowing users to customize their ML on Spark workflows will be critical.
> This is IMO the most important thing we could do for MLlib.
> Part of this could be building a healthy community of Spark Packages, and
> we will need to make it easier for users to write their own algorithms and
> packages to facilitate this.  Part of this could be allowing users to
> customize existing algorithms with custom loss functions, etc.
>
> *(2) Consistent improvements to core algorithms*
> A less exciting but still very important item will be constantly improving
> the core set of algorithms in MLlib. This could mean speed, scaling,
> robustness, and usability for the few algorithms which cover 90% of use
> cases.
>
> There are plenty of other possibilities, and it will be great to hear the
> community's thoughts!
>
> Thanks,
> Joseph
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>

Re: MLlib mission and goals

Posted by Stephen Boesch <ja...@gmail.com>.
Along the lines of #1: Spark Packages seemed to have had a good start
about two years ago, but now no more than a handful are in general
use, e.g. the Databricks CSV package.
When the available packages are browsed, the majority are incomplete, empty,
unmaintained, or unclear.

Any ideas on how to resurrect spark packages in a way that there will be
sufficient adoption for it to be meaningful?

2017-01-23 17:03 GMT-08:00 Joseph Bradley <jo...@databricks.com>:

> This thread is split off from the "Feedback on MLlib roadmap process
> proposal" thread for discussing the high-level mission and goals for
> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
> lead to huge decisions.
>
> Copying from the previous thread:
>
> *Seth:*
> """
> I would love to hear some discussion on the higher level goal of Spark
> MLlib (if this derails the original discussion, please let me know and we
> can discuss in another thread). The roadmap does contain specific items
> that help to convey some of this (ML parity with MLlib, model persistence,
> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
> often see PRs for brand new algorithms which are sometimes rejected and
> sometimes not. Do we aim to keep implementing more and more algorithms? Or
> is our focus really, now that we have a reasonable library of algorithms,
> to simply make the existing ones faster/better/more robust? Should we aim
> to make interfaces that are easily extended for developers to easily
> implement their own custom code (e.g. custom optimization libraries), or do
> we want to restrict things to out-of-the box algorithms? Should we focus on
> more flexible, general abstractions like distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
> """
>
> *Mingjie:*
> """
> +1 general abstractions like distributed linear algebra.
> """
>
>
> I'll add my thoughts, starting with our past *trajectory*:
> * Initially, MLlib was mainly trying to build a set of core algorithms.
> * Two years ago, the big effort was adding Pipelines.
> * In the last year, big efforts have been around completing Pipelines and
> making the library more robust.
>
> I agree with Seth that a few *immediate goals* are very clear:
> * feature parity for DataFrame-based API
> * completing and improving testing for model persistence
> * Python, R parity
>
> *In the future*, it's harder to say, but if I had to pick my top 2 items,
> I'd list:
>
> *(1) Making MLlib more extensible*
> It will not be feasible to support a huge number of algorithms, so
> allowing users to customize their ML on Spark workflows will be critical.
> This is IMO the most important thing we could do for MLlib.
> Part of this could be building a healthy community of Spark Packages, and
> we will need to make it easier for users to write their own algorithms and
> packages to facilitate this.  Part of this could be allowing users to
> customize existing algorithms with custom loss functions, etc.
>
> *(2) Consistent improvements to core algorithms*
> A less exciting but still very important item will be constantly improving
> the core set of algorithms in MLlib. This could mean speed, scaling,
> robustness, and usability for the few algorithms which cover 90% of use
> cases.
>
> There are plenty of other possibilities, and it will be great to hear the
> community's thoughts!
>
> Thanks,
> Joseph
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>