Posted to dev@spark.apache.org by Seth Hendrickson <se...@gmail.com> on 2017/02/01 00:15:54 UTC

Re: MLlib mission and goals

I agree with what Sean said about not supporting arbitrarily many
algorithms. I think the goal of MLlib should be to support only core
algorithms for machine learning. Ideally Spark ML provides a relatively
small set of algorithms that are heavily optimized, and also provides a
framework that makes it easy for users to extend and build their own
packages and algos when they need to. Spark ML is already quite good for
this. We have of course been doing a lot of work migrating to this new API,
and now that we are approaching full parity, it would be good to shift the
focus to performance as others have noted. Supporting a few algorithms that
perform very well is significantly better than supporting many algorithms
with moderate performance, IMO.
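
To make that extension point concrete, here is a minimal sketch of a custom
stage built on the public Pipeline API (the class name and the squaring logic
are hypothetical, purely for illustration):

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, DoubleType}

    // Hypothetical user-defined stage: squares a numeric input column.
    // Extending UnaryTransformer means it composes with built-in stages
    // inside an ML Pipeline, with no changes to Spark itself.
    class Squarer(override val uid: String)
        extends UnaryTransformer[Double, Double, Squarer] {

      def this() = this(Identifiable.randomUID("squarer"))

      override protected def createTransformFunc: Double => Double = x => x * x

      override protected def outputDataType: DataType = DoubleType
    }

It is used like any built-in stage, e.g.
new Squarer().setInputCol("x").setOutputCol("x2").transform(df), or dropped
into a Pipeline alongside existing stages.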

I also think a more complete, optimized distributed linear algebra library
would be a great asset, but it may be a more long term goal. A performance
framework for regression testing would be great, but keeping it up to date
is difficult.
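
Even without a full framework, a minimal harness is easy to sketch. Something
like the following (the dataset path and iteration count are placeholders)
could be rerun per release to catch regressions in a single algorithm:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.SparkSession

    object LrFitBench {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("lr-fit-bench").getOrCreate()
        // Placeholder dataset; any labeled LIBSVM-format data works here.
        val data = spark.read.format("libsvm")
          .load("data/mllib/sample_libsvm_data.txt").cache()
        data.count() // materialize the cache so fit() timing is clean

        val lr = new LogisticRegression().setMaxIter(20)
        val start = System.nanoTime()
        lr.fit(data)
        println(f"LogisticRegression.fit: ${(System.nanoTime() - start) / 1e9}%.2f s")
        spark.stop()
      }
    }

The hard part, as noted, is not the timing itself but keeping datasets,
cluster configurations, and baseline numbers current across releases.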

Thanks for kicking this thread off Joseph!

On Tue, Jan 24, 2017 at 7:30 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> *Re: performance measurement framework*
> We (Databricks) used to use spark-perf
> <https://github.com/databricks/spark-perf>, but that was mainly for the
> RDD-based API.  We've now switched to spark-sql-perf
> <https://github.com/databricks/spark-sql-perf>, which does include some
> ML benchmarks despite the project name.  I'll see about updating the
> project README to document how to run MLlib tests.
>
>
> On Tue, Jan 24, 2017 at 6:02 PM, bradc <br...@oracle.com> wrote:
>
>> I believe one of the higher level goals of Spark MLlib should be to
>> improve the efficiency of the ML algorithms that already exist. Currently
>> ML has reasonable coverage of the important core algorithms. The work to
>> reach feature parity for the DataFrame-based API and model persistence is
>> also important.
>>
>> Apache Spark needs to use higher-level BLAS3 and LAPACK routines instead
>> of BLAS1 & BLAS2. For a long time we've used the concept of compute
>> intensity (compute_intensity = FP_operations/Word) to help look at the
>> performance of the underlying compute kernels (see the papers referenced
>> below). Many implementations have shown that better performance, better
>> scalability, and a huge reduction in memory pressure can be achieved by
>> using higher-level BLAS3 or LAPACK routines in both single-node and
>> distributed computations.
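>>
>> To make the BLAS1-vs-BLAS3 gap concrete, here is a back-of-envelope sketch
>> using the standard textbook flop and word counts (counts, not measurements;
>> the helper names are just for illustration):
>>
>>    // DAXPY (BLAS1), y := a*x + y over n doubles:
>>    //   2n flops vs ~3n words moved  => intensity ~ 2/3, flat in n
>>    def daxpyIntensity(n: Long): Double = 2.0 * n / (3.0 * n)
>>
>>    // DGEMM (BLAS3), C := A*B with n x n matrices:
>>    //   2n^3 flops vs ~4n^2 words moved  => intensity ~ n/2, grows with n
>>    def dgemmIntensity(n: Long): Double = 2.0 * n * n * n / (4.0 * n * n)
>>
>>    // dgemmIntensity(1024) = 512.0 flops/word vs daxpyIntensity(1024) ~ 0.67
>>
>> That growth with n is what lets a BLAS3 kernel hide memory traffic behind
>> arithmetic, while BLAS1/BLAS2 kernels stay pinned at the bandwidth ceiling.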
>>
>> I performed a survey of some of Apache Spark's ML algorithms.
>> Unfortunately, most of them are implemented with BLAS1 or BLAS2 routines,
>> which have very low compute intensity. BLAS1 and BLAS2 routines require
>> far more memory bandwidth per flop and will not achieve peak performance
>> on x86, GPUs, or any other processor.
>>
>> Apache Spark 2.1.0 ML routines & BLAS Routines
>>
>> ALS (Alternating Least Squares) matrix factorization
>>
>>    - BLAS2: _SPR, _TPSV
>>    - BLAS1: _AXPY, _DOT, _SCAL, _NRM2
>>
>> Logistic regression classification
>>
>>    - BLAS2: _GEMV
>>    - BLAS1: _DOT, _SCAL
>>
>> Generalized linear regression
>>
>>    - BLAS1: _DOT
>>
>> Gradient-boosted tree regression
>>
>>    - BLAS1: _DOT
>>
>> GraphX SVD++
>>
   - BLAS1: _AXPY, _DOT, _SCAL
>>
>> Neural Net Multi-layer Perceptron
>>
>>    - BLAS3: _GEMM
>>    - BLAS2: _GEMV
>>
>> Only the Neural Net Multi-layer Perceptron uses the BLAS3 matrix multiply
>> (DGEMM). BTW, the underscores are replaced by S, D, C, or Z for
>> single-precision real, double-precision real, single-precision complex,
>> and double-precision complex operations, respectively.
>>
>> Refactoring the algorithms to use BLAS3 routines or higher-level LAPACK
>> routines will require coding changes to use sub-block algorithms, but the
>> performance benefits can be great.
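>>
>> As a minimal sketch of what such a refactor looks like at the kernel level
>> (this goes through netlib-java, which Spark already ships; the matrix sizes
>> and contents are placeholders):
>>
>>    import com.github.fommil.netlib.BLAS
>>
>>    val blas = BLAS.getInstance()
>>    val (m, n, k) = (512, 512, 512)
>>    // Column-major operands: A is m x k, B is k x n, C is m x n.
>>    val a = Array.fill(m * k)(math.random)
>>    val b = Array.fill(k * n)(math.random)
>>    val c = new Array[Double](m * n)
>>
>>    // One DGEMM over the whole block. The BLAS2 formulation would be one
>>    // DGEMV per column of B, re-reading all of A n times; DGEMM instead
>>    // blocks A for cache reuse, the sub-block structure described above.
>>    blas.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)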
>>
>> More at: https://blogs.oracle.com/BestPerf/entry/improving_algorithms_in_spark_ml
>> Background:
>>
>> Brad Carlile. Parallelism, compute intensity, and data vectorization.
>> SuperComputing'93, November 1993.
>> <https://blogs.oracle.com/BestPerf/resource/Carlile-app_compute-intensity-1993.pdf>
>>
>> John McCalpin. Memory Bandwidth and Machine Balance in Current High
>> Performance Computers. IEEE TCCA Newsletter, December 1995.
>> <https://www.researchgate.net/publication/213876927_Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers>
>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> <http://databricks.com/>
>