Posted to user@spark.apache.org by Nick Pentreath <ni...@gmail.com> on 2014/03/04 16:05:02 UTC

Fwd: [Scikit-learn-general] Spark+sklearn sprint outcome ?

Thought that Spark users may be interested in the outcome of the Spark /
scikit-learn sprint that happened last month just after Strata...


---------- Forwarded message ----------
From: Olivier Grisel <ol...@ensta.org>
Date: Fri, Feb 21, 2014 at 6:30 PM
Subject: Re: [Scikit-learn-general] Spark+sklearn sprint outcome ?
To: scikit-learn-general <sc...@lists.sourceforge.net>


2014-02-21 16:06 GMT+01:00 Eustache DIEMERT <eu...@diemert.fr>:
> Hi there,
>
> Could someone that attended the sprint send a rough summary ?
>
> I'd be particularly interested about the tested approaches, those that
> didn't work, those that seem promising and what the next steps could be
...

We started with a general discussion on PySpark. It naturally features
a Python wrapper for mllib, the Scala distributed machine learning
library [1] that is optimized to work on Spark. However, Python users
might still want to leverage existing numpy / scipy tools for some
workloads.

The main difficulty in using numpy-aware tools efficiently is that
Spark presents the data to the workers as an iterator over a large,
possibly cluster-partitioned collection of elements called an RDD. Used
naively, one would load individual rows (1D numpy arrays) as
elements of an RDD to represent the content of a 2D data matrix. This
is not efficient because of the communication overhead between the Scala
and Python workers, and because it prevents efficient BLAS
operations that involve several rows at a time, such as BLAS DGEMM
calls via numpy.dot for instance.
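
For illustration, here is a rough sketch of that naive layout (hypothetical
code, not from the sprint): each RDD element is a single 1D numpy array, so
every numpy call only ever sees one row.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local", "naive-rows")

    # Naive layout: one 1D numpy array per RDD element.
    rows = sc.parallelize([np.random.rand(10) for _ in range(10000)])

    # Each element is serialized individually between the Scala and
    # Python workers, and numpy.dot only ever sees a single row, so
    # BLAS cannot batch work across several rows at once.
    w = np.random.rand(10)
    scores = rows.map(lambda row: float(np.dot(row, w))).collect()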

So this first issue was tackled by writing a block_rdd helper function
[2] to concatenate a bunch of rows (e.g. 1D numpy arrays or lists of
Python dicts) into chunked 2D numpy arrays or pandas DataFrames
respectively.
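
The gist of the idea is roughly as follows (an illustrative sketch, not the
actual implementation in [2]): use mapPartitions to stack the rows of each
partition into fixed-size 2D blocks.

    import numpy as np

    def block_rdd(rdd, block_size=1000):
        """Group the 1D-array elements of an RDD into 2D numpy blocks.

        Sketch only; see [2] for the real helper (which also handles
        dicts -> pandas DataFrames).
        """
        def to_blocks(iterator):
            block = []
            for row in iterator:
                block.append(row)
                if len(block) == block_size:
                    yield np.vstack(block)
                    block = []
            if block:
                yield np.vstack(block)
        return rdd.mapPartitions(to_blocks)

    # Each element of the result is now a (block_size, n_features)
    # matrix, so a dot product dispatches to one BLAS call per block
    # instead of one tiny call per row:
    # blocks = block_rdd(rows)
    # scores = blocks.flatMap(lambda X: X.dot(w).tolist()).collect()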

This makes it possible to train linear models incrementally and more
efficiently, as done in [3]. Model averaging is done via a reduction
step.
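
The pattern is roughly the following (a sketch with assumed names; the real
code is in [3]): fit one scikit-learn SGDClassifier per partition with
partial_fit over its blocks, then average the coefficients in a reduce step.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def fit_partition(iterator):
        """Train one linear model incrementally over a partition's blocks."""
        clf = SGDClassifier()
        seen = False
        for X, y in iterator:
            clf.partial_fit(X, y, classes=np.array([0, 1]))
            seen = True
        if seen:
            yield (clf.coef_, clf.intercept_, 1)

    def add_models(a, b):
        """Reduction step: sum coefficients, intercepts and model counts."""
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

    # blocks_xy is assumed to be an RDD of (X_block, y_block) pairs.
    # coef_sum, intercept_sum, n = (
    #     blocks_xy.mapPartitions(fit_partition).reduce(add_models))
    # coef, intercept = coef_sum / n, intercept_sum / n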

We also discussed how we could make it easier to plot the distribution
of data stored in an RDD, and came up with the idea of computing
histograms on the Spark side while exposing them with the same API as
the numpy.histogram function [4].
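
Roughly (again a sketch with assumed names; the real code is in [4]): fix the
bin edges up front, compute per-partition counts with numpy.histogram, and sum
the counts in a reduce step so the result has the same (counts, bin_edges)
shape as numpy.histogram.

    import numpy as np

    def rdd_histogram(rdd, bins=10, value_range=(0.0, 1.0)):
        """numpy.histogram-style counts over an RDD of scalars (sketch)."""
        # Shared bin edges so per-partition counts can simply be added up.
        edges = np.linspace(value_range[0], value_range[1], bins + 1)

        def partition_counts(iterator):
            counts, _ = np.histogram(list(iterator), bins=edges)
            yield counts

        counts = rdd.mapPartitions(partition_counts).reduce(lambda a, b: a + b)
        return counts, edges

    # counts, edges = rdd_histogram(values_rdd, bins=20, value_range=(-3, 3))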

Have a look at the tests [5] for basic usage examples of all of the above.

There is also some high level discussion of the scope of the project in [6].

[1] http://spark.incubator.apache.org/docs/latest/mllib-guide.html
[2] https://github.com/ogrisel/spylearn/blob/master/spylearn/block_rdd.py
[3] https://github.com/ogrisel/spylearn/blob/master/spylearn/linear_model.py
[4] https://github.com/ogrisel/spylearn/blob/master/spylearn/histogram.py
[5] https://github.com/ogrisel/spylearn/tree/master/test
[6] https://github.com/ogrisel/spylearn/issues/1

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
