You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@samoa.apache.org by Ted Dunning <td...@maprtech.com> on 2015/02/02 00:07:58 UTC

Re: Spark 1.2 adds streaming k-means

That isn't streaming k-means in the Mahout sense.  What they have done is
implement a very basic sort of exponential smoothing to the normal k-means
algorithm so that only recent points contribute significantly to centroid
location.  This assumes an initial high quality cluster and probably also
depends on small changes in the underlying data distribution.  It doesn't
solve the multi-start problem in high dimensions.

The Mahout algorithm is a bit different.  The idea is that you want to do a
single pass high quality clustering of a lot of data.  This is hard to do
with traditional k-means, both because k-means normally requires multiple
passes through the data to get good centroids and also because multiple
restarts are required to get good results.  A streaming solution should
also be able to give you an accurate clustering at any point in time with
roughly unit-ish cost.  All these problems are solved with the Mahout
solution.  The current problems with the Mahout solution have to do with
the fact that the map-reduce solution has poor scaling properties due to
the non-trivial size of the cluster sketches.

On Thu, Jan 29, 2015 at 7:24 AM, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:

> Seems they started to play with streaming algorithms also in Spark and
> MLlib.
>
> https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
>
> I wonder how much the mini-batch programming model they have fits
> traditional streaming algorithms.
> Also, I guess the concept of state across the stream does not fit very well
> the abstraction of RDDs.
>
> Interesting to read nevertheless.
>
> Cheers,
> --
> Gianmarco
>

Re: Spark 1.2 adds streaming k-means

Posted by Albert Bifet <ab...@waikato.ac.nz>.

There is a lot of work in stream clustering from two communities, the
streaming algorithmic community, and the machine learning/data stream
mining community.

There is a nice survey of these new methods in:

https://www.researchgate.net/publication/257132178_Data_Stream_Clustering_A_Survey

http://dl.acm.org/citation.cfm?id=2522981

In Apache SAMOA, CluStream is implemented as one of the first
state-of-the-art methods:

http://www.vldb.org/conf/2003/papers/S04P02.pdf

and it could be nice to have more methods implemented, as for example the
ones implemented in MOA.


On Mon, Feb 2, 2015 at 7:07 AM, Ted Dunning <td...@maprtech.com> wrote:

> That isn't streaming k-means in the Mahout sense.  What they have done is
> implement a very basic sort of exponential smoothing to the normal k-means
> algorithm so that only recent points contribute significantly to centroid
> location.  This assumes an initial high quality cluster and probably also
> depends on small changes in the underlying data distribution.  It doesn't
> solve the multi-start problem in high dimensions.
>
> The Mahout algorithm is a bit different.  The idea is that you want to do a
> single pass high quality clustering of a lot of data.  This is hard to do
> with traditional k-means, both because k-means normally requires multiple
> passes through the data to get good centroids and also because multiple
> restarts are required to get good results.  A streaming solution should
> also be able to give you an accurate clustering at any point in time with
> roughly unit-ish cost.  All these problems are solved with the Mahout
> solution.  The current problems with the Mahout solution have to do with
> the fact that the map-reduce solution has poor scaling properties due to
> the non-trivial size of the cluster sketches.
>
>
>
>
>
> On Thu, Jan 29, 2015 at 7:24 AM, Gianmarco De Francisci Morales <
> gdfm@apache.org> wrote:
>
> > Seems they started to play with streaming algorithms also in Spark and
> > MLlib.
> >
> >
> https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
> >
> > I wonder how much the mini-batch programming model they have fits
> > traditional streaming algorithms.
> > Also, I guess the concept of state across the stream does not fit very
> well
> > the abstraction of RDDs.
> >
> > Interesting to read nevertheless.
> >
> > Cheers,
> > --
> > Gianmarco
> >
>