Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2014/04/02 09:03:19 UTC

KMeans|| opinions

Considering porting implementation [1] and paper for KMeans || for
Bindings.

This seems like another method to map fairly nicely.

The problem I am contemplating is ||-initialization, and in particular,
centroid storage. That particular implementation assumes centroids can be
kept in memory on the front end.

(1) Question is, is it a dangerous idea? It doesn't seem like it
particularly is, since it is unlikely people would want k > 1e+6. Another
thing, centers seem to be passed in via a closure attribute (i.e. a
java-serialized array-backed matrix). However, with Bindings it is quite
possible to keep the centers at the back as a matrix.

(2) Obviously, Lloyd iterations are not terribly accurate. The || and ++
versions mostly speed things up. Is there any better-than-Lloyd accuracy
preference?


[1]
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
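
As a rough back-of-the-envelope check on (1), assuming dense
double-precision storage and dimensionality d, the in-memory footprint of
the centroid matrix is about

  8 \cdot k \cdot d  bytes,

so even k = 1e+6 with d = 100 comes to roughly 8e+8 bytes (~800 MB) --
heavy for a front-end process but not impossible, and anything like
k <= 1e+5 is trivially front-end resident.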

Re: KMeans|| opinions

Posted by Pat Ferrel <pa...@occamsmachete.com>.
+1 for any Spark Kmeans with DRM as input, next on my wishlist.
 
On Apr 2, 2014, at 12:16 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

|| initialization seems promising along with perhaps ball k-means iterations


On Wed, Apr 2, 2014 at 12:03 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Considering porting implementation [1] and paper for KMeans || for
> Bindings.
> 
> This seems like another method to map fairly nicely.
> 
> The problem I am contemplating is ||-initialization, and in particular,
> centroid storage. That particular implementation assumes centroids could be
> kept in memory in front.
> 
> (1) Question is, is it a dangerous idea. It doesn't seem like it
> particularly is, since unlikely people would want more k>1e+6. Another
> thing, centers seem to be passed in via closure attribute (i.e.
> java-serialized array-backed matrix).However, with Bindings it is quite
> possible to keep centers at the back as a matrix.
> 
> (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> versions mostly speed things up. Is there any better-than-LLoyd accuracy
> preference?
> 
> 
> [1]
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> 
> 
> 


Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
|| initialization seems promising along with perhaps ball k-means iterations


On Wed, Apr 2, 2014 at 12:03 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Considering porting implementation [1] and paper for KMeans || for
> Bindings.
>
> This seems like another method to map fairly nicely.
>
> The problem I am contemplating is ||-initialization, and in particular,
> centroid storage. That particular implementation assumes centroids could be
> kept in memory in front.
>
> (1) Question is, is it a dangerous idea. It doesn't seem like it
> particularly is, since unlikely people would want more k>1e+6. Another
> thing, centers seem to be passed in via closure attribute (i.e.
> java-serialized array-backed matrix).However, with Bindings it is quite
> possible to keep centers at the back as a matrix.
>
> (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> versions mostly speed things up. Is there any better-than-LLoyd accuracy
> preference?
>
>
> [1]
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
>
>
>

Re: KMeans|| opinions

Posted by Maciej Mazur <ma...@gmail.com>.
Yes, I looked at this one. The picture shows the general idea, but at a
very high level.
I'm rather interested in the implementation.
The initial clusters are modified in ClusterIterator.iterateMR.
CIMapper setup - does it load a regular HDFS file, or is it the distributed
cache, and why? It looks as if it's a regular file. Isn't it a huge
overhead to load this file at setup? What's a reasonable size for this file?
Why is everything written during the cleanup call? Is it used instead of a
combiner?
General idea - during each iteration the output of the previous iteration
(reduce) is loaded at setup, then the model is updated and propagated
(cleanup) to the reducers. What happens if one cluster is very large, say
~70% of all elements in the data set? Does cleanup help with that?
Stop condition - isConverged - does it compare the outputs (2 files) from
the last two iterations, or is that encapsulated in the Cluster class?



On Sun, Apr 13, 2014 at 4:32 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Did you check the website at https://mahout.apache.org/
> users/clustering/k-means-clustering.html ?
>
>
> On 04/13/2014 02:53 PM, Maciej Mazur wrote:
>
>> Recently I've been looking into K-means implementation.
>> I want to understand how it works, and why it was designed this way.
>> Could you give me some overview?
>> I see that during the setup clusters are read from the file. Is it a
>> distributed cache?  What's the maxmial size of this file, what's the
>> maximum value of k?
>> There is nothing outputed during the call of map function, everything is
>> saved at cleanup. Why?
>> Are there any docs concerning implementation?
>>
>> Thanks,
>> Maciej
>>
>>
>> On Wed, Apr 9, 2014 at 7:23 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>
>>> Well, you could view this as a performance bug in the implementation of
>>> the linear algebra.
>>>
>>> It certainly is, however, an odd interpretation of transpose.  I have
>>> used
>>> a similar trick in r to use sparse matrices as a counter but it always
>>> worried me a bit.
>>>
>>> Sent from my iPhone
>>>
>>>  On Apr 8, 2014, at 17:49, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>
>>>> Problem is, I want to use linear algebra to handle that, not
>>>> combine().
>>>>
>>>
>>>
>>
>

Re: KMeans|| opinions

Posted by Sebastian Schelter <ss...@apache.org>.
Did you check the website at 
https://mahout.apache.org/users/clustering/k-means-clustering.html ?

On 04/13/2014 02:53 PM, Maciej Mazur wrote:
> Recently I've been looking into K-means implementation.
> I want to understand how it works, and why it was designed this way.
> Could you give me some overview?
> I see that during the setup clusters are read from the file. Is it a
> distributed cache?  What's the maxmial size of this file, what's the
> maximum value of k?
> There is nothing outputed during the call of map function, everything is
> saved at cleanup. Why?
> Are there any docs concerning implementation?
>
> Thanks,
> Maciej
>
>
> On Wed, Apr 9, 2014 at 7:23 AM, Ted Dunning <te...@gmail.com> wrote:
>
>>
>> Well, you could view this as a performance bug in the implementation of
>> the linear algebra.
>>
>> It certainly is, however, an odd interpretation of transpose.  I have used
>> a similar trick in r to use sparse matrices as a counter but it always
>> worried me a bit.
>>
>> Sent from my iPhone
>>
>>> On Apr 8, 2014, at 17:49, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> Problem is, I want to use linear algebra to handle that, not
>>> combine().
>>
>


Re: KMeans|| opinions

Posted by Maciej Mazur <ma...@gmail.com>.
Recently I've been looking into the K-means implementation.
I want to understand how it works, and why it was designed this way.
Could you give me some overview?
I see that during setup the clusters are read from a file. Is it a
distributed cache?  What's the maximal size of this file, and what's the
maximum value of k?
Nothing is output during the map calls; everything is saved at cleanup. Why?
Are there any docs concerning the implementation?

Thanks,
Maciej


On Wed, Apr 9, 2014 at 7:23 AM, Ted Dunning <te...@gmail.com> wrote:

>
> Well, you could view this as a performance bug in the implementation of
> the linear algebra.
>
> It certainly is, however, an odd interpretation of transpose.  I have used
> a similar trick in r to use sparse matrices as a counter but it always
> worried me a bit.
>
> Sent from my iPhone
>
> > On Apr 8, 2014, at 17:49, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > Problem is, I want to use linear algebra to handle that, not
> > combine().
>

Re: KMeans|| opinions

Posted by Ted Dunning <te...@gmail.com>.
Well, you could view this as a performance bug in the implementation of the linear algebra. 

It certainly is, however, an odd interpretation of transpose.  I have used a similar trick in R to use sparse matrices as a counter, but it always worried me a bit. 

Sent from my iPhone

> On Apr 8, 2014, at 17:49, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> Problem is, I want to use linear algebra to handle that, not
> combine().

Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
It should be, but they use groupByKey(). The output of that operator implies
a full group with no intermediate product (i.e. a [K, Array[T]] type for
groups). I don't think it implies partial groups, since something like a
group count would not work in that case: there is no way to reduce combiner
outputs under those semantics.  *I think* if they wanted to use combiners,
they would have to use the combine() api with a clear specification of what
the output of a combiner may be.

But your point is well taken. One wouldn't need to use 1000 reducers to
meaningfully overcome the skew.  Even if there is a 1:1000 absolute count
skew after combiners, that still doesn't really qualify as much of a
reducer skew, and A.t would eliminate even that unevenness, which is not a
big deal. The problem is, I want to use linear algebra to handle that, not
combine(). Yeah, I suppose there are cases that fit linalg better and ones
that fit it less. And then there are cases for which we haven't abstracted
all the necessary toolset yet.



On Tue, Apr 8, 2014 at 3:24 PM, Ted Dunning <te...@gmail.com> wrote:

> On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > I suspect mllib code would suffer from non-determinsitic parallelism
> > behavior in its Lloyd iteration (as well as as memory overflow risk) in
> > certain corner case situations such as there are a lot of datapoints but
> > very few clusters sought. Spark (for the right reasons) doesn't believe
> in
> > sort-n-spill stuff which means centroid recomputation may suffer from
> > decreased or assymmetric parallelism after groupByKey call, especially if
> > cluster attribution counts end up heavily skewed for whatever reason.
> >
> > E.g. if you have 1 bln points and two centroids, it's my understanding
> that
> > post-attribution centroid reduction will create only two tasks each
> > processing no less than 500 mln attributed points, even if cluster
> capacity
> > quite adequately offers 500 tasks for this job.
> >
>
> This should be susceptible to combiner-like pre-aggregation.  That should
> give essentially perfect parallelism.
>

Re: KMeans|| opinions

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I suspect mllib code would suffer from non-determinsitic parallelism
> behavior in its Lloyd iteration (as well as as memory overflow risk) in
> certain corner case situations such as there are a lot of datapoints but
> very few clusters sought. Spark (for the right reasons) doesn't believe in
> sort-n-spill stuff which means centroid recomputation may suffer from
> decreased or assymmetric parallelism after groupByKey call, especially if
> cluster attribution counts end up heavily skewed for whatever reason.
>
> E.g. if you have 1 bln points and two centroids, it's my understanding that
> post-attribution centroid reduction will create only two tasks each
> processing no less than 500 mln attributed points, even if cluster capacity
> quite adequately offers 500 tasks for this job.
>

This should be susceptible to combiner-like pre-aggregation.  That should
give essentially perfect parallelism.
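
For illustration, a minimal sketch of what such combiner-like
pre-aggregation could look like in plain Spark. The points RDD, the centers
value and the closestCentroid helper are hypothetical names, and a vector
type with + and / defined (e.g. the mahout math-scala vectors) is assumed;
reduceByKey is the operator that does the map-side combining:

  // Hypothetical sketch: per-partition partial (vector sum, count) pairs are
  // merged map-side by reduceByKey before the shuffle, so two clusters over a
  // billion points no longer imply two giant reduce-side tasks.
  val newCenters = points
    .map { p =>
      val c = closestCentroid(p, centers)   // hypothetical assignment helper
      c -> (p, 1L)                          // seed of the (vector sum, count) pair
    }
    .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
    .mapValues { case (sum, n) => sum / n.toDouble }
    .collectAsMap()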

Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I only looked at dp-means. Yes, I guess I mean dp-means with no further
adaptations. Too little time, sorry.


On Tue, Apr 8, 2014 at 3:25 PM, Ted Dunning <te...@gmail.com> wrote:

> On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Regarding dpmeans: unlike ||,  the default code for dp-means does not
> > assume parallelization. Is there an adaptation to a shared-nothing
> > computation of that?
> >
>
> Do you mean real dp-means, or do you mean streaming k-means adaptation of
> dp-means?
>

Re: KMeans|| opinions

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Apr 8, 2014 at 8:17 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Regarding dpmeans: unlike ||,  the default code for dp-means does not
> assume parallelization. Is there an adaptation to a shared-nothing
> computation of that?
>

Do you mean real dp-means, or do you mean streaming k-means adaptation of
dp-means?

Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I suspect the mllib code would suffer from non-deterministic parallelism
behavior in its Lloyd iteration (as well as a memory overflow risk) in
certain corner-case situations, such as when there are a lot of data points
but very few clusters sought. Spark (for the right reasons) doesn't believe
in sort-n-spill stuff, which means centroid recomputation may suffer from
decreased or asymmetric parallelism after the groupByKey call, especially
if cluster attribution counts end up heavily skewed for whatever reason.

E.g. if you have 1 bln points and two centroids, it's my understanding that
post-attribution centroid reduction will create only two tasks, each
processing no less than 500 mln attributed points, even if cluster capacity
quite adequately offers 500 tasks for this job.


I have a sketch of a Lloyd iteration in Spark Bindings that computes the
centroid matrix in the backend in a way that is free of the aforementioned
limitation. The trick is that transposition is in fact an aggregate
operation as well when the vector keys contain duplicates. So all it needs
to do is re-key the points according to their attribution, and then perform
a transpose. Point counts are kept in the first position of the attributed
vectors, so they aggregate correctly along with the vector sums. [1]

Despite that improvement, my implementation still suffers from a failure to
scale with k (i.e. the mapBlock closure running time will be O(k) instead
of the ideally wanted O(1)).


Regarding dp-means: unlike ||, the default code for dp-means does not
assume parallelization. Is there an adaptation of it to a shared-nothing
computation?

[1]
  /**
   * One Lloyd iteration sketched on the Spark Bindings.
   *
   * @param drmData data row matrix
   * @param inCoreC matrix with columns representing current cluster centers
   * @param dfunc   distance function
   * @return updated centroid matrix
   */
  private def lloydIteration[K: ClassTag](drmData: DrmLike[K], inCoreC: Matrix,
      dfunc: DistFunc = eucl): DrmLike[Int] = {

    // TODO: this is still tightly coupled to Spark
    implicit val sc = drmData.rdd.sparkContext

    val n = drmData.ncol
    val numCentroids = inCoreC.ncol
    val bcastC = drmBroadcast(inCoreC)

    val drmCentroids = drmData

        // Output the assigned centroid index as the key for each row point.
        .mapBlock(ncol = n + 1) {

      case (keys, block) =>

        val assignedCentroids = for (r <- 0 until block.nrow) yield {
          val row = block(r, ::)
          val inCoreC = bcastC: Matrix

          // Find the closest centroid -- TODO: this needs the triangle
          // inequality speedup in case of euclidean distance.
          (0 until numCentroids).view
              .map(ci => dfunc(row, inCoreC(::, ci)))
              .minValueIndex()
        }

        // Grow the block with a column of 1.0 on the left -- this accumulates
        // per-centroid counts, the denominators of the new centroids.
        val newBlock = block.like(block.nrow, n + 1)
        newBlock(::, 0) := 1.0
        newBlock(::, 1 to n) := block

        // Re-key the rows by their assigned centroid index.
        assignedCentroids.toArray -> newBlock
    }

        // Transposition aggregates (sums) the re-keyed vectors by centroid key.
        .t(::, 0 until numCentroids)

        .checkpoint()

    // Collect centroid sizes (the counter row) into a vector and re-broadcast.
    val vCSizes = drmBroadcast(drmCentroids(0 to 0, ::).collect(0, ::))

    // Finalize the centroid matrix. First, slice off the counter row.
    drmCentroids(1 to n, ::)

        // Then divide each centroid sum by its count, thus getting the center.
        .mapBlock() {
      case (keys, block) =>
        for (col <- 0 until numCentroids) block(::, col) /= vCSizes.value(col)
        keys -> block
    }

  }
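
For context, a minimal driver-loop sketch of how an iteration like the
above could be used. The kmeans wrapper, the fixed maxIterations count and
the collect of the small centroid DRM back in front are illustrative
assumptions, not part of the sketch above:

  // Hypothetical driver: run the Lloyd iteration a fixed number of times,
  // pulling the (small) updated centroid matrix back to the front end each pass.
  def kmeans[K: ClassTag](drmData: DrmLike[K], inCoreC0: Matrix,
      maxIterations: Int = 10): Matrix = {
    var inCoreC = inCoreC0
    for (i <- 0 until maxIterations)
      inCoreC = lloydIteration(drmData, inCoreC).collect
    inCoreC
  }

A real driver would of course replace the fixed count with a convergence
test on how far the centers moved between passes.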




On Wed, Apr 2, 2014 at 12:03 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Considering porting implementation [1] and paper for KMeans || for
> Bindings.
>
> This seems like another method to map fairly nicely.
>
> The problem I am contemplating is ||-initialization, and in particular,
> centroid storage. That particular implementation assumes centroids could be
> kept in memory in front.
>
> (1) Question is, is it a dangerous idea. It doesn't seem like it
> particularly is, since unlikely people would want more k>1e+6. Another
> thing, centers seem to be passed in via closure attribute (i.e.
> java-serialized array-backed matrix).However, with Bindings it is quite
> possible to keep centers at the back as a matrix.
>
> (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> versions mostly speed things up. Is there any better-than-LLoyd accuracy
> preference?
>
>
> [1]
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
>
>
>

Re: KMeans|| opinions

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Apr 4, 2014 at 5:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning <te...@gmail.com> wrote:
>
> > Pipe-pipe is an initialization algorithm which (roughly) parallelizes
> > something like the ++ algorithm and requires logarithmically many
> parallel
> > passes over the data as opposed to the k passes required for ++.
> >
>
> i'll re-read it again. The way i understood the paper (and the code),
> pipe-pipe  actually uniformly subsamples data in each pass to bring in a
> limited number of points in each pass which is O(k) not O(n). This is not
> the same as going over each point computing distance-based probability
> metrics for them, if that's what you imply by going for ++ comparison.
>

I don't understand your point here.  My original point was that || makes a
logarithmic number (log N or log k, I don't know which) of passes over the
data (direct quote from the paper) while ++ makes O(k) passes over the
data.  Such passes are often I/O bound.

Each ++ pass is made with the goal of selecting a new initial point.  This
is a sampling procedure based on all the previous initial points.  This
sounds very much like || except for the number of passes.

So my point is O(log mumble) versus O(k).  You say that pipe-pipe samples
data.  You say that is different from ++ sampling data.

Is it the uniform versus non-uniform sampling that you are talking about?
That hardly seems important compared to the difference in the number of
passes.




> DP-means would have to do subsampling as well.
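
For reference, the sampling rules as I read them in the two papers (an
interpretation, so take it with a grain of salt): k-means++ makes one pass
per new center and in each pass picks a single point x with probability

  p(x) = d^2(x, C) / \sum_{x'} d^2(x', C),

i.e. k passes total, while k-means|| makes O(log \psi) passes (\psi being
the cost of the initial center) and in each pass independently adds every
point x with probability

  p(x) = min(1, \ell \cdot d^2(x, C) / \phi_X(C)),

with an oversampling factor \ell on the order of k, after which the
O(\ell log \psi) sampled candidates are weighted and re-clustered down to k.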

Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Apr 4, 2014 at 8:44 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
>
> On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> Pipe-pipe is an initialization algorithm which (roughly) parallelizes
>> something like the ++ algorithm and requires logarithmically many parallel
>> passes over the data as opposed to the k passes required for ++.
>>
>
> i'll re-read it again. The way i understood the paper (and the code),
> pipe-pipe  actually uniformly subsamples data in each pass to bring in a
> limited number of points in each pass which is O(k) not O(n). This is not
> the same as going over each point computing distance-based probability
> metrics for them, if that's what you imply by going for ++ comparison.
> DP-means would have to do subsampling as well.
>
Retracting that. They do full passes over the data with full point-cost
recomputation each time.

>
>
>> DP-means itself is a variant of k-means where the value of k is not fixed.
>>  Instead, whenever a data-point is further than a threshold \lambda from
>> the nearest centroid, it is used to form a new centroid.  Different
>> choices
>> of \lambda result in different numbers of clusters.  I don't know how well
>> DP-means by itself would work as an initialization pass.
>>
>> Streaming k-means uses a variant of dp-means to get a sketch of the data.
>>  This sketch has many more centroids than the final desired number of
>> clusters k.  As such, we don't have to worry much about the quality of the
>> initialization of this first dp-means pass.  The sketch is then clustered
>> using something like k-means++ and ball k-means iteration.
>>
>> As I see it, having a good streaming k-means block should make any
>> clustering algorithm better as long as the sketch is much smaller than the
>> original data.
>>
>> The second phase of clustering after streaming k-means can benefit from
>> any
>> technique that makes clustering better.  The caveat is just that streaming
>> k-means is likely to produce a small enough sketch that parallel
>> clustering
>> may not help very much.  Exactly where that trade-off applies depends a
>> lot
>> on the parallel framework available. Spark and h2o have much lower task
>> initiation costs so they will definitely be able to get parallel speedup
>> on
>> much smaller datasets.
>>
>>
>>
>>
>> On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>
>> > is there any opinion whether one of pipe-pipe (||), dp-means
>> > initializations has bona-fide advantages  one over another?
>> >
>> >
>> > On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <dl...@gmail.com>
>> > > wrote:
>> > >
>> > > > On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <ted.dunning@gmail.com
>> >
>> > > > wrote:
>> > > >
>> > > > > - Have you considered sketch based algorithms?
>> > > > >
>> > > >
>> > > > can you give me a reference. at this point i am just  contemplating
>> > more
>> > > or
>> > > > less shameless port of what they've done in mllib.
>> > > >
>> > >
>> > > Here is the reference I used:
>> > >
>> > >
>> > >
>> >
>> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
>> > >
>> > > *A quick summary:*
>> > >
>> > > - a single pass over the data with a sort of dp-means [1] algorithm
>> > using a
>> > > very fast approximate centroid search can give you k log N centroids.
>> >  This
>> > > is the sketch.
>> > >
>> > > - clustering these centroids as weighted values gives a provably
>> probably
>> > > good clustering of the original data for well clusterable data.  For
>> real
>> > > data, it seems to work exactly as advertised.
>> > >
>> > > - clustering the sketch using ball k-means with clever initialization
>> is
>> > > important.  Good initialization is very expensive in terms of number
>> of
>> > > points so using a sketch is a really good idea.
>> > >
>> > > - the proofs all depend on Euclidean distance.
>> > >
>> > > - the threshold can start very small (what small means is determined
>> > > empirically).  Each time we wind up with too many centroids,
>> recursively
>> > > clustering the centroids and possibly increasing the threshold will
>> keep
>> > > the number reasonably bounded.
>> > >
>> > > *Some notes from practical experience:*
>> > >
>> > > - moderate levels of parallelism are easy since sketches can be merged
>> > > using set union.  You may want to recursively cluster the centroids at
>> > this
>> > > point if you have too many.  This is very nice application of
>> map-reduce.
>> > >
>> > > - high degrees of parallelism require multiple levels of
>> merge/collapse
>> > > since otherwise you wind up with a sketch nearly as large as the
>> original
>> > > data.  If you have m parallel clusterings then m k log (N/m) can be
>> > large.
>> > >  Take a billion points and m = 1000 and k = 10,000.  The size of the
>> > > parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m =
>> > 100, k
>> > > = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
>> > >
>> > > - ball k-means on highly clusterable data often uses a cutoff for
>> > centroid
>> > > computation of 0.5 x distance to nearest.  I find that with real data
>> > that
>> > > 1 x distance or even larger to nearest is much better.  The good
>> effects
>> > > are still mostly there, but you don't need wonderful data to succeed
>> > >
>> > > - the ball k-means algorithm that we have in Mahout is pretty high
>> > quality,
>> > > but could use a bit of triangle (Elkan [2] ) speedups.
>> > >
>> > > *References*
>> > >
>> > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>> > > [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
>> > >
>> > >
>> > > >
>> > > > >
>> > > > > - It can be important to use optimizations in the search for
>> nearest
>> > > > > centroid.  Consider triangle optimizations.
>> > > > >
>> > > > > - Do you mean "parallel" when you type || or is there another
>> meaning
>> > > > > there?
>> > > > >
>> > > >
>> > > > No, i mean method called "kmeans||". It's unfortunate name since I
>> > really
>> > > > don't know how to make google to search for it.
>> > > >
>> > > > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
>> > > >
>> > > > >
>> > > > > - When you say ++ initialization, many people get this wrong and
>> > assume
>> > > > > that you mean pick the furthest point.  Getting really good
>> > > > initialization
>> > > > > is fairly difficult and typically requires more time than the
>> actual
>> > > > > clustering.  This is one of the key benefits of sketch based
>> methods.
>> > > > >
>> > > > > - Most algorithms require multiple restarts.  At higher dimension
>> the
>> > > > > number of restarts required becomes very large.  An ideal
>> > > implementation
>> > > > > does parallel sketch extraction followed by parallel ball k-means
>> for
>> > > > > restarts.
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <
>> dlieu.7@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Considering porting implementation [1] and paper for KMeans ||
>> for
>> > > > > > Bindings.
>> > > > > >
>> > > > > > This seems like another method to map fairly nicely.
>> > > > > >
>> > > > > > The problem I am contemplating is ||-initialization, and in
>> > > particular,
>> > > > > > centroid storage. That particular implementation assumes
>> centroids
>> > > > could
>> > > > > be
>> > > > > > kept in memory in front.
>> > > > > >
>> > > > > > (1) Question is, is it a dangerous idea. It doesn't seem like it
>> > > > > > particularly is, since unlikely people would want more k>1e+6.
>> > > Another
>> > > > > > thing, centers seem to be passed in via closure attribute (i.e.
>> > > > > > java-serialized array-backed matrix).However, with Bindings it
>> is
>> > > quite
>> > > > > > possible to keep centers at the back as a matrix.
>> > > > > >
>> > > > > > (2) obviously, LLoyd iterations are not terribly accurate. ||
>> and
>> > ++
>> > > > > > versions mostly speed things up. Is there any better-than-LLoyd
>> > > > accuracy
>> > > > > > preference?
>> > > > > >
>> > > > > >
>> > > > > > [1]
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning <te...@gmail.com> wrote:

> Pipe-pipe is an initialization algorithm which (roughly) parallelizes
> something like the ++ algorithm and requires logarithmically many parallel
> passes over the data as opposed to the k passes required for ++.
>

I'll re-read it again. The way I understood the paper (and the code),
pipe-pipe actually uniformly subsamples the data in each pass to bring in a
limited number of points in each pass, which is O(k), not O(n). This is not
the same as going over each point computing distance-based probability
metrics for them, if that's what you imply by the ++ comparison.
DP-means would have to do subsampling as well.


> DP-means itself is a variant of k-means where the value of k is not fixed.
>  Instead, whenever a data-point is further than a threshold \lambda from
> the nearest centroid, it is used to form a new centroid.  Different choices
> of \lambda result in different numbers of clusters.  I don't know how well
> DP-means by itself would work as an initialization pass.
>
> Streaming k-means uses a variant of dp-means to get a sketch of the data.
>  This sketch has many more centroids than the final desired number of
> clusters k.  As such, we don't have to worry much about the quality of the
> initialization of this first dp-means pass.  The sketch is then clustered
> using something like k-means++ and ball k-means iteration.
>
> As I see it, having a good streaming k-means block should make any
> clustering algorithm better as long as the sketch is much smaller than the
> original data.
>
> The second phase of clustering after streaming k-means can benefit from any
> technique that makes clustering better.  The caveat is just that streaming
> k-means is likely to produce a small enough sketch that parallel clustering
> may not help very much.  Exactly where that trade-off applies depends a lot
> on the parallel framework available. Spark and h2o have much lower task
> initiation costs so they will definitely be able to get parallel speedup on
> much smaller datasets.
>
>
>
>
> On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > is there any opinion whether one of pipe-pipe (||), dp-means
> > initializations has bona-fide advantages  one over another?
> >
> >
> > On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > - Have you considered sketch based algorithms?
> > > > >
> > > >
> > > > can you give me a reference. at this point i am just  contemplating
> > more
> > > or
> > > > less shameless port of what they've done in mllib.
> > > >
> > >
> > > Here is the reference I used:
> > >
> > >
> > >
> >
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
> > >
> > > *A quick summary:*
> > >
> > > - a single pass over the data with a sort of dp-means [1] algorithm
> > using a
> > > very fast approximate centroid search can give you k log N centroids.
> >  This
> > > is the sketch.
> > >
> > > - clustering these centroids as weighted values gives a provably
> probably
> > > good clustering of the original data for well clusterable data.  For
> real
> > > data, it seems to work exactly as advertised.
> > >
> > > - clustering the sketch using ball k-means with clever initialization
> is
> > > important.  Good initialization is very expensive in terms of number of
> > > points so using a sketch is a really good idea.
> > >
> > > - the proofs all depend on Euclidean distance.
> > >
> > > - the threshold can start very small (what small means is determined
> > > empirically).  Each time we wind up with too many centroids,
> recursively
> > > clustering the centroids and possibly increasing the threshold will
> keep
> > > the number reasonably bounded.
> > >
> > > *Some notes from practical experience:*
> > >
> > > - moderate levels of parallelism are easy since sketches can be merged
> > > using set union.  You may want to recursively cluster the centroids at
> > this
> > > point if you have too many.  This is very nice application of
> map-reduce.
> > >
> > > - high degrees of parallelism require multiple levels of merge/collapse
> > > since otherwise you wind up with a sketch nearly as large as the
> original
> > > data.  If you have m parallel clusterings then m k log (N/m) can be
> > large.
> > >  Take a billion points and m = 1000 and k = 10,000.  The size of the
> > > parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m =
> > 100, k
> > > = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
> > >
> > > - ball k-means on highly clusterable data often uses a cutoff for
> > centroid
> > > computation of 0.5 x distance to nearest.  I find that with real data
> > that
> > > 1 x distance or even larger to nearest is much better.  The good
> effects
> > > are still mostly there, but you don't need wonderful data to succeed
> > >
> > > - the ball k-means algorithm that we have in Mahout is pretty high
> > quality,
> > > but could use a bit of triangle (Elkan [2] ) speedups.
> > >
> > > *References*
> > >
> > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > > [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
> > >
> > >
> > > >
> > > > >
> > > > > - It can be important to use optimizations in the search for
> nearest
> > > > > centroid.  Consider triangle optimizations.
> > > > >
> > > > > - Do you mean "parallel" when you type || or is there another
> meaning
> > > > > there?
> > > > >
> > > >
> > > > No, i mean method called "kmeans||". It's unfortunate name since I
> > really
> > > > don't know how to make google to search for it.
> > > >
> > > > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> > > >
> > > > >
> > > > > - When you say ++ initialization, many people get this wrong and
> > assume
> > > > > that you mean pick the furthest point.  Getting really good
> > > > initialization
> > > > > is fairly difficult and typically requires more time than the
> actual
> > > > > clustering.  This is one of the key benefits of sketch based
> methods.
> > > > >
> > > > > - Most algorithms require multiple restarts.  At higher dimension
> the
> > > > > number of restarts required becomes very large.  An ideal
> > > implementation
> > > > > does parallel sketch extraction followed by parallel ball k-means
> for
> > > > > restarts.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Considering porting implementation [1] and paper for KMeans ||
> for
> > > > > > Bindings.
> > > > > >
> > > > > > This seems like another method to map fairly nicely.
> > > > > >
> > > > > > The problem I am contemplating is ||-initialization, and in
> > > particular,
> > > > > > centroid storage. That particular implementation assumes
> centroids
> > > > could
> > > > > be
> > > > > > kept in memory in front.
> > > > > >
> > > > > > (1) Question is, is it a dangerous idea. It doesn't seem like it
> > > > > > particularly is, since unlikely people would want more k>1e+6.
> > > Another
> > > > > > thing, centers seem to be passed in via closure attribute (i.e.
> > > > > > java-serialized array-backed matrix).However, with Bindings it is
> > > quite
> > > > > > possible to keep centers at the back as a matrix.
> > > > > >
> > > > > > (2) obviously, LLoyd iterations are not terribly accurate. || and
> > ++
> > > > > > versions mostly speed things up. Is there any better-than-LLoyd
> > > > accuracy
> > > > > > preference?
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: KMeans|| opinions

Posted by Ted Dunning <te...@gmail.com>.
Pipe-pipe is an initialization algorithm which (roughly) parallelizes
something like the ++ algorithm and requires logarithmically many parallel
passes over the data as opposed to the k passes required for ++.

DP-means itself is a variant of k-means where the value of k is not fixed.
 Instead, whenever a data-point is further than a threshold \lambda from
the nearest centroid, it is used to form a new centroid.  Different choices
of \lambda result in different numbers of clusters.  I don't know how well
DP-means by itself would work as an initialization pass.

Streaming k-means uses a variant of dp-means to get a sketch of the data.
 This sketch has many more centroids than the final desired number of
clusters k.  As such, we don't have to worry much about the quality of the
initialization of this first dp-means pass.  The sketch is then clustered
using something like k-means++ and ball k-means iteration.

As I see it, having a good streaming k-means block should make any
clustering algorithm better as long as the sketch is much smaller than the
original data.

The second phase of clustering after streaming k-means can benefit from any
technique that makes clustering better.  The caveat is just that streaming
k-means is likely to produce a small enough sketch that parallel clustering
may not help very much.  Exactly where that trade-off applies depends a lot
on the parallel framework available. Spark and h2o have much lower task
initiation costs so they will definitely be able to get parallel speedup on
much smaller datasets.
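
For illustration, a minimal sequential sketch of the dp-means rule
described above (the dpMeansPass name, the point type P and the dist
function are hypothetical; this is not the streaming k-means code in
Mahout):

  // Hypothetical dp-means sketch: a point further than lambda from every
  // existing centroid becomes a new centroid itself, so k grows with the data.
  def dpMeansPass[P](points: Seq[P], lambda: Double,
      dist: (P, P) => Double): Seq[P] = {
    val centroids = scala.collection.mutable.ArrayBuffer[P]()
    for (p <- points) {
      if (centroids.isEmpty || centroids.map(dist(p, _)).min > lambda)
        centroids += p      // spawn a new centroid
      // else: p is assigned to its nearest centroid (centroid update elided)
    }
    centroids
  }

Different choices of lambda simply trade off how many centroids come out,
which is why the threshold matters so much when using it as an initializer.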




On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> is there any opinion whether one of pipe-pipe (||), dp-means
> initializations has bona-fide advantages  one over another?
>
>
> On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > - Have you considered sketch based algorithms?
> > > >
> > >
> > > can you give me a reference. at this point i am just  contemplating
> more
> > or
> > > less shameless port of what they've done in mllib.
> > >
> >
> > Here is the reference I used:
> >
> >
> >
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
> >
> > *A quick summary:*
> >
> > - a single pass over the data with a sort of dp-means [1] algorithm
> using a
> > very fast approximate centroid search can give you k log N centroids.
>  This
> > is the sketch.
> >
> > - clustering these centroids as weighted values gives a provably probably
> > good clustering of the original data for well clusterable data.  For real
> > data, it seems to work exactly as advertised.
> >
> > - clustering the sketch using ball k-means with clever initialization is
> > important.  Good initialization is very expensive in terms of number of
> > points so using a sketch is a really good idea.
> >
> > - the proofs all depend on Euclidean distance.
> >
> > - the threshold can start very small (what small means is determined
> > empirically).  Each time we wind up with too many centroids, recursively
> > clustering the centroids and possibly increasing the threshold will keep
> > the number reasonably bounded.
> >
> > *Some notes from practical experience:*
> >
> > - moderate levels of parallelism are easy since sketches can be merged
> > using set union.  You may want to recursively cluster the centroids at
> this
> > point if you have too many.  This is very nice application of map-reduce.
> >
> > - high degrees of parallelism require multiple levels of merge/collapse
> > since otherwise you wind up with a sketch nearly as large as the original
> > data.  If you have m parallel clusterings then m k log (N/m) can be
> large.
> >  Take a billion points and m = 1000 and k = 10,000.  The size of the
> > parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m =
> 100, k
> > = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
> >
> > - ball k-means on highly clusterable data often uses a cutoff for
> centroid
> > computation of 0.5 x distance to nearest.  I find that with real data
> that
> > 1 x distance or even larger to nearest is much better.  The good effects
> > are still mostly there, but you don't need wonderful data to succeed
> >
> > - the ball k-means algorithm that we have in Mahout is pretty high
> quality,
> > but could use a bit of triangle (Elkan [2] ) speedups.
> >
> > *References*
> >
> > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
> >
> >
> > >
> > > >
> > > > - It can be important to use optimizations in the search for nearest
> > > > centroid.  Consider triangle optimizations.
> > > >
> > > > - Do you mean "parallel" when you type || or is there another meaning
> > > > there?
> > > >
> > >
> > > No, i mean method called "kmeans||". It's unfortunate name since I
> really
> > > don't know how to make google to search for it.
> > >
> > > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> > >
> > > >
> > > > - When you say ++ initialization, many people get this wrong and
> assume
> > > > that you mean pick the furthest point.  Getting really good
> > > initialization
> > > > is fairly difficult and typically requires more time than the actual
> > > > clustering.  This is one of the key benefits of sketch based methods.
> > > >
> > > > - Most algorithms require multiple restarts.  At higher dimension the
> > > > number of restarts required becomes very large.  An ideal
> > implementation
> > > > does parallel sketch extraction followed by parallel ball k-means for
> > > > restarts.
> > > >
> > > >
> > > >
> > > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > > > wrote:
> > > >
> > > > > Considering porting implementation [1] and paper for KMeans || for
> > > > > Bindings.
> > > > >
> > > > > This seems like another method to map fairly nicely.
> > > > >
> > > > > The problem I am contemplating is ||-initialization, and in
> > particular,
> > > > > centroid storage. That particular implementation assumes centroids
> > > could
> > > > be
> > > > > kept in memory in front.
> > > > >
> > > > > (1) Question is, is it a dangerous idea. It doesn't seem like it
> > > > > particularly is, since unlikely people would want more k>1e+6.
> > Another
> > > > > thing, centers seem to be passed in via closure attribute (i.e.
> > > > > java-serialized array-backed matrix).However, with Bindings it is
> > quite
> > > > > possible to keep centers at the back as a matrix.
> > > > >
> > > > > (2) obviously, LLoyd iterations are not terribly accurate. || and
> ++
> > > > > versions mostly speed things up. Is there any better-than-LLoyd
> > > accuracy
> > > > > preference?
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> > > > >
> > > >
> > >
> >
>

Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Is there any opinion on whether one of the pipe-pipe (||) or dp-means
initializations has bona-fide advantages over the other?


On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <te...@gmail.com> wrote:

> On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > - Have you considered sketch based algorithms?
> > >
> >
> > can you give me a reference. at this point i am just  contemplating more
> or
> > less shameless port of what they've done in mllib.
> >
>
> Here is the reference I used:
>
>
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
>
> *A quick summary:*
>
> - a single pass over the data with a sort of dp-means [1] algorithm using a
> very fast approximate centroid search can give you k log N centroids.  This
> is the sketch.
>
> - clustering these centroids as weighted values gives a provably probably
> good clustering of the original data for well clusterable data.  For real
> data, it seems to work exactly as advertised.
>
> - clustering the sketch using ball k-means with clever initialization is
> important.  Good initialization is very expensive in terms of number of
> points so using a sketch is a really good idea.
>
> - the proofs all depend on Euclidean distance.
>
> - the threshold can start very small (what small means is determined
> empirically).  Each time we wind up with too many centroids, recursively
> clustering the centroids and possibly increasing the threshold will keep
> the number reasonably bounded.
>
> *Some notes from practical experience:*
>
> - moderate levels of parallelism are easy since sketches can be merged
> using set union.  You may want to recursively cluster the centroids at this
> point if you have too many.  This is very nice application of map-reduce.
>
> - high degrees of parallelism require multiple levels of merge/collapse
> since otherwise you wind up with a sketch nearly as large as the original
> data.  If you have m parallel clusterings then m k log (N/m) can be large.
>  Take a billion points and m = 1000 and k = 10,000.  The size of the
> parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m = 100, k
> = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
>
> - ball k-means on highly clusterable data often uses a cutoff for centroid
> computation of 0.5 x distance to nearest.  I find that with real data that
> 1 x distance or even larger to nearest is much better.  The good effects
> are still mostly there, but you don't need wonderful data to succeed
>
> - the ball k-means algorithm that we have in Mahout is pretty high quality,
> but could use a bit of triangle (Elkan [2] ) speedups.
>
> *References*
>
> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
>
>
> >
> > >
> > > - It can be important to use optimizations in the search for nearest
> > > centroid.  Consider triangle optimizations.
> > >
> > > - Do you mean "parallel" when you type || or is there another meaning
> > > there?
> > >
> >
> > No, i mean method called "kmeans||". It's unfortunate name since I really
> > don't know how to make google to search for it.
> >
> > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> >
> > >
> > > - When you say ++ initialization, many people get this wrong and assume
> > > that you mean pick the furthest point.  Getting really good
> > initialization
> > > is fairly difficult and typically requires more time than the actual
> > > clustering.  This is one of the key benefits of sketch based methods.
> > >
> > > - Most algorithms require multiple restarts.  At higher dimension the
> > > number of restarts required becomes very large.  An ideal
> implementation
> > > does parallel sketch extraction followed by parallel ball k-means for
> > > restarts.
> > >
> > >
> > >
> > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > Considering porting implementation [1] and paper for KMeans || for
> > > > Bindings.
> > > >
> > > > This seems like another method to map fairly nicely.
> > > >
> > > > The problem I am contemplating is ||-initialization, and in
> particular,
> > > > centroid storage. That particular implementation assumes centroids
> > could
> > > be
> > > > kept in memory in front.
> > > >
> > > > (1) Question is, is it a dangerous idea. It doesn't seem like it
> > > > particularly is, since unlikely people would want more k>1e+6.
> Another
> > > > thing, centers seem to be passed in via closure attribute (i.e.
> > > > java-serialized array-backed matrix).However, with Bindings it is
> quite
> > > > possible to keep centers at the back as a matrix.
> > > >
> > > > (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> > > > versions mostly speed things up. Is there any better-than-LLoyd
> > accuracy
> > > > preference?
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> > > >
> > >
> >
>

Re: KMeans|| opinions

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > - Have you considered sketch based algorithms?
> >
>
> can you give me a reference. at this point i am just  contemplating more or
> less shameless port of what they've done in mllib.
>

Here is the reference I used:

http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets

*A quick summary:*

- a single pass over the data with a sort of dp-means [1] algorithm using a
very fast approximate centroid search can give you k log N centroids.  This
is the sketch.

- clustering these centroids as weighted values gives a provably probably
good clustering of the original data for well clusterable data.  For real
data, it seems to work exactly as advertised.

- clustering the sketch using ball k-means with clever initialization is
important.  Good initialization is very expensive in terms of number of
points so using a sketch is a really good idea.

- the proofs all depend on Euclidean distance.

- the threshold can start very small (what small means is determined
empirically).  Each time we wind up with too many centroids, recursively
clustering the centroids and possibly increasing the threshold will keep
the number reasonably bounded.

*Some notes from practical experience:*

- moderate levels of parallelism are easy since sketches can be merged
using set union.  You may want to recursively cluster the centroids at this
point if you have too many.  This is a very nice application of map-reduce.

- high degrees of parallelism require multiple levels of merge/collapse
since otherwise you wind up with a sketch nearly as large as the original
data.  If you have m parallel clusterings then m k log (N/m) can be large.
 Take a billion points and m = 1000 and k = 10,000.  The size of the
parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m = 100, k
= 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.

- ball k-means on highly clusterable data often uses a cutoff for centroid
computation of 0.5 x distance to nearest.  I find that with real data,
1 x distance or even larger to nearest is much better.  The good effects
are still mostly there, but you don't need wonderful data to succeed.

- the ball k-means algorithm that we have in Mahout is pretty high quality,
but could use a bit of triangle (Elkan [2]) speedups.

*References*

[1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
[2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
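
As an illustration of the "too many centroids" note above, a minimal
sketch of the collapse step (the collapse name, maxSketchSize and the
weight-only merge are hypothetical; a real merge would also take the
weighted mean of the merged positions):

  // Hypothetical sketch: when the sketch outgrows its bound, grow the threshold
  // and re-cluster the weighted centroids against themselves until it fits.
  def collapse[P](centroids: Seq[(P, Long)],      // (centroid, weight) pairs
      threshold: Double, maxSketchSize: Int,
      dist: (P, P) => Double): (Seq[(P, Long)], Double) = {
    var t = threshold
    var cs = centroids
    while (cs.size > maxSketchSize) {
      t *= 1.5                                    // grow the threshold
      val merged = scala.collection.mutable.ArrayBuffer[(P, Long)]()
      for ((c, w) <- cs) {
        val near = merged.indexWhere { case (m, _) => dist(c, m) <= t }
        if (near < 0) merged += (c -> w)          // keep as its own centroid
        else {
          val (m, mw) = merged(near)
          merged(near) = (m, mw + w)              // fold the weight into a neighbor
        }
      }
      cs = merged
    }
    (cs, t)
  }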


>
> >
> > - It can be important to use optimizations in the search for nearest
> > centroid.  Consider triangle optimizations.
> >
> > - Do you mean "parallel" when you type || or is there another meaning
> > there?
> >
>
> No, i mean method called "kmeans||". It's unfortunate name since I really
> don't know how to make google to search for it.
>
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
>
> >
> > - When you say ++ initialization, many people get this wrong and assume
> > that you mean pick the furthest point.  Getting really good
> initialization
> > is fairly difficult and typically requires more time than the actual
> > clustering.  This is one of the key benefits of sketch based methods.
> >
> > - Most algorithms require multiple restarts.  At higher dimension the
> > number of restarts required becomes very large.  An ideal implementation
> > does parallel sketch extraction followed by parallel ball k-means for
> > restarts.
> >
> >
> >
> > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Considering porting implementation [1] and paper for KMeans || for
> > > Bindings.
> > >
> > > This seems like another method to map fairly nicely.
> > >
> > > The problem I am contemplating is ||-initialization, and in particular,
> > > centroid storage. That particular implementation assumes centroids
> could
> > be
> > > kept in memory in front.
> > >
> > > (1) Question is, is it a dangerous idea. It doesn't seem like it
> > > particularly is, since unlikely people would want more k>1e+6. Another
> > > thing, centers seem to be passed in via closure attribute (i.e.
> > > java-serialized array-backed matrix).However, with Bindings it is quite
> > > possible to keep centers at the back as a matrix.
> > >
> > > (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> > > versions mostly speed things up. Is there any better-than-LLoyd
> accuracy
> > > preference?
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> > >
> >
>

Re: KMeans|| opinions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <te...@gmail.com> wrote:

> - Have you considered sketch based algorithms?
>

Can you give me a reference? At this point I am just contemplating a more
or less shameless port of what they've done in mllib.


>
> - It can be important to use optimizations in the search for nearest
> centroid.  Consider triangle optimizations.
>
> - Do you mean "parallel" when you type || or is there another meaning
> there?
>

No, I mean the method called "kmeans||". It's an unfortunate name since I
really don't know how to make Google search for it.

http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

>
> - When you say ++ initialization, many people get this wrong and assume
> that you mean pick the furthest point.  Getting really good initialization
> is fairly difficult and typically requires more time than the actual
> clustering.  This is one of the key benefits of sketch based methods.
>
> - Most algorithms require multiple restarts.  At higher dimension the
> number of restarts required becomes very large.  An ideal implementation
> does parallel sketch extraction followed by parallel ball k-means for
> restarts.
>
>
>
> On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Considering porting implementation [1] and paper for KMeans || for
> > Bindings.
> >
> > This seems like another method to map fairly nicely.
> >
> > The problem I am contemplating is ||-initialization, and in particular,
> > centroid storage. That particular implementation assumes centroids could
> be
> > kept in memory in front.
> >
> > (1) Question is, is it a dangerous idea. It doesn't seem like it
> > particularly is, since unlikely people would want more k>1e+6. Another
> > thing, centers seem to be passed in via closure attribute (i.e.
> > java-serialized array-backed matrix).However, with Bindings it is quite
> > possible to keep centers at the back as a matrix.
> >
> > (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> > versions mostly speed things up. Is there any better-than-LLoyd accuracy
> > preference?
> >
> >
> > [1]
> >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> >
>

Re: KMeans|| opinions

Posted by Ted Dunning <te...@gmail.com>.
- Have you considered sketch based algorithms?

- It can be important to use optimizations in the search for nearest
centroid.  Consider triangle optimizations.

- Do you mean "parallel" when you type || or is there another meaning there?

- When you say ++ initialization, many people get this wrong and assume
that you mean pick the furthest point.  Getting really good initialization
is fairly difficult and typically requires more time than the actual
clustering.  This is one of the key benefits of sketch based methods.

- Most algorithms require multiple restarts.  At higher dimensions the
number of restarts required becomes very large.  An ideal implementation
does parallel sketch extraction followed by parallel ball k-means for the
restarts.
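
Regarding the triangle optimizations mentioned above, a minimal sketch of
the simplest Elkan-style bound (the closest name, the point type P and the
precomputed centerDist table are hypothetical):

  // Hypothetical sketch: candidate center j can be skipped whenever
  // d(best, j) >= 2 * d(x, best), because by the triangle inequality
  // d(x, j) >= d(best, j) - d(x, best) >= d(x, best), so j cannot win.
  def closest[P](x: P, centers: IndexedSeq[P],
      centerDist: Array[Array[Double]],           // precomputed d(c_i, c_j)
      dist: (P, P) => Double): Int = {
    var best = 0
    var bestD = dist(x, centers(0))
    for (j <- 1 until centers.size) {
      if (centerDist(best)(j) < 2 * bestD) {      // only then can j possibly win
        val d = dist(x, centers(j))
        if (d < bestD) { best = j; bestD = d }
      }
    }
    best
  }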



On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Considering porting implementation [1] and paper for KMeans || for
> Bindings.
>
> This seems like another method to map fairly nicely.
>
> The problem I am contemplating is ||-initialization, and in particular,
> centroid storage. That particular implementation assumes centroids could be
> kept in memory in front.
>
> (1) Question is, is it a dangerous idea. It doesn't seem like it
> particularly is, since unlikely people would want more k>1e+6. Another
> thing, centers seem to be passed in via closure attribute (i.e.
> java-serialized array-backed matrix).However, with Bindings it is quite
> possible to keep centers at the back as a matrix.
>
> (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> versions mostly speed things up. Is there any better-than-LLoyd accuracy
> preference?
>
>
> [1]
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
>