Posted to dev@commons.apache.org by CT <ch...@qq.com> on 2020/01/20 02:08:23 UTC

[math]New feature MiniBatchKMeansClusterer

Hi,

In my picture-search project, I need a clustering algorithm to narrow the dataset and accelerate searching over millions of pictures.
First we used Python + PyTorch + KMeans; as the data grew from thousands to millions, KMeans clustering became slower and slower (seconds to minutes). Then we found that MiniBatchKMeans could, amazingly, finish the clustering in 1~2 seconds on millions of data points.
Meanwhile we also faced the insufficient concurrency of Python, so we switched to Kotlin on the JVM.
But there was no MiniBatchKMeans algorithm on the JVM yet, so I wrote one in Kotlin, referring to the (Python) scikit-learn MiniBatchKMeans and Apache Commons Math (Deeplearning4j was also considered, but it is too slow because of ND4J's design).


I'd like to contribute it to Apache Commons Math, and I wrote a Java version: https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans


From my test (Kotlin version), it is very fast, but gives slightly different results from KMeans++ in most cases, and sometimes a big difference (maybe affected by the randomness of the mini-batch):




Some bad cases:

It is even worse when I use RandomSource.create(RandomSource.MT_64, 0) for the random generator ┐(´-`)┌.


My brief understanding of MiniBatchKMeans:
It uses a subset of the points to initialize the cluster centers, and a random mini-batch in each training iteration.
It can finish in a few seconds when clustering millions of data points, and differs little from KMeans.
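The idea above can be sketched in Java as follows. This is a minimal illustrative implementation, not the code proposed for contribution: it uses plain random initialization instead of k-means++, a fixed iteration count instead of a convergence test, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Minimal mini-batch k-means sketch (illustrative only). */
public class MiniBatchKMeansSketch {

    /** Clusters the given points into k groups and returns the k centers. */
    public static double[][] cluster(double[][] points, int k, int batchSize,
                                     int iterations, long seed) {
        Random rng = new Random(seed);

        // Initialize centers by sampling k distinct points
        // (the real proposal would use k-means++ on a sample).
        List<double[]> shuffled = new ArrayList<>(Arrays.asList(points));
        Collections.shuffle(shuffled, rng);
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) {
            centers[i] = shuffled.get(i).clone();
        }
        int[] counts = new int[k]; // per-center update counts

        for (int iter = 0; iter < iterations; iter++) {
            // Draw a random mini-batch and update centers incrementally.
            for (int b = 0; b < batchSize; b++) {
                double[] p = points[rng.nextInt(points.length)];
                int c = nearest(centers, p);
                counts[c]++;
                double eta = 1.0 / counts[c]; // decaying per-center learning rate
                for (int d = 0; d < p.length; d++) {
                    centers[c][d] += eta * (p[d] - centers[c][d]);
                }
            }
        }
        return centers;
    }

    /** Index of the center closest (squared Euclidean distance) to p. */
    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = centers[i][d] - p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }
}
```

Each center keeps its own update count, so its learning rate 1/count decays as it absorbs more points; that decaying step size is what makes the noisy mini-batch updates settle down.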


More information about MiniBatchKMeans:
https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

Re: [math]New feature MiniBatchKMeansClusterer

Posted by Gilles Sadowski <gi...@gmail.com>.
Hi.

2020-01-20 3:08 UTC+01:00, CT <ch...@qq.com>:
> Hi,
> In my picture-search project, I need a clustering algorithm to narrow
> the dataset and accelerate searching over millions of pictures.
> First we used Python + PyTorch + KMeans; as the data grew from
> thousands to millions, KMeans clustering became slower and slower
> (seconds to minutes). Then we found that MiniBatchKMeans could,
> amazingly, finish the clustering in 1~2 seconds on millions of data
> points.

Impressive improvement.

> Meanwhile we also faced the insufficient concurrency of Python, so we
> switched to Kotlin on the JVM.
> But there was no MiniBatchKMeans algorithm on the JVM yet, so I wrote
> one in Kotlin, referring to the (Python) scikit-learn MiniBatchKMeans
> and Apache Commons Math (Deeplearning4j was also considered, but it
> is too slow because of ND4J's design).
>
>
> I'd like to contribute it to Apache Commons Math,

Thanks!

> and I wrote a java
> version:
> https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans

Some remarks:

* I didn't get why the "KMeansPlusPlusCentroidInitializer" class
does not call the existing "KMeansPlusPlusClusterer".
Code seems duplicated: As there is a case for reuse, the currently
"private" centroid initialization code should be factored out.

* In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
that a computation is performed (as opposed to selecting from an
existing data structure).

* Not convinced that there should be so many constructors (in most
cases, it makes no sense to pick default values that are likely to
be heavily application-dependent).

* Input should generally be validated: e.g. the maximum number of
iterations should not be changed unwittingly; rather, an exception
should be raised if the user passes a negative value.
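For illustration, a fail-fast check along those lines might look like this. The class and parameter names here are hypothetical, chosen for the example rather than taken from the actual patch:

```java
/** Hypothetical configuration holder demonstrating fail-fast validation. */
public class MiniBatchKMeansConfig {
    private final int maxIterations;
    private final int batchSize;

    public MiniBatchKMeansConfig(int maxIterations, int batchSize) {
        // Reject invalid input instead of silently substituting a default.
        if (maxIterations <= 0) {
            throw new IllegalArgumentException(
                "maxIterations must be positive: " + maxIterations);
        }
        if (batchSize <= 0) {
            throw new IllegalArgumentException(
                "batchSize must be positive: " + batchSize);
        }
        this.maxIterations = maxIterations;
        this.batchSize = batchSize;
    }

    public int getMaxIterations() { return maxIterations; }
    public int getBatchSize() { return batchSize; }
}
```

Throwing at construction time keeps a bad value from being quietly replaced deep inside the iteration loop, where the user would never notice.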

>
> From my test (Kotlin version), it is very fast, but gives slightly
> different results from KMeans++ in most cases, and sometimes a big
> difference (maybe affected by the randomness of the mini-batch):

Could be nice to illustrate (not just with a picture, but in a table
with entries averaged over several runs) the differences in results
between the implementations, using various setups (number of
clusters, stopping criterion, etc.).

>
>
>
> Some bad cases:
>
> It is even worse when I use RandomSource.create(RandomSource.MT_64, 0)
> for the random generator ┐(´-`)┌.

"MT_64" is probably not the best default.  And this is one of the
parameters from which there should not be a default IMO.

[Note: there are spurious characters in your message (see e.g. the
paragraph quoted just above) that make it difficult to read.]

Best regards,
Gilles

>
>
> My brief understanding of MiniBatchKMeans:
> It uses a subset of the points to initialize the cluster centers, and
> a random mini-batch in each training iteration.
> It can finish in a few seconds when clustering millions of data
> points, and differs little from KMeans.
>
>
> More information about MiniBatchKMeans:
> https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
> https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org