Posted to dev@commons.apache.org by Artem Barger <ar...@bargr.net> on 2016/05/31 13:04:49 UTC

[Math] kmeans++: decouple EM Lloyd's iterations and initial seeding of clustering centers.

Hi,

The current implementation of k-means within the CM framework inherently uses
the algorithm published by Arthur, David, and Sergei Vassilvitskii,
"k-means++: The advantages of careful seeding." *Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete algorithms*. Society for
Industrial and Applied Mathematics, 2007. However, other algorithms for the
initial seeding are available, for instance:

1. Random initialization (each center picked uniformly at random).
2. Canopy https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
3. Bi-criteria: Feldman, Dan, et al. "Bi-criteria linear-time approximations
for generalized k-mean/median/center." *Proceedings of the twenty-third
annual symposium on Computational geometry*. ACM, 2007.

While I understand that k-means++ is the preferable option, the others could
also be useful for testing, trials and evaluation.

I'd like to propose separating the seeding logic from the clustering logic
(Lloyd's iterations), to increase the flexibility of k-means clustering.
I would be glad to hear your comments, pros/cons or objections...

Thanks,
                      Artem Barger.

Re: [Math] kmeans++: decouple EM Lloyd's iterations and initial seeding of clustering centers.

Posted by Artem Barger <ar...@bargr.net>.
On Wed, Jun 1, 2016 at 5:46 PM, Gilles <gi...@harfang.homelinux.org> wrote:

> On Wed, 1 Jun 2016 17:24:47 +0300, Artem Barger wrote:
>
>>
>> On Tue, May 31, 2016 at 4:04 PM, Artem Barger <ar...@bargr.net> wrote:
>>
>> Hi,
>>>
>>> The current implementation of k-means within the CM framework inherently uses
>>> the algorithm published by Arthur, David, and Sergei Vassilvitskii,
>>> "k-means++: The advantages of careful seeding." *Proceedings of the
>>> eighteenth annual ACM-SIAM symposium on Discrete algorithms*. Society for
>>> Industrial and Applied Mathematics, 2007. However, other algorithms for the
>>> initial seeding are available, for instance:
>>>
>>> 1. Random initialization (each center picked uniformly at random).
>>> 2. Canopy https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
>>> 3. Bi-criteria: Feldman, Dan, et al. "Bi-criteria linear-time
>>> approximations for generalized k-mean/median/center." *Proceedings of the
>>> twenty-third annual symposium on Computational geometry*. ACM, 2007.
>>>
>>> While I understand that k-means++ is the preferable option, the others
>>> could also be useful for testing, trials and evaluation.
>>>
>>> I'd like to propose separating the seeding logic from the clustering logic
>>> (Lloyd's iterations), to increase the flexibility of k-means clustering.
>>> I would be glad to hear your comments, pros/cons or objections...
>>>
>>>
>> I've found "Scalable K-Means++" (kmeans||), as described in
>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which provides
>> a parallelizable seeding procedure.
>> I guess this might serve as an additional +1 vote for separating the seeding
>> from Lloyd's iterations in the current k-means implementation.
>>
>
> I guess that, around here, you are the expert about these algorithms...
>

Thanks for the credit, though I'm still not an expert :)
I'd say I have "reasonable knowledge"...

> So go ahead and (re)write the code as you see fit, while still taking into
> account that the code should be self-documenting as much as possible.
> And OO (since this is Java).
>

Will do my best.


> If you are up for a major refactoring (e.g. for sparse data), I'd suggest
> to do it in a new package, so that we can easily compare the old and new
> codes (e.g. run the tests).
>

I'm working on it; it is not as trivial as I initially thought. I described
several concerns in other threads, for example the usage of the generic
parameter T and some assumptions enforced by the Clusterable interface.
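
For illustration, the constraint I mean looks roughly like the sketch below
(signatures reproduced from memory of the current ml.clustering API, so treat
them as approximate): every point has to expose a dense double[] through
Clusterable, and the clusterer is generic in T extends Clusterable, so both
the seeding step and Lloyd's iterations are tied to that dense representation.

import java.util.Collection;
import java.util.List;

// Approximate shape of the current API (from memory; details may differ).
interface Clusterable {
    double[] getPoint();   // forces a dense, fixed-dimension representation
}

class Cluster<T extends Clusterable> {
    // holds the points assigned to one cluster
}

abstract class Clusterer<T extends Clusterable> {
    abstract List<? extends Cluster<T>> cluster(Collection<T> points);
}

Supporting e.g. sparse data would therefore mean touching these interfaces as
well, not only splitting the seeding out of the Lloyd's loop.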

> And if you contemplate parallelization, I wonder whether the issue of
> switching to Java 8 might not have to be resolved first.
>

I'm absolutely in favor of switching to Java 8, as I can see many benefits,
both API- and performance-wise. What is the proper process for moving to Java 8?


Best,
    Artem Barger.

Re: [Math] kmeans++: decouple EM Lloyd's iterations and initial seeding of clustering centers.

Posted by Gilles <gi...@harfang.homelinux.org>.
On Wed, 1 Jun 2016 17:24:47 +0300, Artem Barger wrote:
>
> On Tue, May 31, 2016 at 4:04 PM, Artem Barger <ar...@bargr.net> 
> wrote:
>
>> Hi,
>>
>> The current implementation of k-means within the CM framework inherently uses
>> the algorithm published by Arthur, David, and Sergei Vassilvitskii,
>> "k-means++: The advantages of careful seeding." *Proceedings of the
>> eighteenth annual ACM-SIAM symposium on Discrete algorithms*. Society for
>> Industrial and Applied Mathematics, 2007. However, other algorithms for the
>> initial seeding are available, for instance:
>>
>> 1. Random initialization (each center picked uniformly at random).
>> 2. Canopy https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
>> 3. Bi-criteria: Feldman, Dan, et al. "Bi-criteria linear-time
>> approximations for generalized k-mean/median/center." *Proceedings of the
>> twenty-third annual symposium on Computational geometry*. ACM, 2007.
>>
>> While I understand that k-means++ is the preferable option, the others
>> could also be useful for testing, trials and evaluation.
>>
>> I'd like to propose separating the seeding logic from the clustering logic
>> (Lloyd's iterations), to increase the flexibility of k-means clustering.
>> I would be glad to hear your comments, pros/cons or objections...
>>
>>
> I've found "Scalable K-Means++" (kmeans||), as described in
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which provides
> a parallelizable seeding procedure.
> I guess this might serve as an additional +1 vote for separating the seeding
> from Lloyd's iterations in the current k-means implementation.

I guess that, around here, you are the expert about these algorithms...
So go ahead and (re)write the code as you see fit, while still taking into
account that the code should be self-documenting as much as possible.
And OO (since this is Java).

If you are up for a major refactoring (e.g. for sparse data), I'd suggest
to do it in a new package, so that we can easily compare the old and new
codes (e.g. run the tests).

And if you contemplate parallelization, I wonder whether the issue of
switching to Java 8 might not have to be resolved first.


Regards,
Gilles




Re: [Math] kmeans++: decouple EM Lloyd's iterations and initial seeding of clustering centers.

Posted by Artem Barger <ar...@bargr.net>.
On Tue, May 31, 2016 at 4:04 PM, Artem Barger <ar...@bargr.net> wrote:

> Hi,
>
> The current implementation of k-means within the CM framework inherently uses
> the algorithm published by Arthur, David, and Sergei Vassilvitskii,
> "k-means++: The advantages of careful seeding." *Proceedings of the
> eighteenth annual ACM-SIAM symposium on Discrete algorithms*. Society for
> Industrial and Applied Mathematics, 2007. However, other algorithms for the
> initial seeding are available, for instance:
>
> 1. Random initialization (each center picked uniformly at random).
> 2. Canopy https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
> 3. Bi-criteria: Feldman, Dan, et al. "Bi-criteria linear-time
> approximations for generalized k-mean/median/center." *Proceedings of the
> twenty-third annual symposium on Computational geometry*. ACM, 2007.
>
> While I understand that k-means++ is the preferable option, the others
> could also be useful for testing, trials and evaluation.
>
> I'd like to propose separating the seeding logic from the clustering logic
> (Lloyd's iterations), to increase the flexibility of k-means clustering.
> I would be glad to hear your comments, pros/cons or objections...
>
>
I've found "Scalable K-Means++" (kmeans||), as described in
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which provides
a parallelizable seeding procedure.
I guess this might serve as an additional +1 vote for separating the seeding
from Lloyd's iterations in the current k-means implementation.
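
For reference, here is a rough sequential sketch of the oversampling rounds
described in that paper. The class and method names, the fixed number of
rounds and the oversampling factor are illustrative choices of mine, not
existing CM code; the per-point work inside each round (distance plus a
Bernoulli draw) is independent across points, which is what makes the scheme
parallelizable.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Rough sketch of the k-means|| oversampling step (sequential version). */
final class KMeansParallelSeedingSketch {

    /**
     * Returns an oversampled candidate set (typically much larger than k).
     * A final pass would weight each candidate by the number of points
     * assigned to it and recluster the small weighted set down to k centers,
     * e.g. with the existing k-means++ seeding.
     */
    static List<double[]> overSample(double[][] points, int rounds,
                                     double oversamplingFactor, Random rnd) {
        List<double[]> candidates = new ArrayList<double[]>();
        // Start from one point chosen uniformly at random.
        candidates.add(points[rnd.nextInt(points.length)]);

        for (int r = 0; r < rounds; r++) {
            // phi = total cost = sum of squared distances to nearest candidate.
            double[] d2 = new double[points.length];
            double phi = 0.0;
            for (int i = 0; i < points.length; i++) {
                d2[i] = nearestSquaredDistance(points[i], candidates);
                phi += d2[i];
            }
            // Pick each point independently with probability l * d^2(x) / phi,
            // where l is the oversampling factor (often of the order of k).
            List<double[]> picked = new ArrayList<double[]>();
            for (int i = 0; i < points.length; i++) {
                if (rnd.nextDouble() < Math.min(1.0, oversamplingFactor * d2[i] / phi)) {
                    picked.add(points[i]);
                }
            }
            candidates.addAll(picked);
        }
        return candidates;
    }

    private static double nearestSquaredDistance(double[] x, List<double[]> centers) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] c : centers) {
            double s = 0.0;
            for (int j = 0; j < x.length; j++) {
                double d = x[j] - c[j];
                s += d * d;
            }
            best = Math.min(best, s);
        }
        return best;
    }
}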

Best,
    Artem Barger.