Posted to dev@commons.apache.org by Artem Barger <ar...@bargr.net> on 2016/05/30 22:28:48 UTC

[Math] MATH-1371: Provide accelerated kmeans++ implementation

Hi,

I've used the current KMeansPlusPlusClusterer implementation provided
by CM out of the box, but saw that it doesn't scale well on large data
volumes. One of the proposals to improve the current implementation,
submitted in JIRA-1330, is to add support for sparse big data, i.e. to
make the clustering algorithm work w/ SparseVectors.

While working on 1330, I came across Elkan, Charles. "Using the
triangle inequality to accelerate k-means." ICML. Vol. 3. 2003, a paper
which describes an additional way of scaling up the performance of the
k-means algorithm. The method uses the triangle inequality to avoid
unnecessary distance computations caused by small movements of a
cluster center that don't affect the assignment of a given point to its
cluster.
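
To give a flavour of the core trick (just a minimal illustrative sketch
for the list, not the code attached to the ticket): for a point x
currently assigned to center c, any other center c2 with
d(c, c2) >= 2*d(x, c) can be skipped, because the triangle inequality
gives d(x, c2) >= d(c, c2) - d(x, c) >= d(x, c).

    import org.apache.commons.math3.ml.distance.DistanceMeasure;

    /**
     * Illustrative sketch of Elkan's pruning test only, not the
     * MATH-1371 patch. Assumes the distance measure satisfies the
     * triangle inequality (e.g. Euclidean).
     */
    class ElkanPruningSketch {

        /**
         * Reassign one point, skipping every center that the triangle
         * inequality already proves cannot be closer than the current
         * best: if d(best, j) >= 2 * d(x, best) then
         * d(x, j) >= d(best, j) - d(x, best) >= d(x, best).
         */
        static int assign(double[] x, double[][] centers, double[][] centerDist,
                          int current, DistanceMeasure measure) {
            int best = current;
            double bestDist = measure.compute(x, centers[best]);
            for (int j = 0; j < centers.length; j++) {
                if (j == best || centerDist[best][j] >= 2 * bestDist) {
                    continue; // pruned: no distance computation for this center
                }
                double d = measure.compute(x, centers[j]);
                if (d < bestDist) {
                    bestDist = d;
                    best = j;
                }
            }
            return best;
        }
    }

Here centerDist would hold the pairwise distances between the current
centers, recomputed once per iteration; that is where the savings come
from when the number of points is much larger than k.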

Simple tests using PerfTestUtils show that Elkan's algorithm can
achieve a performance boost of an order of magnitude over the standard
one currently provided in CM for sufficiently large inputs.

I've opened MATH-1371 "Provide accelerated kmeans++ implementation"
https://issues.apache.org/jira/browse/MATH-1371 and attached my
implementation there. I will be glad to receive reviews and comments on it.

I believe that switching to this algorithm instead of the regular one
could improve the quality of the k-means solution provided by CM.

Best regards,
                      Artem Barger.

Re: [Math] MATH-1371: Provide accelerated kmeans++ implementation

Posted by Artem Barger <ar...@bargr.net>.
On Tue, May 31, 2016 at 3:31 AM, Gilles <gi...@harfang.homelinux.org>
wrote:

> On Tue, 31 May 2016 03:10:20 +0300, Artem Barger wrote:
>
>>
>>>
>>>>> Yes, you can parallelize it, though it will cancel several
>>>>> optimizations
>>>>>
>>>> I've added.
>>>> In fact you can partition the input according to the number of threads
>>>> you'd like to use
>>>> and make each thread take care of the relevant data chunk.
>>>>
>>>> I guess it will increase performance, not sure why the current
>>>> implementation wasn't
>>>> parallelized.
>>>>
>>>>
>>>> You are most welcome to enhance it. ;-)
>>>
>>>
>> Well, already did it ;-)
>> And if you mean to parallelize my current implementation of the Elkan
>> algorithm,
>> I think that it should be handled as a separate task.
>>
>>
> What I mean is that algorithms that can perform their computation
> on multiple threads should be implemented in such a way that the
> feature can be used.
>
> Of course, that means more complex code, to ensure thread-safety.
> So, for sure, your code could be added now (modulo the remarks)
> if you don't want to handle that now. And as you pointed out, the
> other implementations are not multi-threaded either.
>
> But it's a direction worth exploring I think.



I totally agree w/ you, I'm simply suggesting to go incremental: first
address all the comments provided, then, once it is committed and
accepted, I can add multi-threaded support as well.

Also I think that multi-threading support might impose some API changes
and should be added carefully :)))
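
Just to make concrete the kind of API change I have in mind (purely
hypothetical, the class name and constructors below are made up and are
not part of the attached patch), parallelism could be exposed as an
optional constructor argument so that existing callers are unaffected:

    import org.apache.commons.math3.ml.clustering.Clusterable;

    /** Hypothetical API shape only; not the code attached to MATH-1371. */
    public class ElkanKMeansPlusPlusClusterer<T extends Clusterable> {
        private final int k;
        private final int numThreads;

        /** Single-threaded by default, so current callers keep the same behaviour. */
        public ElkanKMeansPlusPlusClusterer(int k) {
            this(k, 1);
        }

        /** Callers that want parallel point assignment opt in explicitly. */
        public ElkanKMeansPlusPlusClusterer(int k, int numThreads) {
            this.k = k;
            this.numThreads = numThreads;
        }
    }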

Re: [Math] MATH-1371: Provide accelerated kmeans++ implementation

Posted by Gilles <gi...@harfang.homelinux.org>.
On Tue, 31 May 2016 03:10:20 +0300, Artem Barger wrote:
>>
>>>>
>>>> Yes, you can parallelize it, though it will cancel several
>>>> optimizations
>>> I've added.
>>> In fact you can partition the input according to the number of
>>> threads you'd like to use
>>> and make each thread take care of the relevant data chunk.
>>>
>>> I guess it will increase performance, not sure why the current
>>> implementation wasn't
>>> parallelized.
>>>
>>>
>> You are most welcome to enhance it. ;-)
>>
>
> Well, already did it ;-)
> And if you mean to parallelize my current implementation of the Elkan
> algorithm,
> I think that it should be handled as a separate task.
>

What I mean is that algorithms that can perform their computation
on multiple threads should be implemented in such a way that the
feature can be used.

Of course, that means more complex code, to ensure thread-safety.
So, for sure, your code could be added now (modulo the remarks)
if you don't want to handle that now. And as you pointed out, the
other implementations are not multi-threaded either.

But it's a direction worth exploring I think.

Gilles



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1371: Provide accelerated kmeans++ implementation

Posted by Artem Barger <ar...@bargr.net>.
>
>>>
>>> Yes, you can parallelize it, though it will cancel several optimizations
>> I've added.
>> In fact you can partition the input according to the number of threads
>> you'd like to use
>> and make each thread take care of the relevant data chunk.
>>
>> I guess it will increase performance, not sure why the current
>> implementation wasn't
>> parallelized.
>>
>>
> You are most welcome to enhance it. ;-)
>

Well, already did it ;-)
And if you mean to parallelize my current implementation of the Elkan
algorithm,
I think that it should be handled as a separate task.



>
> Gilles
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [Math] MATH-1371: Provide accelerated kmeans++ implementation

Posted by Gilles <gi...@harfang.homelinux.org>.
On Tue, 31 May 2016 02:42:03 +0300, Artem Barger wrote:
> On Tue, May 31, 2016 at 2:20 AM, Gilles 
> <gi...@harfang.homelinux.org>
> wrote:
>
>> On Tue, 31 May 2016 01:28:48 +0300, Artem Barger wrote:
>>
>>> Hi,
>>>
>>> I've used the current KMeansPlusPlusClusterer implementation
>>> provided by CM out of the box, but saw that it doesn't scale well
>>> on large data volumes. One of the proposals to improve the current
>>> implementation, submitted in JIRA-1330, is to add support for
>>> sparse big data, i.e. to make the clustering algorithm work w/
>>> SparseVectors.
>>>
>>> While working on 1330, I came across Elkan, Charles. "Using the
>>> triangle inequality to accelerate k-means." ICML. Vol. 3. 2003, a
>>> paper which describes an additional way of scaling up the
>>> performance of the k-means algorithm. The method uses the triangle
>>> inequality to avoid unnecessary distance computations caused by
>>> small movements of a cluster center that don't affect the
>>> assignment of a given point to its cluster.
>>>
>>> Simple tests using PerfTestUtils show that Elkan's algorithm can
>>> achieve a performance boost of an order of magnitude over the
>>> standard one currently provided in CM for sufficiently large
>>> inputs.
>>>
>>
>> Impressive. :-)
>>
>> I've opened MATH-1371 "Provide accelerated kmeans++ implementation"
>>> https://issues.apache.org/jira/browse/MATH-1371 and attached my
>>> implementation there. I will be glad to receive reviews and
>>> comments on it.
>>>
>>> I believe that switching to this algorithm instead of the regular
>>> one could improve the quality of the k-means solution provided by
>>> CM.
>>>
>>
>> Are these algorithms parallelizable?
>> Wouldn't such an implementation increase performance even more?
>>
>>
> Yes, you can parallelize it, though it will cancel several
> optimizations I've added.
> In fact you can partition the input according to the number of
> threads you'd like to use and make each thread take care of the
> relevant data chunk.
>
> I guess it will increase performance, not sure why the current
> implementation wasn't parallelized.
>

You are most welcome to enhance it. ;-)

Gilles


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [Math] MATH-1371: Provide accelerated kmeans++ implementation

Posted by Artem Barger <ar...@bargr.net>.
On Tue, May 31, 2016 at 2:20 AM, Gilles <gi...@harfang.homelinux.org>
wrote:

> On Tue, 31 May 2016 01:28:48 +0300, Artem Barger wrote:
>
>> Hi,
>>
>> I've used the current KMeansPlusPlusClusterer implementation provided
>> by CM out of the box, but saw that it doesn't scale well on large data
>> volumes. One of the proposals to improve the current implementation,
>> submitted in JIRA-1330, is to add support for sparse big data, i.e. to
>> make the clustering algorithm work w/ SparseVectors.
>>
>> While working on 1330, I came across Elkan, Charles. "Using the
>> triangle inequality to accelerate k-means." ICML. Vol. 3. 2003, a paper
>> which describes an additional way of scaling up the performance of the
>> k-means algorithm. The method uses the triangle inequality to avoid
>> unnecessary distance computations caused by small movements of a
>> cluster center that don't affect the assignment of a given point to its
>> cluster.
>>
>> Simple tests using PerfTestUtils show that Elkan's algorithm can achieve
>> a performance boost of an order of magnitude over the standard one
>> currently provided in CM for sufficiently large inputs.
>>
>
> Impressive. :-)
>
> I've opened MATH-1371 "Provide accelerated kmeans++ implementation"
>> https://issues.apache.org/jira/browse/MATH-1371 and attached my
>> implementation there. I will be glad to receive reviews and comments
>> on it.
>>
>> I believe that switching to this algorithm instead of the regular one
>> could improve the quality of the k-means solution provided by CM.
>>
>
> Are these algorithms parallelizable?
> Wouldn't such an implementation increase performance even more?
>
>
Yes, you can parallelize it, though it will cancel several optimizations
I've added.
In fact you can partition the input according to the number of threads
you'd like to use and make each thread take care of the relevant data
chunk.

I guess it will increase performance; not sure why the current
implementation wasn't parallelized.
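
To sketch what I mean (again just an illustration for the list; the
class and helper names below are made up and this is not the code
attached to the ticket), the assignment step could hand each thread a
disjoint slice of the points:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;

    /** Illustrative sketch of partitioning the assignment step across threads. */
    class ParallelAssignSketch {

        /** Assign every point to its nearest center, one chunk of points per thread. */
        static int[] assignAll(final double[][] points, final double[][] centers,
                               ExecutorService pool, int threads)
                throws InterruptedException, ExecutionException {
            final int[] assignments = new int[points.length];
            final int chunk = (points.length + threads - 1) / threads;
            List<Future<?>> tasks = new ArrayList<Future<?>>();
            for (int t = 0; t < threads; t++) {
                final int from = t * chunk;
                final int to = Math.min(points.length, from + chunk);
                tasks.add(pool.submit(new Runnable() {
                    public void run() {
                        // Each task writes to a disjoint range, so no locking is needed.
                        for (int i = from; i < to; i++) {
                            assignments[i] = nearestCenter(points[i], centers);
                        }
                    }
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for every chunk to finish
            }
            return assignments;
        }

        /** Plain nearest-center search using squared Euclidean distance. */
        private static int nearestCenter(double[] x, double[][] centers) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int j = 0; j < centers.length; j++) {
                double d = 0;
                for (int dim = 0; dim < x.length; dim++) {
                    double diff = x[dim] - centers[j][dim];
                    d += diff * diff;
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = j;
                }
            }
            return best;
        }
    }

The center update would still need a reduction step across the
per-thread partial results, and that bookkeeping is what starts to
interfere with the optimizations I mentioned.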



> Regards,
> Gilles
>
>
>> Best regards,
>>                       Artem Barger.
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [Math] MATH-1371: Provide accelerated kmeans++ implementation

Posted by Gilles <gi...@harfang.homelinux.org>.
On Tue, 31 May 2016 01:28:48 +0300, Artem Barger wrote:
> Hi,
>
> I've used the current KMeansPlusPlusClusterer implementation
> provided by CM out of the box, but saw that it doesn't scale well on
> large data volumes. One of the proposals to improve the current
> implementation, submitted in JIRA-1330, is to add support for sparse
> big data, i.e. to make the clustering algorithm work w/ SparseVectors.
>
> While working on 1330, I came across Elkan, Charles. "Using the
> triangle inequality to accelerate k-means." ICML. Vol. 3. 2003, a
> paper which describes an additional way of scaling up the performance
> of the k-means algorithm. The method uses the triangle inequality to
> avoid unnecessary distance computations caused by small movements of
> a cluster center that don't affect the assignment of a given point to
> its cluster.
>
> Simple tests using PerfTestUtils show that Elkan's algorithm can
> achieve a performance boost of an order of magnitude over the
> standard one currently provided in CM for sufficiently large inputs.

Impressive. :-)

> I've opened MATH-1371 "Provide accelerated kmeans++ implementation"
> https://issues.apache.org/jira/browse/MATH-1371 and attached my
> implementation there. I will be glad to receive reviews and comments
> on it.
>
> I believe that switching to this algorithm instead of the regular one
> could improve the quality of the k-means solution provided by CM.

Are these algorithms parallelizable?
Wouldn't such an implementation increase performance even more?

Regards,
Gilles

>
> Best regards,
>                       Artem Barger.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org