Posted to issues@commons.apache.org by "Nate Paymer (JIRA)" <ji...@apache.org> on 2011/03/12 06:13:04 UTC
[jira] Created: (MATH-548) KMeansPlusPlusClusterer should run multiple trials
KMeansPlusPlusClusterer should run multiple trials
--------------------------------------------------
Key: MATH-548
URL: https://issues.apache.org/jira/browse/MATH-548
Project: Commons Math
Issue Type: Improvement
Reporter: Nate Paymer
Priority: Minor
The interface and documentation for KMeansPlusPlusClusterer imply that a single call to cluster() is sufficient to get the optimal set of clusters. But this isn't true -- practically every client should be calling cluster() multiple times, selecting the best resulting set of clusters. It seems to me that rather than forcing every client to implement this functionality, it should be placed directly in the KMeansPlusPlusClusterer class.
I propose adding a new method to KMeansPlusPlusClusterer:
List<Cluster<T>> cluster(Collection<T> points, int k, int numTrials, int maxIterationsPerTrial)
which calls the existing cluster() method numTrials times, returning the best result.
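A minimal sketch of the multi-trial idea described above, for illustration only. MultiTrial and best() are hypothetical names, not part of Commons Math; the trial supplier stands in for one call to the existing cluster() method, and the cost function is whatever measure defines "best" (lower is better).

```java
import java.util.function.Supplier;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper, not part of Commons Math. */
final class MultiTrial {

    private MultiTrial() { }

    /**
     * Runs the trial numTrials times and returns the result with the lowest cost.
     * Each call to trial.get() is assumed to be one independent, randomly seeded run.
     */
    static <R> R best(Supplier<R> trial, ToDoubleFunction<R> cost, int numTrials) {
        if (numTrials < 1) {
            throw new IllegalArgumentException("numTrials must be at least 1");
        }
        // The first run is the initial best candidate.
        R best = trial.get();
        double bestCost = cost.applyAsDouble(best);
        for (int i = 1; i < numTrials; i++) {
            R candidate = trial.get();
            double c = cost.applyAsDouble(candidate);
            if (c < bestCost) {
                best = candidate;
                bestCost = c;
            }
        }
        return best;
    }
}
```

Assuming the existing cluster() takes the points, k and an iteration limit, a client could call this as MultiTrial.best(() -> clusterer.cluster(points, k, maxIterationsPerTrial), cost, numTrials).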
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MATH-548) KMeansPlusPlusClusterer should run multiple trials
Posted by "Luc Maisonobe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MATH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044562#comment-13044562 ]
Luc Maisonobe commented on MATH-548:
------------------------------------
If I understand correctly, you suggest adding something similar to the multi-start feature of the optimization package.
What I don't get is how we would define the "best result".
[jira] [Commented] (MATH-548) KMeansPlusPlusClusterer should run multiple trials
Posted by "Nate Paymer (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MATH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044654#comment-13044654 ]
Nate Paymer commented on MATH-548:
----------------------------------
The best result would be the one that minimizes the sum of the squared distances from each point to the center of its cluster. See http://en.wikipedia.org/wiki/K-means_clustering#Description
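As a rough illustration of that criterion, the sketch below computes the within-cluster sum of squared Euclidean distances. ClusterCost and its method names are hypothetical, and points are represented as plain double[] arrays rather than the Commons Math cluster types; centers.get(j) is assumed to be the center of the points in members.get(j).

```java
import java.util.List;

/** Hypothetical helper, not part of Commons Math. */
final class ClusterCost {

    private ClusterCost() { }

    /** Squared Euclidean distance between two points of the same dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Sum, over all clusters, of the squared distance from each member
     * point to the center of its own cluster. Lower means tighter clusters.
     */
    static double sumOfSquares(List<double[]> centers, List<List<double[]>> members) {
        double total = 0.0;
        for (int j = 0; j < centers.size(); j++) {
            double[] center = centers.get(j);
            for (double[] point : members.get(j)) {
                total += squaredDistance(point, center);
            }
        }
        return total;
    }
}
```

The trial whose clusters give the smallest value would be the one returned.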
[jira] [Resolved] (MATH-548) KMeansPlusPlusClusterer should run multiple trials
Posted by "Luc Maisonobe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MATH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luc Maisonobe resolved MATH-548.
--------------------------------
Resolution: Fixed
Fixed in subversion tree as of r1137759.
Thanks for the report.