Posted to issues@commons.apache.org by "Nate Paymer (JIRA)" <ji...@apache.org> on 2011/03/12 06:13:04 UTC

[jira] Created: (MATH-548) KMeansPlusPlusClusterer should run multiple trials

KMeansPlusPlusClusterer should run multiple trials
--------------------------------------------------

                 Key: MATH-548
                 URL: https://issues.apache.org/jira/browse/MATH-548
             Project: Commons Math
          Issue Type: Improvement
            Reporter: Nate Paymer
            Priority: Minor


The interface and documentation for KMeansPlusPlusClusterer imply that a single call to cluster() is sufficient to get the optimal set of clusters.  But this isn't true -- practically every client should be calling cluster() multiple times, selecting the best resulting set of clusters.  It seems to me that rather than forcing every client to implement this functionality, it should be placed directly in the KMeansPlusPlusClusterer class.

I propose adding a new method to KMeansPlusPlusClusterer:
  List<Cluster<T>> cluster(Collection<T> points, int k, int numTrials, int maxIterationsPerTrial)
which calls the existing cluster() method numTrials times, returning the best result.
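
A minimal sketch of the multi-trial idea follows. It is not the code that was committed to Commons Math and it does not use the Commons Math API. The names MultiTrialSketch, bestOf, trial, and cost are hypothetical: trial stands in for one call to the existing cluster() method, and cost is whatever score the caller uses to rank a clustering, with lower taken to be better.

```java
import java.util.function.Supplier;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of Commons Math.
final class MultiTrialSketch {

    /**
     * Runs {@code trial} numTrials times and returns the result whose
     * {@code cost} is lowest. Assumption: a lower cost means a better result.
     */
    static <R> R bestOf(int numTrials, Supplier<R> trial, ToDoubleFunction<R> cost) {
        if (numTrials < 1) {
            throw new IllegalArgumentException("numTrials must be >= 1");
        }
        R best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < numTrials; i++) {
            R candidate = trial.get();              // one independent run
            double c = cost.applyAsDouble(candidate);
            // Always keep the first candidate; afterwards keep strict improvements.
            if (i == 0 || c < bestCost) {
                best = candidate;
                bestCost = c;
            }
        }
        return best;
    }
}
```

Because each k-means++ run starts from randomly seeded centers, repeated runs can converge to different local optima, which is why keeping the lowest-cost run can help.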

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MATH-548) KMeansPlusPlusClusterer should run multiple trials

Posted by "Luc Maisonobe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MATH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044562#comment-13044562 ] 

Luc Maisonobe commented on MATH-548:
------------------------------------

If I understand correctly, you suggest putting something similar to the multi-start feature from the optimization package.

What I don't get is how we can define "best result".

[jira] [Commented] (MATH-548) KMeansPlusPlusClusterer should run multiple trials

Posted by "Nate Paymer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MATH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044654#comment-13044654 ] 

Nate Paymer commented on MATH-548:
----------------------------------

The best result would be the one that minimizes the sum of the squared distances from each point to the center of its cluster.  See http://en.wikipedia.org/wiki/K-means_clustering#Description
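
To make that criterion concrete, here is a minimal sketch of the score. It assumes Euclidean distance and points stored as double[] arrays. WithinClusterCost and sumOfSquares are hypothetical names, not Commons Math API, and centers.get(i) is assumed to be the center of the points in members.get(i).

```java
import java.util.List;

// Hypothetical helper, not part of Commons Math.
final class WithinClusterCost {

    /**
     * Sum over all clusters of the squared Euclidean distance from each
     * member point to that cluster's center. Lower is better.
     */
    static double sumOfSquares(List<double[]> centers, List<List<double[]>> members) {
        double total = 0.0;
        for (int c = 0; c < centers.size(); c++) {
            double[] center = centers.get(c);
            for (double[] point : members.get(c)) {
                for (int d = 0; d < center.length; d++) {
                    double diff = point[d] - center[d];
                    total += diff * diff;   // squared distance, accumulated per coordinate
                }
            }
        }
        return total;
    }
}
```

Scores computed this way are only comparable between runs that use the same points and the same k, since increasing k generally lowers the sum.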

[jira] [Resolved] (MATH-548) KMeansPlusPlusClusterer should run multiple trials

Posted by "Luc Maisonobe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MATH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luc Maisonobe resolved MATH-548.
--------------------------------

    Resolution: Fixed

Fixed in subversion tree as of r1137759.

Thanks for the report.

