Posted to issues@spark.apache.org by "Derrick Burns (JIRA)" <ji...@apache.org> on 2014/08/28 02:07:57 UTC

[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

    [ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113095#comment-14113095 ] 

Derrick Burns commented on SPARK-3261:
--------------------------------------

This choice also adversely affects performance.  I just ran clustering on 1.3M points, asking for 10,000 clusters.  The run produced 1,019 unique cluster centers.  The original algorithm ran for 4.5 hours.  The algorithm that does not allow duplicate cluster centers completed in 45 minutes, a 6x speedup on this dataset.
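
To illustrate what "at most K distinct centers" could mean in practice, here is a minimal sketch in plain Python. It is not the MLlib code and not the modified algorithm timed above; the helper name distinct_centers and the tol parameter are hypothetical, chosen only for this example. It keeps the first occurrence of each center and drops any later center within tol of one already kept.

```python
from typing import List, Sequence, Tuple


def distinct_centers(
    centers: Sequence[Sequence[float]], tol: float = 0.0
) -> List[Tuple[float, ...]]:
    """Return centers with duplicates removed, preserving first-seen order.

    Hypothetical helper for illustration only. Two centers count as
    duplicates when their Euclidean distance is <= tol (tol=0.0 means
    exact equality).
    """
    kept: List[Tuple[float, ...]] = []
    for c in centers:
        point = tuple(float(x) for x in c)
        # Compare squared distances to avoid taking a square root.
        is_dup = any(
            sum((a - b) ** 2 for a, b in zip(point, k)) <= tol * tol
            for k in kept
        )
        if not is_dup:
            kept.append(point)
    return kept


# Five requested centers, only three of them distinct.
requested = [(0, 0), (1, 1), (0, 0), (1, 1), (2, 2)]
unique = distinct_centers(requested)
assert len(unique) <= len(requested)
```

This pairwise check is quadratic in the number of centers, so it only shows the idea of de-duplication as a post-processing step. Avoiding duplicates during the iterations themselves, which is where the running time is spent, would need a change inside the algorithm.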

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>
> Returning duplicate cluster centers is a bad design choice.  I think it is preferable to produce no duplicate cluster centers.  So instead of forcing the number of clusters to be K, return at most K clusters.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
