Posted to issues@spark.apache.org by "Derrick Burns (JIRA)" <ji...@apache.org> on 2014/08/27 18:19:58 UTC

[jira] [Closed] (SPARK-3253) KMeans cluster will fail on large number of clusters/high dimensional data

     [ https://issues.apache.org/jira/browse/SPARK-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Derrick Burns closed SPARK-3253.
--------------------------------

    Resolution: Invalid

> KMeans cluster will fail on large number of clusters/high dimensional data
> --------------------------------------------------------------------------
>
>                 Key: SPARK-3253
>                 URL: https://issues.apache.org/jira/browse/SPARK-3253
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>
> The latest changes, which use broadcast to send the cluster centers to the workers, keep the closure size small but do not avoid the problem of returning the cluster centers to the master in the final collect() stage. At this step, collect() may fail because the resulting cluster centers are larger than the Akka frame size can accommodate. What is frustrating is that nothing indicates the failure was caused by exceeding the frame size. This makes it a Major issue, even though there is a simple workaround: increasing the frame size.
> What would help is a check BEFORE the clusterer begins the heavy lifting. The check would compute the expected result size and compare it to the Akka frame size. If the result won't fit, it should at the very least emit a warning.
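
The pre-flight check proposed above could look roughly like the following. This is a minimal sketch in plain Python, not MLlib code: the function names, the 1.2 serialization overhead factor, and the 10 MB default (assumed to match the default spark.akka.frameSize in Spark 1.0.x) are all assumptions for illustration. It only estimates the payload of k dense centers stored as 8-byte doubles.

```python
import warnings

BYTES_PER_DOUBLE = 8  # dense centers are assumed to hold 64-bit floats


def estimate_centers_bytes(k, dim, overhead_factor=1.2):
    """Rough size in bytes of k dense centers of dimension dim.

    overhead_factor is a hypothetical allowance for serialization overhead.
    """
    return int(k * dim * BYTES_PER_DOUBLE * overhead_factor)


def check_fits_frame(k, dim, frame_size_mb=10):
    """Return True if the estimated result fits in the frame; warn if not.

    frame_size_mb is assumed to mirror spark.akka.frameSize (in MB).
    """
    estimated = estimate_centers_bytes(k, dim)
    limit = frame_size_mb * 1024 * 1024
    if estimated > limit:
        warnings.warn(
            "Estimated cluster centers size (%d bytes) exceeds the frame "
            "size (%d bytes); consider raising spark.akka.frameSize."
            % (estimated, limit)
        )
        return False
    return True
```

For example, check_fits_frame(10000, 1000) estimates roughly 96 MB for the centers and warns against a 10 MB frame, whereas check_fits_frame(100, 100) passes silently. Running such a check before clustering starts would surface the likely cause up front instead of leaving an unexplained failure at collect().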



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org