Posted to issues@spark.apache.org by "Meethu Mathew (JIRA)" <ji...@apache.org> on 2016/01/28 13:31:40 UTC

[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib

    [ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121332#comment-15121332 ] 

Meethu Mathew commented on SPARK-8402:
--------------------------------------

[~mengxr] [~josephkb] This ticket has been idle for a long time. Could you please comment on what we can do next?

> Add DP means clustering to MLlib
> --------------------------------
>
>                 Key: SPARK-8402
>                 URL: https://issues.apache.org/jira/browse/SPARK-8402
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Meethu Mathew
>            Assignee: Meethu Mathew
>              Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance. 
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows for flexible clustering of data without having to specify the number of clusters a priori.
> DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters ["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
> We have followed the distributed implementation of DP means proposed in the paper titled "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
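
To make the role of 'lambda' concrete, here is a minimal serial sketch of the DP-means loop described by Kulis and Jordan. It is not the distributed implementation referred to in this ticket and not an MLlib API; the function name `dp_means` and its parameters are made up for illustration, and `lam` is assumed to be a threshold on squared Euclidean distance.

```python
def dp_means(points, lam, max_iter=100):
    """Serial DP-means sketch (hypothetical helper, not part of MLlib).

    points: list of equal-length numeric sequences.
    lam: squared-distance threshold; a point farther than this from every
         existing center opens a new cluster centered on itself.
    """
    n = len(points)
    # Start with a single cluster centered at the global mean.
    centers = [[sum(col) / n for col in zip(*points)]]
    labels = [0] * n

    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            # Squared Euclidean distance from p to every current center.
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            if dists[best] > lam:
                # Too far from all centers: create a new cluster at p.
                centers.append(list(p))
                best = len(centers) - 1
            if labels[i] != best:
                labels[i] = best
                changed = True

        # Recompute each center as the mean of its members and drop
        # clusters that ended up empty, renumbering labels to match.
        members = {}
        for i, k in enumerate(labels):
            members.setdefault(k, []).append(points[i])
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centers = [[sum(col) / len(members[old]) for col in zip(*members[old])]
                   for old in kept]
        labels = [remap[k] for k in labels]

        if not changed:
            break
    return centers, labels


# Two well-separated groups; lam chosen between the within-group and
# between-group squared distances.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centers, toy_labels = dp_means(toy, lam=4.0)
```

A larger `lam` makes new clusters more expensive and so yields fewer of them; with `lam` above the largest squared distance to the global mean, the sketch keeps a single cluster.
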
> A benchmark comparison between k-means and DP-means, based on Normalized Mutual Information (NMI) between ground-truth clusters and algorithm outputs, is provided in the following table. DP-means reported an NMI at least as high as that of k-means on 5 of 8 data sets: higher on 4 and tied on 1 (Vehicle). [Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics (2011), arXiv:1111.0352 (Table 1)]
> | Dataset       | DP-means | k-means |
> | Wine          | .41      | .43     |
> | Iris          | .75      | .76     |
> | Pima          | .02      | .03     |
> | Soybean       | .72      | .66     |
> | Car           | .07      | .05     |
> | Balance Scale | .17      | .11     |
> | Breast Cancer | .04      | .03     |
> | Vehicle       | .18      | .18     |
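
For readers unfamiliar with the metric in the table above, the following is a small sketch of how NMI between a ground-truth labeling and a clustering output can be computed. The normalization used here (mutual information divided by the arithmetic mean of the two entropies) is an assumption; other normalizations exist, and the cited paper may use a different one. The function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log


def nmi(truth, pred):
    """Normalized Mutual Information between two labelings (sketch).

    Assumes normalization by the arithmetic mean of the two entropies.
    """
    n = len(truth)
    count_t = Counter(truth)             # cluster sizes in the ground truth
    count_p = Counter(pred)              # cluster sizes in the prediction
    joint = Counter(zip(truth, pred))    # co-occurrence counts

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    # I(T; P) = sum over (t, p) of p(t, p) * log(p(t, p) / (p(t) * p(p)))
    mi = sum(c / n * log(c * n / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())

    h_t, h_p = entropy(count_t), entropy(count_p)
    if h_t == 0 and h_p == 0:
        # Both labelings put everything in one cluster: treat as identical.
        return 1.0
    return mi / ((h_t + h_p) / 2)
```

NMI is invariant to how clusters are numbered, so a clustering that matches the ground truth up to a relabeling scores 1, while one that is statistically independent of it scores 0.
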
> Experiment on our Spark cluster setup:
> An initial benchmark study was performed on a 3-node Spark cluster on Mesos, where each node had 8 cores and 64 GB RAM; the Spark version used was 1.5 (git branch).
> Tests were done using a mixture of 10 Gaussians with varying numbers of features and instances. The results from the benchmark study are provided below. The reported statistics are averages over 5 runs.
> | Instances   | Dimensions | DP-means clusters obtained | DP-means time | DP-means iterations to converge | k-means (k = 10) time | k-means (k = 10) iterations to converge |
> | 10 million  | 10         | 10                         | 43.6s         | 2                               | 52.2s                 | 2                                       |
> | 1 million   | 100        | 10                         | 39.8s         | 2                               | 43.39s                | 2                                       |
> | 0.1 million | 1000       | 10                         | 37.3s         | 2                               | 41.64s                | 2                                       |
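
The benchmark data is described only as a mixture of 10 Gaussians, so the sketch below shows one way such a data set could be generated at small scale. Everything beyond "10 components" is an assumption made for illustration: the function name `gaussian_mixture`, the uniform placement of component means, the shared isotropic standard deviation, and the equal mixing weights are not taken from the ticket.

```python
import random


def gaussian_mixture(n, dim, k=10, spread=10.0, sigma=1.0, seed=0):
    """Sample n points from a k-component isotropic Gaussian mixture.

    Hypothetical generator: means are drawn uniformly from
    [-spread, spread]^dim, every component has standard deviation sigma,
    and components are chosen with equal probability.
    """
    rng = random.Random(seed)
    means = [[rng.uniform(-spread, spread) for _ in range(dim)]
             for _ in range(k)]
    data, labels = [], []
    for _ in range(n):
        c = rng.randrange(k)                       # pick a component
        data.append([rng.gauss(m, sigma) for m in means[c]])
        labels.append(c)                           # ground-truth component
    return data, labels


# Small-scale stand-in for the benchmark shapes (e.g. 10 dimensions).
sample, sample_labels = gaussian_mixture(n=1000, dim=10)
```

The returned labels give the ground-truth component of each point, which is what a metric such as NMI would be compared against.
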


