Posted to issues@spark.apache.org by "Meethu Mathew (JIRA)" <ji...@apache.org> on 2016/01/28 13:31:40 UTC
[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121332#comment-15121332 ]
Meethu Mathew commented on SPARK-8402:
--------------------------------------
[~mengxr] [~josephkb] This ticket has been idle for a long time. Could you please comment on what we can do next?
> Add DP means clustering to MLlib
> --------------------------------
>
> Key: SPARK-8402
> URL: https://issues.apache.org/jira/browse/SPARK-8402
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Labels: features
>
> At present, all the clustering algorithms in MLlib require the number of clusters to be specified in advance.
> The Dirichlet process (DP) is a popular non-parametric Bayesian mixture model that allows flexible clustering of data without having to specify the number of clusters a priori.
> DP means is a non-parametric clustering algorithm that uses a scale parameter 'lambda' to control the creation of new clusters ["Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis, Michael I. Jordan].
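To make the role of 'lambda' concrete, here is a minimal serial sketch of the DP-means loop as described by Kulis and Jordan. It is not the distributed implementation proposed in this ticket and not an MLlib API; the function names (`dp_means`, `squared_distance`, `mean`) are hypothetical, and treating lambda as a threshold on squared Euclidean distance and dropping empty clusters are assumptions of this sketch.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dp_means(points, lam, max_iter=100):
    """Serial DP-means sketch (hypothetical helper, not an MLlib API).

    A point whose squared distance to its nearest center exceeds `lam`
    becomes the center of a new cluster.
    """
    centers = [mean(points)]           # start with one cluster at the global mean
    assignments = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            if dists[best] > lam:      # too far from every center: open a new cluster
                centers.append(tuple(p))
                best = len(centers) - 1
            if assignments[i] != best:
                assignments[i] = best
                changed = True
        # Recompute centers as cluster means, dropping clusters left empty.
        members = {}
        for i, k in enumerate(assignments):
            members.setdefault(k, []).append(points[i])
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centers = [mean(members[k]) for k in kept]
        assignments = [remap[k] for k in assignments]
        if not changed:                # no assignment moved: converged
            break
    return centers, assignments
```

With two well-separated groups of points, a moderate `lam` recovers two clusters, a very large `lam` keeps everything in one cluster, and a very small `lam` puts every point in its own cluster. That dependence on `lam` is what replaces the choice of k in k-means.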
> We followed the distributed implementation of DP-means proposed in the paper "MLbase: Distributed Machine Learning Made Easy" by Xinghao Pan, Evan R. Sparks, Andre Wibisono.
> A benchmark comparison between k-means and DP-means, based on Normalized Mutual Information (NMI) between ground-truth clusters and algorithm outputs, is provided in the following table. It can be seen from the table that DP-means matched or exceeded the NMI of k-means on 5 of 8 data sets (higher on 4, tied on 1). [Source: Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics (2011) arXiv:1111.0352. (Table 1)]
> | Dataset | DP-means | k-means |
> | Wine | .41 | .43 |
> | Iris | .75 | .76 |
> | Pima | .02 | .03 |
> | Soybean | .72 | .66 |
> | Car | .07 | .05 |
> | Balance Scale | .17 | .11 |
> | Breast Cancer | .04 | .03 |
> | Vehicle | .18 | .18 |
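For readers who want to reproduce a score like the ones in the table above, the following is a small sketch of NMI between two label lists. The helper names are hypothetical, and normalizing by the geometric mean of the two entropies is an assumption: the ticket does not say which normalization was used, and other variants (arithmetic mean, max, min) give different values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label lists of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
        for (i, j), nij in joint.items()
    )


def nmi(a, b):
    """NMI normalized by sqrt(H(a) * H(b)); the normalization is an assumption."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 or hb == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / math.sqrt(ha * hb)
```

NMI is invariant to how clusters are numbered, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` scores a perfect match while two independent labelings score zero.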
> Experiment on our Spark cluster setup:
> An initial benchmark study was performed on a 3-node Spark cluster set up on Mesos, where each node had 8 cores and 64 GB RAM; the Spark version used was 1.5 (git branch).
> Tests were done using a mixture of 10 Gaussians with varying numbers of features and instances. The results from the benchmark study are provided below. The reported stats are averages over 5 runs.
> | Instances | Dimensions | DP-means: clusters obtained | DP-means: time | DP-means: iterations to converge | k-means (k = 10): time | k-means (k = 10): iterations to converge |
> | 10 million | 10 | 10 | 43.6s | 2 | 52.2s | 2 |
> | 1 million | 100 | 10 | 39.8s | 2 | 43.39s | 2 |
> | 0.1 million | 1000 | 10 | 37.3s | 2 | 41.64s | 2 |
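A test set of the kind described above (a mixture of 10 Gaussians with a chosen number of instances and dimensions) could be generated along these lines. This is only a sketch: the ticket does not state the component means, variances, or mixing weights, so the uniform placement of means, the shared isotropic `sigma`, the equal weights, and the function name `gaussian_mixture` are all assumptions.

```python
import random


def gaussian_mixture(n_points, n_dims, n_components=10,
                     spread=50.0, sigma=1.0, seed=0):
    """Sample points from an equal-weight isotropic Gaussian mixture.

    All parameter choices are assumptions, not the ticket's benchmark setup.
    Returns (points, labels), where labels[i] is the generating component.
    """
    rng = random.Random(seed)
    # Component means drawn uniformly from [-spread, spread] in each dimension.
    means = [[rng.uniform(-spread, spread) for _ in range(n_dims)]
             for _ in range(n_components)]
    points, labels = [], []
    for _ in range(n_points):
        k = rng.randrange(n_components)            # pick a component uniformly
        points.append([rng.gauss(m, sigma) for m in means[k]])
        labels.append(k)
    return points, labels
```

The returned labels serve as ground truth, so the same data can be used both for timing runs and for an NMI comparison against the clusters an algorithm finds.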