Posted to issues@spark.apache.org by "Meethu Mathew (JIRA)" <ji...@apache.org> on 2014/10/01 08:30:33 UTC
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154434#comment-14154434 ]
Meethu Mathew commented on SPARK-3588:
--------------------------------------
OK. We will start implementing the Scala version of the Gaussian Mixture Model.
> Gaussian Mixture Model clustering
> ---------------------------------
>
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, PySpark
> Reporter: Meethu Mathew
> Assignee: Meethu Mathew
> Attachments: GMMSpark.py
>
>
> Gaussian Mixture Models (GMM) are a popular technique for soft clustering. A GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight π. In this technique, the probability that each point belongs to each cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark, where the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation assumes a diagonal covariance matrix for each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster, where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets.
> Below are the results from this benchmark study. The reported stats are averages over 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || K-means (Python): avg time per iteration || K-means (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7 s | 12 min | 13 s | 26 min |
> | 1.8 million | 11 | 17 s | 29 min | 33 s | 53 min |
> | 10 million | 16 | 1.6 min | 2.7 hr | 1.2 min | 2 hr |
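For readers unfamiliar with the technique described in the issue, here is a minimal single-machine sketch of EM for a GMM with diagonal covariance, in plain Python. It is not the attached GMMSpark.py, and the names em_step, fit_gmm, log_gaussian_diag and min_var are hypothetical, chosen only for illustration. The E-step accumulates per-component sums over the points, so in a distributed setting those sums could presumably be computed per partition and then combined, but that part is an assumption and is not shown here.

```python
import math
import random


def log_gaussian_diag(x, mean, var):
    """Log density at x of a Gaussian with diagonal covariance (per-dimension var)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def em_step(points, weights, means, variances, min_var=1e-6):
    """One EM iteration. Returns (weights, means, variances, log_likelihood),
    where log_likelihood is evaluated at the parameters passed in."""
    k, d, n = len(weights), len(points[0]), len(points)
    resp_sum = [0.0] * k                      # sum of responsibilities per component
    sum_x = [[0.0] * d for _ in range(k)]     # responsibility-weighted sum of x
    sum_x2 = [[0.0] * d for _ in range(k)]    # responsibility-weighted sum of x^2
    log_likelihood = 0.0

    # E-step: soft-assign each point and accumulate sufficient statistics.
    for x in points:
        logs = [math.log(weights[j]) + log_gaussian_diag(x, means[j], variances[j])
                for j in range(k)]
        top = max(logs)  # log-sum-exp trick to avoid underflow
        log_norm = top + math.log(sum(math.exp(lg - top) for lg in logs))
        log_likelihood += log_norm
        for j in range(k):
            r = math.exp(logs[j] - log_norm)  # P(component j | x)
            resp_sum[j] += r
            for i in range(d):
                sum_x[j][i] += r * x[i]
                sum_x2[j][i] += r * x[i] * x[i]

    # M-step: re-estimate parameters from the accumulated statistics.
    # The small floors guard against empty components and zero variance.
    n_j = [max(s, 1e-12) for s in resp_sum]
    new_weights = [n_j[j] / n for j in range(k)]
    new_means = [[sum_x[j][i] / n_j[j] for i in range(d)] for j in range(k)]
    new_vars = [[max(sum_x2[j][i] / n_j[j] - new_means[j][i] ** 2, min_var)
                 for i in range(d)] for j in range(k)]
    return new_weights, new_means, new_vars, log_likelihood


def fit_gmm(points, k, iterations=50, seed=0):
    """Run EM from a simple initialization: k random points as means,
    the global per-dimension variance for every component, uniform weights."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    means = [list(p) for p in rng.sample(points, k)]
    mu = [sum(p[i] for p in points) / n for i in range(d)]
    var = [max(sum((p[i] - mu[i]) ** 2 for p in points) / n, 1e-6) for i in range(d)]
    variances = [list(var) for _ in range(k)]
    weights = [1.0 / k] * k
    history = []  # log-likelihood per iteration, useful as a convergence check
    for _ in range(iterations):
        weights, means, variances, ll = em_step(points, weights, means, variances)
        history.append(ll)
    return weights, means, variances, history
```

Calling fit_gmm(points, k) on a list of equal-length numeric lists returns the mixture weights, the means, the per-dimension variances, and the log-likelihood history. A fixed iteration count is used here only to mirror the "100 iterations" timing in the table; a tolerance on the change in log-likelihood would be the more usual stopping rule.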
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)