Posted to issues@spark.apache.org by "Meethu Mathew (JIRA)" <ji...@apache.org> on 2014/09/18 12:47:33 UTC
[jira] [Created] (SPARK-3588) Gaussian Mixture Model clustering

Meethu Mathew created SPARK-3588:
------------------------------------

             Summary: Gaussian Mixture Model clustering
                 Key: SPARK-3588
                 URL: https://issues.apache.org/jira/browse/SPARK-3588
             Project: Spark
          Issue Type: New Feature
          Components: MLlib, PySpark
            Reporter: Meethu Mathew


A Gaussian Mixture Model (GMM) is a popular technique for soft clustering. A GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight π. In this technique, the probability that each point belongs to each cluster is computed along with the cluster statistics.
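To make the notion of soft clustering concrete, here is a minimal sketch of how the membership probabilities of a single point could be computed. It is an illustration only, not code from the proposed implementation: it uses one-dimensional Gaussians, and the function names and parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Posterior probability that x was generated by each mixture component."""
    # Weighted density of x under each component: pi_k * N(x | mu_k, sigma_k^2).
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    # Normalising turns the weighted densities into probabilities that sum to 1.
    return [j / total for j in joint]

# Two hypothetical components centred at 0 and 4 with equal weights.
resp = responsibilities(1.0, weights=[0.5, 0.5], means=[0.0, 4.0], variances=[1.0, 1.0])
midpoint = responsibilities(2.0, weights=[0.5, 0.5], means=[0.0, 4.0], variances=[1.0, 1.0])
```

Unlike k-means, which would assign the point 1.0 entirely to the nearer centre, the mixture model keeps a small probability for the other component. A point halfway between the two centres gets equal membership in both.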

We have developed an initial distributed implementation of GMM in PySpark in which the parameters are estimated using the Expectation-Maximization algorithm. Our current implementation assumes a diagonal covariance matrix for each component.
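As a rough illustration of Expectation-Maximization with diagonal covariances, the sketch below runs one iteration as a per-point E-step, an associative merge of sufficient statistics, and a closed-form M-step. It is plain Python rather than the proposed PySpark code, and every name in it (e_step_stats, merge, m_step, em_iteration, min_var) is hypothetical. The merge step is the part that would presumably map onto a distributed reduce, but that mapping is an assumption and not a description of the actual implementation.

```python
import math
from functools import reduce

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def e_step_stats(x, weights, means, variances):
    """E-step for one point: per-component (r, r*x, r*x^2) and the point's log-likelihood."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    norm = top + math.log(sum(math.exp(l - top) for l in logs))  # log-sum-exp
    resp = [math.exp(l - norm) for l in logs]
    stats = [(r, [r * xi for xi in x], [r * xi * xi for xi in x]) for r in resp]
    return stats, norm

def merge(a, b):
    """Add two partial results; associative, so partials can be combined in any order."""
    stats = [(r1 + r2, [p + q for p, q in zip(s1, s2)], [p + q for p, q in zip(t1, t2)])
             for (r1, s1, t1), (r2, s2, t2) in zip(a[0], b[0])]
    return stats, a[1] + b[1]

def m_step(stats, n, min_var=1e-6):
    """M-step: closed-form updates of weights, means and diagonal variances."""
    weights, means, variances = [], [], []
    for r, sx, sxx in stats:
        r_safe = max(r, 1e-12)  # guard against a component that received no mass
        mean = [s / r_safe for s in sx]
        # Var = E[x^2] - E[x]^2 per dimension, floored to keep variances positive.
        var = [max(t / r_safe - m * m, min_var) for t, m in zip(sxx, mean)]
        weights.append(r / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances

def em_iteration(points, weights, means, variances):
    """One EM iteration; returns (weights, means, variances, log_likelihood)."""
    partials = (e_step_stats(x, weights, means, variances) for x in points)
    stats, loglik = reduce(merge, partials)
    return m_step(stats, len(points)) + (loglik,)

# Toy data: two well-separated groups in two dimensions (made up for illustration).
points = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
params = ([0.5, 0.5], [[1.0, 1.0], [4.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]])
for _ in range(10):
    *params, loglik = em_iteration(points, *params)
```

With a diagonal covariance, each component needs only the per-dimension sums of r, r*x and r*x^2, so the statistics exchanged per component grow linearly with the number of dimensions rather than quadratically as they would for a full covariance matrix.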

We ran an initial benchmark study on a 2-node Spark standalone cluster, where each node has 8 cores and 8 GB of RAM, using Spark 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets.
The results of this benchmark study are given below. The reported figures are averages over 10 runs. Tests were run on multiple datasets with varying numbers of features and instances.

||Instances||Dimensions||GMM: avg time per iteration||GMM: time for 100 iterations||K-means (Python): avg time per iteration||K-means (Python): time for 100 iterations||
|0.7 million|13|7 s|12 min|13 s|26 min|
|1.8 million|11|17 s|29 min|33 s|53 min|
|10 million|16|1.6 min|2.7 hr|1.2 min|2 hr|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org