You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2014/11/20 10:21:34 UTC

[jira] [Commented] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

    [ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219163#comment-14219163 ] 

Apache Spark commented on SPARK-4510:
-------------------------------------

User 'fjiang6' has created a pull request for this issue:
https://github.com/apache/spark/pull/3382

> Add k-medoids Partitioning Around Medoids (PAM) algorithm
> ---------------------------------------------------------
>
>                 Key: SPARK-4510
>                 URL: https://issues.apache.org/jira/browse/SPARK-4510
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Fan Jiang
>              Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PAM (k-medoids) is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal i.e. it is a most centrally located point in the cluster.
> The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm and is as follows:
> Initialize: randomly select (without replacement) k of the n data points as the medoids
> Associate each data point to the closest medoid. ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)
> For each medoid m
> For each non-medoid data point o
> Swap m and o and compute the total cost of the configuration
> Select the configuration with the lowest cost.
> Repeat steps 2 to 4 until there is no change in the medoid.
> The new feature for MLlib will contain 5 new files
> /main/scala/org/apache/spark/mllib/clustering/PAM.scala
> /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
> /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
> /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
> /main/scala/org/apache/spark/examples/mllib/KMedoids.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org