Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/05/31 12:23:00 UTC

[jira] [Commented] (SPARK-27896) Fix definition of clustering silhouette coefficient for 1-element clusters

    [ https://issues.apache.org/jira/browse/SPARK-27896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852962#comment-16852962 ] 

Sean Owen commented on SPARK-27896:
-----------------------------------

Copying follow-up from email:

Yes, the paper does say the silhouette is 0 in this case. That's an
argument to change it.

On the other hand, I am not sure if I agree with the paper here. If A
consists of one point, then that point's assignment is optimal in a
sense. Setting the silhouette to 0 indicates that assigning it to B,
which is a cluster of more distant points, is just as good. I don't
think that makes as much sense as 1, which it returns now.
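
For reference, here is a sketch of the per-point definition and of what
the two conventions give for a 1-element cluster. The symbols a(i) and
s(i) are the ones quoted from the paper; b(i) and d(i, j) are the usual
nearest-cluster average and pairwise dissimilarity, written out here as
I understand them rather than copied from the Spark code:

```latex
a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\, j \neq i} d(i, j), \qquad
b(i) = \min_{C \neq A} \frac{1}{|C|} \sum_{j \in C} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

% |A| = 1: a(i) is undefined (the sum is empty and |A| - 1 = 0).
% Convention a(i) := 0 (what the code does now):
%   s(i) = \frac{b(i) - 0}{\max\{0,\, b(i)\}} = 1  \quad (b(i) > 0)
% Convention of the paper:
%   s(i) := 0
```

So the value of 1 is not chosen directly; it falls out of setting a(i)
to 0 for a point that has no other members in its cluster.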

You could argue that silhouette is specifically penalizing this type
of assignment, in a way that Euclidean distance does not.
Wikipedia's definition follows the paper:
https://en.wikipedia.org/wiki/Silhouette_(clustering)
It looks like sklearn also follows the paper's definition:
https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/metrics/cluster/unsupervised.py#L235

> Fix definition of clustering silhouette coefficient for 1-element clusters
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27896
>                 URL: https://issues.apache.org/jira/browse/SPARK-27896
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>
> Reported by Samuel Kubler via email:
> In the code https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala, I think there is a small mistake in the class “Silhouette” where the silhouette coefficient for a point is calculated. According to the reference paper, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis” (Peter J. Rousseeuw, 1986), for points that are alone in a cluster it is not currentClusterDissimilarity that should be set to 0, as the code does (“val currentClusterDissimilarity = if (pointClusterNumOfPoints == 1) {0.0}”), but the silhouette coefficient itself: “When cluster A contains only a single object it is unclear how a(i) should be defined, and then we simply set s(i) equal to zero”.
> The problem with setting currentClusterDissimilarity to zero is that the silhouette coefficient can no longer be used as a criterion for choosing the optimal number of clusters, because it will indicate that the more clusters you have, the better the clustering. As the number of clusters increases, more points end up alone in their cluster with s(i) = 1, so the mean silhouette converges toward 1 and the clustering merely appears to be better. I have also checked this result on my own clustering example.
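
The effect described in the report can be reproduced with a small
standalone sketch. This is not Spark's ClusteringEvaluator code: the
function name, the `singleton` parameter, the toy data and the use of
plain Euclidean distance are all assumptions made for illustration. It
computes the mean silhouette under both conventions for a 1-element
cluster and compares a natural 3-cluster labeling with the degenerate
one-cluster-per-point labeling.

```python
from math import dist
from statistics import mean


def mean_silhouette(points, labels, singleton="paper"):
    """Mean silhouette over all points (hypothetical helper, not Spark's API).

    singleton="paper":  s(i) = 0 for a point alone in its cluster.
    singleton="zero_a": a(i) = 0 for such a point, so s(i) = 1 when b(i) > 0.
    """
    if singleton not in ("paper", "zero_a"):
        raise ValueError("singleton must be 'paper' or 'zero_a'")
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        # b(i): smallest mean distance from p to the points of another cluster
        b = min(mean(dist(p, q) for q in other)
                for k, other in clusters.items() if k != label)
        if len(own) == 1:
            if singleton == "paper":
                scores.append(0.0)
                continue
            a = 0.0
        else:
            # a(i): mean distance to the other members of p's own cluster
            # (dist(p, p) = 0, so including p in the sum is harmless)
            a = sum(dist(p, q) for q in own) / (len(own) - 1)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Toy data: three well-separated pairs on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
natural = [0, 0, 1, 1, 2, 2]     # k = 3, no 1-element clusters
degenerate = [0, 1, 2, 3, 4, 5]  # k = 6, every point alone

for name in ("zero_a", "paper"):
    print(name,
          round(mean_silhouette(points, natural, name), 3),
          mean_silhouette(points, degenerate, name))
```

With a(i) set to 0, the one-point-per-cluster labeling gets a mean
silhouette of exactly 1 and so beats the natural labeling; with s(i)
set to 0 it scores 0 and the natural labeling wins. The natural
labeling scores the same under both conventions because it contains no
1-element clusters.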



