Posted to issues@spark.apache.org by "Julian Jorczik (Jira)" <ji...@apache.org> on 2021/03/08 13:46:00 UTC

[jira] [Created] (SPARK-34664) Provide silhouette score for each sample when using ClusteringEvaluator

Julian Jorczik created SPARK-34664:
--------------------------------------

             Summary: Provide silhouette score for each sample when using ClusteringEvaluator
                 Key: SPARK-34664
                 URL: https://issues.apache.org/jira/browse/SPARK-34664
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.1.1
            Reporter: Julian Jorczik


ClusteringEvaluator already computes the average silhouette score. Looking at the [source code|https://gitlab.com/mark91/SparkClusteringEvaluationMetrics/-/blob/master/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouetteEvaluator.scala] of ClusteringEvaluator, I think it would be easy to expose the silhouette score for each sample in addition to the average, since the per-sample scores are already computed (lines 95-99).
The per-sample silhouette score is useful, for instance, for generating a silhouette plot as described in [this scikit-learn article|https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html]. The resulting feature would be equivalent to the silhouette_samples function in scikit-learn.
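
To illustrate the kind of output requested, here is a minimal sketch in plain Python of a per-sample silhouette computation. It is not Spark code and does not reflect how ClusteringEvaluator is implemented. The function name, the toy data, and the use of plain Euclidean distance are assumptions made for illustration. For each sample it computes a (mean distance to the other members of its own cluster) and b (smallest mean distance to the members of any other cluster), then returns (b - a) / max(a, b).

```python
from collections import defaultdict
from math import dist


def silhouette_samples(points, labels):
    """Return one silhouette coefficient per point (plain Euclidean distance).

    Hypothetical helper for illustration only; not part of Spark.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Assumed convention (as in scikit-learn): singleton clusters score 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy data: two well-separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
per_sample = silhouette_samples(points, labels)
# The average of the per-sample values is the single number reported today.
average = sum(per_sample) / len(per_sample)
```

The per-sample list is what a silhouette plot needs: the values are grouped by cluster and sorted, which is not possible when only the average is returned.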


