Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2017/07/05 07:51:08 UTC

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/18538

    [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette.

    
    ## What changes were proposed in this pull request?
    
    This PR adds ClusteringEvaluator, a new Evaluator which provides two metrics:
     - **cosineSilhouette**: the Silhouette measure using the cosine distance;
     - **squaredSilhouette**: the Silhouette measure using the squared Euclidean distance.
    
    The implementation of the two metrics follows the algorithm proposed and explained [here](https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view). These algorithms are designed for a distributed and parallel environment, so they scale reasonably well, unlike a naive Silhouette implementation that follows the definition directly.
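    
    A rough usage sketch (the KMeans step, the input `dataset` and the column names are only illustrative; the setters are the ones proposed in this PR):
    
    ```scala
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator
    
    // Illustrative only: cluster some features, then score the resulting assignment.
    val predictions = new KMeans()
      .setK(3)
      .setFeaturesCol("features")
      .fit(dataset)
      .transform(dataset)
    
    val evaluator = new ClusteringEvaluator()
      .setFeaturesCol("features")
      .setPredictionCol("prediction")
      .setMetricName("squaredSilhouette") // "cosineSilhouette" is the other proposed metric
    
    val silhouette = evaluator.evaluate(predictions)
    ```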
    
    ## How was this patch tested?
    
    The patch has been tested with the newly added unit tests, which compare the results with those produced by the [Python scikit-learn library](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-14516

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18538.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18538
    
----
commit 64e17b4db825f85ee19d30ca38cba887633c6900
Author: Marco Gaido <mg...@hortonworks.com>
Date:   2017-06-30T15:05:17Z

    [SPARK-14516] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80860/testReport)** for PR 18538 at commit [`a4ca3cd`](https://github.com/apache/spark/commit/a4ca3cd18852abc8076905a586c6b0f4b622cff6).




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133476670
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,225 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    --- End diff --
    
    @mgaido91 Sorry, I mistakenly thought it would go into the src resources rather than the test resources. Usually we generate a dataset in code to verify MLlib results; so far we have never put an existing dataset in the resources, even at test scope. But the iris dataset is so popular and can be used to verify lots of algorithms, so I'm OK with putting it there. Thanks.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133370279
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    --- End diff --
    
    remove empty line




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80285/
    Test PASSed.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @yanboliang @mgaido91  I just saw this PR.  It creates a new test data directory.  Could you please send a quick update to move the data to the existing data directory: https://github.com/apache/spark/tree/master/data/mllib ?  Thanks!




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81639/
    Test PASSed.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r134449164
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val result = BigDecimal(0.6564679231)
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +    val actual = BigDecimal(evaluator.evaluate(iris))
    +      .setScale(10, BigDecimal.RoundingMode.HALF_UP)
    +
    +    assertResult(result)(actual)
    +  }
    +
    +}
    +
    +object ClusteringEvaluatorSuite {
    +  def irisDataset(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +
    +    val irisCsvPath = Thread.currentThread()
    +      .getContextClassLoader
    +      .getResource("test-data/iris.csv")
    +      .toString
    --- End diff --
    
    So this test suite references another test data file. Can we generate the test data in the code, like the other test suites do?
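    
    For example, a rough (untested) sketch of what that could look like, reusing the `ClusteringEvaluationTestData` case class defined at the top of this suite (the points and labels below are made up, not the iris data):
    
    ```scala
    // Build a tiny, clearly separated dataset directly in the test instead of reading iris.csv.
    val clustered = Seq(
      ClusteringEvaluationTestData(Vectors.dense(0.0, 0.0), 0),
      ClusteringEvaluationTestData(Vectors.dense(0.1, 0.1), 0),
      ClusteringEvaluationTestData(Vectors.dense(5.0, 5.0), 1),
      ClusteringEvaluationTestData(Vectors.dense(5.1, 4.9), 1)
    ).toDF()  // via testImplicits; yields "features" and "label" columns
    ```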



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137178071
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, a naive implementation of the algorithm requires computing
    + * the distance between every pair of points in the dataset. Since computing
    + * the distance measure takes `D` operations - where `D` is the number of dimensions
    + * of each point - the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    --- End diff --
    
    I'd suggest changing ```d(X, C_{i} )^2``` to ```d(X, C_{i} )```: we don't define ```d()``` to be the _Euclidean distance_ anywhere, so we can simply regard it as the _squared Euclidean distance_. What do you think?
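    
    For reference, this is the identity that blockquote is building up to, written out with squared Euclidean distances (a sketch only; it is what lets the implementation use just the per-cluster feature sum, squared-norm sum and point count):
    
    ```latex
    \sum_{i=1}^{N} \| X - C_{i} \|^{2}
      = N \|X\|^{2} + \sum_{i=1}^{N} \|C_{i}\|^{2} - 2 \, X \cdot \sum_{i=1}^{N} C_{i}
    ```
    
    Dividing by `N` gives the average dissimilarity of `X` to the cluster using only those precomputed sums.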




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133370353
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    --- End diff --
    
    remove empty line, here and elsewhere




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81639/testReport)** for PR 18538 at commit [`b0b7853`](https://github.com/apache/spark/commit/b0b7853d68c1c79bd49d6e290d3c96fe9e3af6ea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    ok to test




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Jenkins, test this please.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    The build error is caused by a file that is unexpectedly present for SparkR. The issue is unrelated to this PR (which doesn't even touch SparkR). I am not sure whether someone is working on the CI infra or what the root cause of the error is. Does anybody have an idea? Thanks.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133745930
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    --- End diff --
    
    I included the link to the design document here: https://github.com/mgaido91/spark/blob/ffc17f929dd86d1e7e73931eac5663bc08b6ba7a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala#L37. Should I move it from there? Or should I rewrite the content of the document in an annotation here? Thanks!




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133157997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    +  }
    +
    +  def computeSquaredSilhouette(dataset: Dataset[_],
    +    predictionCol: String,
    --- End diff --
    
    The indentation should be four spaces in this and the following lines.
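    
    I.e., roughly like this (a sketch of the intended layout only; the remaining parameter and return type are inferred from how the method is called in `evaluate`):
    
    ```scala
    def computeSquaredSilhouette(
        dataset: Dataset[_],
        predictionCol: String,
        featuresCol: String): Double = {
      // ...
    }
    ```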




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133744052
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    --- End diff --
    
    I can't see any of the other evaluators in the wiki, and I don't see a detailed explanation of the maths behind the algorithms there either. Thus I am not sure it is the best place.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80453/
    Test PASSed.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80285/testReport)** for PR 18538 at commit [`923418a`](https://github.com/apache/spark/commit/923418a7139e9cd038882499e7ac0aa544a14858).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137251320
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, and `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires computing
    + * the distance between every pair of points in the dataset. Since computing
    + * the distance measure takes `D` operations - where `D` is the number of dimensions
    + * of each point - the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    --- End diff --
    
    No, `x_{i}` is not a vector. `X` is a vector (which represents a point). `x_{i}` is a typo that I am fixing to `x_{j}`, which is a scalar, not a vector.
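    
    For context, here is a minimal end-to-end usage sketch of the evaluator quoted above (assuming a DataFrame `dataset` with a vector column named "features"; the KMeans step is only there to produce the integer prediction column the evaluator expects):
    
        import org.apache.spark.ml.clustering.KMeans
        import org.apache.spark.ml.evaluation.ClusteringEvaluator
    
        // `dataset` is assumed to already contain a vector column named "features"
        val kmeans = new KMeans().setK(3).setFeaturesCol("features")
        val predictions = kmeans.fit(dataset).transform(dataset) // adds an integer "prediction" column
    
        val evaluator = new ClusteringEvaluator()
          .setFeaturesCol("features")
          .setPredictionCol("prediction")
    
        // mean Silhouette over all points, in [-1, 1]; larger is better (isLargerBetter = true)
        val silhouette = evaluator.evaluate(predictions)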


---



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80862 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80862/testReport)** for PR 18538 at commit [`a7db896`](https://github.com/apache/spark/commit/a7db8962745bd000da0737018eef4b1680425c90).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131868145
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,171 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.linalg.{Vector, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.Dataset
    +import org.apache.spark.sql.functions.{avg, col}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): Evaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default), `"cosineSilhouette"`)
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette", "cosineSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette|cosineSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        computeSquaredSilhouette(dataset)
    +      case "cosineSilhouette" =>
    +        computeCosineSilhouette(dataset)
    +    }
    +    metric
    +  }
    +
    +  private[this] def computeCosineSilhouette(dataset: Dataset[_]): Double = {
    +    CosineSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val computeCsi = dataset.sparkSession.udf.register("computeCsi",
    --- End diff --
    
    Could we use a more descriptive name? We can't tell what this function does from its name.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133176185
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    --- End diff --
    
    Add ```@Since("2.3.0")``` here and in the other places where it is necessary.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80862/
    Test PASSed.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137239650
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    --- End diff --
    
    ```@Since("2.3.0") ```
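    
    Concretely, something along these lines (just a sketch; the author should confirm the full list of members that need the annotation):
    
        @Since("2.3.0")
        def this() = this(Identifiable.randomUID("cluEval"))
    
        @Since("2.3.0")
        override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)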


---



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133158077
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    +  }
    +
    +  def computeSquaredSilhouette(dataset: Dataset[_],
    +    predictionCol: String,
    +    featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNorm = udf {
    +      features: Vector =>
    +        math.pow(Vectors.norm(features, 2.0), 2.0)
    --- End diff --
    
    Move this line up to the previous line.
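    
    i.e. roughly like this (one possible reading of the suggestion, keeping the same behaviour):
    
        val squaredNorm = udf { features: Vector =>
          math.pow(Vectors.norm(features, 2.0), 2.0)
        }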


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    ok to test


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @yanboliang thanks for your review.
    I refactored the code according to your suggestions and I removed the cosine implementation.
    Could you please review it again now?
    Thanks.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133182964
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    --- End diff --
    
    Should I then introduce a new param for the distance measure? I think it is important to highlight that the distance measure used is the squared Euclidean distance, because if we don't state it very clearly, anybody would assume that the plain Euclidean distance is used, IMHO.
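    
    If we went that way, a hypothetical `distanceMeasure` param could mirror the existing `metricName` pattern (only a sketch; the name, the allowed values and the default would still need to be agreed on):
    
        /**
         * Hypothetical param for the distance measure used by the silhouette
         * (supports `"squaredEuclidean"` (default); others could be added later).
         * @group param
         */
        val distanceMeasure: Param[String] = {
          val allowedParams = ParamValidators.inArray(Array("squaredEuclidean"))
          new Param(this, "distanceMeasure",
            "distance measure used in the silhouette computation (squaredEuclidean)", allowedParams)
        }
    
        /** @group getParam */
        def getDistanceMeasure: String = $(distanceMeasure)
    
        /** @group setParam */
        def setDistanceMeasure(value: String): this.type = set(distanceMeasure, value)
    
        setDefault(distanceMeasure -> "squaredEuclidean")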


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    retest this please


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137239744
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    --- End diff --
    
    ```@Since("2.3.0")```


---



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80860/
    Test FAILed.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test FAILed.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131892095
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.{col, count, sum}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    --- End diff --
    
    Let's move this into the ```ClusteringEvaluator``` file.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133175654
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    --- End diff --
    
    ```clusterCurrentPoint``` -> ```currentCluster```?
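    
    As a side note on the quoted `compute` helper: it relies on the fact that the average squared Euclidean distance from a point to all the points of a cluster can be obtained from the cluster's precomputed `featureSum`, `squaredNormSum` and `numOfPoints` alone. A tiny plain-Scala sanity check of that identity on toy numbers (no Spark involved):
    
        val x = Seq(1.0, 2.0)
        val cluster = Seq(Seq(0.0, 0.0), Seq(2.0, 2.0), Seq(4.0, 0.0))
    
        def sqDist(a: Seq[Double], b: Seq[Double]): Double =
          a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum
    
        // naive definition: average of the squared distances to every point of the cluster
        val naiveAvg = cluster.map(c => sqDist(x, c)).sum / cluster.size
    
        // same value from the per-cluster statistics kept in ClusterStats
        val n = cluster.size
        val featureSum = cluster.transpose.map(_.sum)        // element-wise sum of the cluster points
        val squaredNormSum = cluster.map(c => c.map(v => v * v).sum).sum
        val xSquaredNorm = x.map(v => v * v).sum
        val xDotFeatureSum = x.zip(featureSum).map { case (u, v) => u * v }.sum
        val statsAvg = xSquaredNorm + squaredNormSum / n - 2 * xDotFeatureSum / n
    
        assert(math.abs(naiveAvg - statsAvg) < 1e-9)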


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @jkbradley I am not sure that we should put the data for tests of the ml package in the mllib package. Is this the right approach?


---



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r134456779
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val result = BigDecimal(0.6564679231)
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +    val actual = BigDecimal(evaluator.evaluate(iris))
    +      .setScale(10, BigDecimal.RoundingMode.HALF_UP)
    +
    +    assertResult(result)(actual)
    +  }
    +
    +}
    +
    +object ClusteringEvaluatorSuite {
    +  def irisDataset(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +
    +    val irisCsvPath = Thread.currentThread()
    +      .getContextClassLoader
    +      .getResource("test-data/iris.csv")
    +      .toString
    --- End diff --
    
    There was a discussion about this in the outdated comments. From my point of view, the main reason to avoid test data generation is that the generated data would have to be clustered before the Silhouette can be computed.
    The iris dataset is well known and already comes with cluster labels, so it seemed the best option.
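    
    For illustration only, a rough sketch of what evaluating generated data would look like (hypothetical dataset path, assuming `spark` is an existing `SparkSession`); the KMeans step is exactly what the pre-clustered iris data lets the test skip:
    
    ```scala
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator
    
    // hypothetical generated dataset, with no cluster assignments yet
    val generated = spark.read.format("libsvm").load("data/generated_points.txt")
    
    // the data must be clustered first to obtain the integer "prediction" column
    val model = new KMeans().setK(3).setSeed(1L).fit(generated)
    val clustered = model.transform(generated)
    
    // only then can the Silhouette be computed
    val silhouette = new ClusteringEvaluator()
      .setFeaturesCol("features")
      .setPredictionCol("prediction")
      .evaluate(clustered)
    ```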



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137239478
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    --- End diff --
    
    ```class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)```
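    
    That is, annotate both the constructor and the `uid` parameter with `@Since`, and mark `uid` as `override` since it implements `Identifiable.uid`.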




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Can one of the admins verify this patch?




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80281/
    Test FAILed.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81287/testReport)** for PR 18538 at commit [`45d1380`](https://github.com/apache/spark/commit/45d1380574ece58ff63c34ff31af6243aff16c3c).




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80453 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80453/testReport)** for PR 18538 at commit [`ffc17f9`](https://github.com/apache/spark/commit/ffc17f929dd86d1e7e73931eac5663bc08b6ba7a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131891836
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/CosineSilhouette.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{DenseVector, Vector, VectorElementWiseSum}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.{col, count}
    +
    +private[evaluation] object CosineSilhouette {
    --- End diff --
    
    Currently there are no clustering algorithms using distance metrics other than the squared Euclidean distance. I'd suggest removing the ```CosineSilhouette``` implementation for now; we can add it back when it's needed. This would also make this PR easier to review.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138090203
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    // Silhouette is reasonable only when the number of clusters is greater than one
    +    assert(dataset.select($(predictionCol)).distinct().count() > 1,
    +      "Number of clusters must be greater than one.")
    +
    +    $(metricName) match {
    +      case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore(
    +        dataset,
    +        $(predictionCol),
    +        $(featuresCol)
    +      )
    +    }
    +  }
    +}
    +
    +
    +@Since("2.3.0")
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  @Since("2.3.0")
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires computing
    + * the distance between each pair of points in the dataset. Since the computation
    + * of the distance measure takes `D` operations, where `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the total distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} ) =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{j}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=1, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the average distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute: the constant `$\xi_{X}$` for each point `X`; the
    + * constants `$\Psi_{\Gamma}$`, `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcasted variables,
    + * because we can assume that the clusters are limited in number and
    + * anyway they are much fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (!kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes the aggregated values
    +   * about a cluster which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `$\Psi_{\Gamma}$` and `$Y_{\Gamma}$` for a cluster).
    +   */
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  /**
    +   * It computes the Silhouette coefficient for a point.
    +   *
    +   * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    +   * @param features The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    +   * @param clusterId The id of the cluster the current point belongs to.
    +   * @param squaredNorm The `$\Xi_{X}$` (which is the squared norm) precomputed for the point.
    +   * @return The Silhouette for the point.
    +   */
    +  def computeSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     features: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    // Here we compute the average dissimilarity of the
    +    // current point to any cluster of which the point
    +    // is not a member.
    +    // The cluster with the lowest average dissimilarity
    +    // - i.e. the nearest cluster to the current point -
    +    // is said to be the "neighboring cluster".
    +    var neighboringClusterDissimilarity = Double.MaxValue
    +    broadcastedClustersMap.value.keySet.foreach {
    +      c =>
    +        if (c != clusterId) {
    +          val dissimilarity = compute(squaredNorm, features, broadcastedClustersMap.value(c))
    +          if(dissimilarity < neighboringClusterDissimilarity) {
    +            neighboringClusterDissimilarity = dissimilarity
    +          }
    +        }
    +    }
    +    val currentCluster = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val currentClusterDissimilarity = if (currentCluster.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, features, currentCluster) * currentCluster.numOfPoints /
    +        (currentCluster.numOfPoints - 1)
    +    }
    +
    +    (currentClusterDissimilarity compare neighboringClusterDissimilarity).signum match {
    +      case -1 => 1 - (currentClusterDissimilarity / neighboringClusterDissimilarity)
    +      case 1 => (neighboringClusterDissimilarity / currentClusterDissimilarity) - 1
    +      case 0 => 0.0
    +    }
    +  }
    +
    +  /**
    +   * Compute the mean Silhouette values of all samples.
    +   *
    +   * @param dataset The input dataset (previously clustered) on which compute the Silhouette.
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return The average of the Silhouette values of the clustered data.
    +   */
    +  def computeSilhouetteScore(
    +      dataset: Dataset[_],
    +      predictionCol: String,
    +      featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNormUDF = udf {
    +      features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNormUDF(col(featuresCol)))
    +
    +    // compute aggregate values for clusters needed by the algorithm
    +    val clustersStatsMap = SquaredEuclideanSilhouette
    +      .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
    --- End diff --
    
    this comment has been addressed just one line below. Thanks.
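    
    As a side note on the scaladoc quoted above, a minimal local (non-Spark) sketch of the precomputation it describes could look like the following - hypothetical names, not part of this patch. Once `N`, `$\Psi_{\Gamma}$` and `$Y_{\Gamma}$` are known for a cluster, the average squared Euclidean distance from a point to that cluster costs only `O(D)`:
    
    ```scala
    // Hypothetical standalone sketch of the precomputation trick (not part of the PR).
    case class ClusterPrecomputed(n: Long, psi: Double, y: Array[Double])
    
    // One pass over a cluster's points yields N, Psi (sum of squared norms) and Y (feature sum).
    def precompute(points: Seq[Array[Double]]): ClusterPrecomputed = {
      val d = points.head.length
      val y = new Array[Double](d)
      var psi = 0.0
      points.foreach { p =>
        var j = 0
        while (j < d) { y(j) += p(j); j += 1 }
        psi += p.map(v => v * v).sum
      }
      ClusterPrecomputed(points.size.toLong, psi, y)
    }
    
    // Average squared distance of x to the cluster: xi_X + Psi/N - 2 * (Y dot x) / N, i.e. O(D).
    def avgSquaredDistance(x: Array[Double], c: ClusterPrecomputed): Double = {
      val xi = x.map(v => v * v).sum
      val dot = x.zip(c.y).map { case (a, b) => a * b }.sum
      xi + c.psi / c.n - 2.0 * dot / c.n
    }
    ```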




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133159218
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    +  }
    +
    +  def computeSquaredSilhouette(dataset: Dataset[_],
    +    predictionCol: String,
    +    featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNorm = udf {
    +      features: Vector =>
    +        math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNorm(col(featuresCol)))
    +
    +    // compute aggregate values for clusters
    +    // needed by the algorithm
    +    val clustersStatsMap = SquaredEuclideanSilhouette
    +      .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
    +
    +    val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)
    +
    +    val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
    +      computeSquaredSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Int, _: Double)
    +    )
    +
    +    val squaredSilhouetteDF = dfWithSquaredNorm
    --- End diff --
    
    ```squaredSilhouetteDF``` -> ```silhouetteCoefficientDF```?




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133177805
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    --- End diff --
    
    It's better to have some annotation explaining how we compute the ```Silhouette Coefficient``` with this highly efficient distributed implementation. You can refer to what we did in [LogisticRegression](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L60).




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133875990
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    --- End diff --
    
    Usually we should paste the formula here to explain how we compute the ```Silhouette Coefficient``` with this highly efficient distributed implementation. Since your design document is not a publication, I think we need to move it from there, though you can simplify it.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81463/
    Test PASSed.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137278367
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +
    +    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
    +  }
    +
    --- End diff --
    
    Yeah, I support keeping a consistent result. Otherwise, any real value would be a confusing result. What do you think? Thanks.
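    
    For reference, a sketch of the kind of guard that keeps the result consistent - this mirrors the assert already quoted earlier in this thread, which fails fast instead of returning an arbitrary score:
    
    ```scala
    // Silhouette is not defined when there is a single cluster
    assert(dataset.select($(predictionCol)).distinct().count() > 1,
      "Number of clusters must be greater than one.")
    ```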




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133159306
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    --- End diff --
    
    ```vector``` -> ```features``` ?
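    
    For context, a minimal usage sketch of the evaluator under review; the `KMeans` stage and the column names are illustrative assumptions, not part of the patch:
    
    ```scala
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator
    
    // `dataset` is assumed to be a DataFrame with a vector column named "features"
    val predictions = new KMeans()
      .setK(3)
      .setFeaturesCol("features")
      .setPredictionCol("prediction")
      .fit(dataset)
      .transform(dataset)
    
    val evaluator = new ClusteringEvaluator()
      .setFeaturesCol("features")
      .setPredictionCol("prediction")   // must be an integer column
    
    // squaredSilhouette (the default metric); larger is better
    val silhouette = evaluator.evaluate(predictions)
    ```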



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133370197
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    --- End diff --
    
    What about changing this to:
    ```
    if (clusterSil < minOther) {
      1 - clusterSil / minOther
    } else if (clusterSil > minOther) {
      minOther / clusterSil - 1
    } else {
      0.0
    }
    ```
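    
    For reference, both the original branching and this one are just the standard silhouette definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), with `clusterSil` playing the role of a(i) (the mean dissimilarity to the point's own cluster, corrected to exclude the point itself) and `minOther` the role of b(i) (the smallest mean dissimilarity to any other cluster). A minimal equivalent one-liner, assuming max(clusterSil, minOther) > 0 (otherwise it divides by zero, where the branching correctly yields 0):
    
    ```scala
    // s = (b - a) / max(a, b)
    (minOther - clusterSil) / math.max(clusterSil, minOther)
    ```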



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131100309
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,235 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.0, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.7, 3.2, 1.3, 0.2), 0),
    +      Row(Vectors.dense(4.6, 3.1, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.6, 1.4, 0.2), 0),
    +      Row(Vectors.dense(5.4, 3.9, 1.7, 0.4), 0),
    +      Row(Vectors.dense(4.6, 3.4, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.0, 3.4, 1.5, 0.2), 0),
    +      Row(Vectors.dense(4.4, 2.9, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.4, 3.7, 1.5, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.4, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.0, 1.4, 0.1), 0),
    +      Row(Vectors.dense(4.3, 3.0, 1.1, 0.1), 0),
    +      Row(Vectors.dense(5.8, 4.0, 1.2, 0.2), 0),
    +      Row(Vectors.dense(5.7, 4.4, 1.5, 0.4), 0),
    +      Row(Vectors.dense(5.4, 3.9, 1.3, 0.4), 0),
    +      Row(Vectors.dense(5.1, 3.5, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.7, 3.8, 1.7, 0.3), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.5, 0.3), 0),
    +      Row(Vectors.dense(5.4, 3.4, 1.7, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.7, 1.5, 0.4), 0),
    +      Row(Vectors.dense(4.6, 3.6, 1.0, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.3, 1.7, 0.5), 0),
    +      Row(Vectors.dense(4.8, 3.4, 1.9, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.0, 1.6, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.4, 1.6, 0.4), 0),
    +      Row(Vectors.dense(5.2, 3.5, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.2, 3.4, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.7, 3.2, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.1, 1.6, 0.2), 0),
    +      Row(Vectors.dense(5.4, 3.4, 1.5, 0.4), 0),
    +      Row(Vectors.dense(5.2, 4.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.5, 4.2, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.0, 3.2, 1.2, 0.2), 0),
    +      Row(Vectors.dense(5.5, 3.5, 1.3, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(4.4, 3.0, 1.3, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.4, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.5, 1.3, 0.3), 0),
    +      Row(Vectors.dense(4.5, 2.3, 1.3, 0.3), 0),
    +      Row(Vectors.dense(4.4, 3.2, 1.3, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.5, 1.6, 0.6), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.9, 0.4), 0),
    +      Row(Vectors.dense(4.8, 3.0, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.6, 3.2, 1.4, 0.2), 0),
    +      Row(Vectors.dense(5.3, 3.7, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.3, 1.4, 0.2), 0),
    +      Row(Vectors.dense(7.0, 3.2, 4.7, 1.4), 1),
    +      Row(Vectors.dense(6.4, 3.2, 4.5, 1.5), 1),
    +      Row(Vectors.dense(6.9, 3.1, 4.9, 1.5), 1),
    +      Row(Vectors.dense(5.5, 2.3, 4.0, 1.3), 1),
    +      Row(Vectors.dense(6.5, 2.8, 4.6, 1.5), 1),
    +      Row(Vectors.dense(5.7, 2.8, 4.5, 1.3), 1),
    +      Row(Vectors.dense(6.3, 3.3, 4.7, 1.6), 1),
    +      Row(Vectors.dense(4.9, 2.4, 3.3, 1.0), 1),
    +      Row(Vectors.dense(6.6, 2.9, 4.6, 1.3), 1),
    +      Row(Vectors.dense(5.2, 2.7, 3.9, 1.4), 1),
    +      Row(Vectors.dense(5.0, 2.0, 3.5, 1.0), 1),
    +      Row(Vectors.dense(5.9, 3.0, 4.2, 1.5), 1),
    +      Row(Vectors.dense(6.0, 2.2, 4.0, 1.0), 1),
    +      Row(Vectors.dense(6.1, 2.9, 4.7, 1.4), 1),
    +      Row(Vectors.dense(5.6, 2.9, 3.6, 1.3), 1),
    +      Row(Vectors.dense(6.7, 3.1, 4.4, 1.4), 1),
    +      Row(Vectors.dense(5.6, 3.0, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.8, 2.7, 4.1, 1.0), 1),
    +      Row(Vectors.dense(6.2, 2.2, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.6, 2.5, 3.9, 1.1), 1),
    +      Row(Vectors.dense(5.9, 3.2, 4.8, 1.8), 1),
    +      Row(Vectors.dense(6.1, 2.8, 4.0, 1.3), 1),
    +      Row(Vectors.dense(6.3, 2.5, 4.9, 1.5), 1),
    +      Row(Vectors.dense(6.1, 2.8, 4.7, 1.2), 1),
    +      Row(Vectors.dense(6.4, 2.9, 4.3, 1.3), 1),
    +      Row(Vectors.dense(6.6, 3.0, 4.4, 1.4), 1),
    +      Row(Vectors.dense(6.8, 2.8, 4.8, 1.4), 1),
    +      Row(Vectors.dense(6.7, 3.0, 5.0, 1.7), 1),
    +      Row(Vectors.dense(6.0, 2.9, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.7, 2.6, 3.5, 1.0), 1),
    +      Row(Vectors.dense(5.5, 2.4, 3.8, 1.1), 1),
    +      Row(Vectors.dense(5.5, 2.4, 3.7, 1.0), 1),
    +      Row(Vectors.dense(5.8, 2.7, 3.9, 1.2), 1),
    +      Row(Vectors.dense(6.0, 2.7, 5.1, 1.6), 1),
    +      Row(Vectors.dense(5.4, 3.0, 4.5, 1.5), 1),
    +      Row(Vectors.dense(6.0, 3.4, 4.5, 1.6), 1),
    +      Row(Vectors.dense(6.7, 3.1, 4.7, 1.5), 1),
    +      Row(Vectors.dense(6.3, 2.3, 4.4, 1.3), 1),
    +      Row(Vectors.dense(5.6, 3.0, 4.1, 1.3), 1),
    +      Row(Vectors.dense(5.5, 2.5, 4.0, 1.3), 1),
    +      Row(Vectors.dense(5.5, 2.6, 4.4, 1.2), 1),
    +      Row(Vectors.dense(6.1, 3.0, 4.6, 1.4), 1),
    +      Row(Vectors.dense(5.8, 2.6, 4.0, 1.2), 1),
    +      Row(Vectors.dense(5.0, 2.3, 3.3, 1.0), 1),
    +      Row(Vectors.dense(5.6, 2.7, 4.2, 1.3), 1),
    +      Row(Vectors.dense(5.7, 3.0, 4.2, 1.2), 1),
    +      Row(Vectors.dense(5.7, 2.9, 4.2, 1.3), 1),
    +      Row(Vectors.dense(6.2, 2.9, 4.3, 1.3), 1),
    +      Row(Vectors.dense(5.1, 2.5, 3.0, 1.1), 1),
    +      Row(Vectors.dense(5.7, 2.8, 4.1, 1.3), 1),
    +      Row(Vectors.dense(6.3, 3.3, 6.0, 2.5), 2),
    +      Row(Vectors.dense(5.8, 2.7, 5.1, 1.9), 2),
    +      Row(Vectors.dense(7.1, 3.0, 5.9, 2.1), 2),
    +      Row(Vectors.dense(6.3, 2.9, 5.6, 1.8), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.8, 2.2), 2),
    +      Row(Vectors.dense(7.6, 3.0, 6.6, 2.1), 2),
    +      Row(Vectors.dense(4.9, 2.5, 4.5, 1.7), 2),
    +      Row(Vectors.dense(7.3, 2.9, 6.3, 1.8), 2),
    +      Row(Vectors.dense(6.7, 2.5, 5.8, 1.8), 2),
    +      Row(Vectors.dense(7.2, 3.6, 6.1, 2.5), 2),
    +      Row(Vectors.dense(6.5, 3.2, 5.1, 2.0), 2),
    +      Row(Vectors.dense(6.4, 2.7, 5.3, 1.9), 2),
    +      Row(Vectors.dense(6.8, 3.0, 5.5, 2.1), 2),
    +      Row(Vectors.dense(5.7, 2.5, 5.0, 2.0), 2),
    +      Row(Vectors.dense(5.8, 2.8, 5.1, 2.4), 2),
    +      Row(Vectors.dense(6.4, 3.2, 5.3, 2.3), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.5, 1.8), 2),
    +      Row(Vectors.dense(7.7, 3.8, 6.7, 2.2), 2),
    +      Row(Vectors.dense(7.7, 2.6, 6.9, 2.3), 2),
    +      Row(Vectors.dense(6.0, 2.2, 5.0, 1.5), 2),
    +      Row(Vectors.dense(6.9, 3.2, 5.7, 2.3), 2),
    +      Row(Vectors.dense(5.6, 2.8, 4.9, 2.0), 2),
    +      Row(Vectors.dense(7.7, 2.8, 6.7, 2.0), 2),
    +      Row(Vectors.dense(6.3, 2.7, 4.9, 1.8), 2),
    +      Row(Vectors.dense(6.7, 3.3, 5.7, 2.1), 2),
    +      Row(Vectors.dense(7.2, 3.2, 6.0, 1.8), 2),
    +      Row(Vectors.dense(6.2, 2.8, 4.8, 1.8), 2),
    +      Row(Vectors.dense(6.1, 3.0, 4.9, 1.8), 2),
    +      Row(Vectors.dense(6.4, 2.8, 5.6, 2.1), 2),
    +      Row(Vectors.dense(7.2, 3.0, 5.8, 1.6), 2),
    +      Row(Vectors.dense(7.4, 2.8, 6.1, 1.9), 2),
    +      Row(Vectors.dense(7.9, 3.8, 6.4, 2.0), 2),
    +      Row(Vectors.dense(6.4, 2.8, 5.6, 2.2), 2),
    +      Row(Vectors.dense(6.3, 2.8, 5.1, 1.5), 2),
    +      Row(Vectors.dense(6.1, 2.6, 5.6, 1.4), 2),
    +      Row(Vectors.dense(7.7, 3.0, 6.1, 2.3), 2),
    +      Row(Vectors.dense(6.3, 3.4, 5.6, 2.4), 2),
    +      Row(Vectors.dense(6.4, 3.1, 5.5, 1.8), 2),
    +      Row(Vectors.dense(6.0, 3.0, 4.8, 1.8), 2),
    +      Row(Vectors.dense(6.9, 3.1, 5.4, 2.1), 2),
    +      Row(Vectors.dense(6.7, 3.1, 5.6, 2.4), 2),
    +      Row(Vectors.dense(6.9, 3.1, 5.1, 2.3), 2),
    +      Row(Vectors.dense(5.8, 2.7, 5.1, 1.9), 2),
    +      Row(Vectors.dense(6.8, 3.2, 5.9, 2.3), 2),
    +      Row(Vectors.dense(6.7, 3.3, 5.7, 2.5), 2),
    +      Row(Vectors.dense(6.7, 3.0, 5.2, 2.3), 2),
    +      Row(Vectors.dense(6.3, 2.5, 5.0, 1.9), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.2, 2.0), 2),
    +      Row(Vectors.dense(6.2, 3.4, 5.4, 2.3), 2),
    +      Row(Vectors.dense(5.9, 3.0, 5.1, 1.8), 2))
    +
    +  val dsStruct = StructType( Seq(
    +    StructField("point", new VectorUDT, nullable = false),
    +    StructField("label", IntegerType, nullable = false)
    +  ))
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new RegressionEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +      .setMetricName("cosineSilhouette")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  test("squared euclidean Silhouette") {
    --- End diff --
    
    Could you add the Python code that reproduces this result in scikit-learn, like we did in [other algorithms](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala#L236)?



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131121892
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,235 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.0, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.7, 3.2, 1.3, 0.2), 0),
    +      Row(Vectors.dense(4.6, 3.1, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.6, 1.4, 0.2), 0),
    +      Row(Vectors.dense(5.4, 3.9, 1.7, 0.4), 0),
    +      Row(Vectors.dense(4.6, 3.4, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.0, 3.4, 1.5, 0.2), 0),
    +      Row(Vectors.dense(4.4, 2.9, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.4, 3.7, 1.5, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.4, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.0, 1.4, 0.1), 0),
    +      Row(Vectors.dense(4.3, 3.0, 1.1, 0.1), 0),
    +      Row(Vectors.dense(5.8, 4.0, 1.2, 0.2), 0),
    +      Row(Vectors.dense(5.7, 4.4, 1.5, 0.4), 0),
    +      Row(Vectors.dense(5.4, 3.9, 1.3, 0.4), 0),
    +      Row(Vectors.dense(5.1, 3.5, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.7, 3.8, 1.7, 0.3), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.5, 0.3), 0),
    +      Row(Vectors.dense(5.4, 3.4, 1.7, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.7, 1.5, 0.4), 0),
    +      Row(Vectors.dense(4.6, 3.6, 1.0, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.3, 1.7, 0.5), 0),
    +      Row(Vectors.dense(4.8, 3.4, 1.9, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.0, 1.6, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.4, 1.6, 0.4), 0),
    +      Row(Vectors.dense(5.2, 3.5, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.2, 3.4, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.7, 3.2, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.1, 1.6, 0.2), 0),
    +      Row(Vectors.dense(5.4, 3.4, 1.5, 0.4), 0),
    +      Row(Vectors.dense(5.2, 4.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.5, 4.2, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.0, 3.2, 1.2, 0.2), 0),
    +      Row(Vectors.dense(5.5, 3.5, 1.3, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(4.4, 3.0, 1.3, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.4, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.5, 1.3, 0.3), 0),
    +      Row(Vectors.dense(4.5, 2.3, 1.3, 0.3), 0),
    +      Row(Vectors.dense(4.4, 3.2, 1.3, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.5, 1.6, 0.6), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.9, 0.4), 0),
    +      Row(Vectors.dense(4.8, 3.0, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.6, 3.2, 1.4, 0.2), 0),
    +      Row(Vectors.dense(5.3, 3.7, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.3, 1.4, 0.2), 0),
    +      Row(Vectors.dense(7.0, 3.2, 4.7, 1.4), 1),
    +      Row(Vectors.dense(6.4, 3.2, 4.5, 1.5), 1),
    +      Row(Vectors.dense(6.9, 3.1, 4.9, 1.5), 1),
    +      Row(Vectors.dense(5.5, 2.3, 4.0, 1.3), 1),
    +      Row(Vectors.dense(6.5, 2.8, 4.6, 1.5), 1),
    +      Row(Vectors.dense(5.7, 2.8, 4.5, 1.3), 1),
    +      Row(Vectors.dense(6.3, 3.3, 4.7, 1.6), 1),
    +      Row(Vectors.dense(4.9, 2.4, 3.3, 1.0), 1),
    +      Row(Vectors.dense(6.6, 2.9, 4.6, 1.3), 1),
    +      Row(Vectors.dense(5.2, 2.7, 3.9, 1.4), 1),
    +      Row(Vectors.dense(5.0, 2.0, 3.5, 1.0), 1),
    +      Row(Vectors.dense(5.9, 3.0, 4.2, 1.5), 1),
    +      Row(Vectors.dense(6.0, 2.2, 4.0, 1.0), 1),
    +      Row(Vectors.dense(6.1, 2.9, 4.7, 1.4), 1),
    +      Row(Vectors.dense(5.6, 2.9, 3.6, 1.3), 1),
    +      Row(Vectors.dense(6.7, 3.1, 4.4, 1.4), 1),
    +      Row(Vectors.dense(5.6, 3.0, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.8, 2.7, 4.1, 1.0), 1),
    +      Row(Vectors.dense(6.2, 2.2, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.6, 2.5, 3.9, 1.1), 1),
    +      Row(Vectors.dense(5.9, 3.2, 4.8, 1.8), 1),
    +      Row(Vectors.dense(6.1, 2.8, 4.0, 1.3), 1),
    +      Row(Vectors.dense(6.3, 2.5, 4.9, 1.5), 1),
    +      Row(Vectors.dense(6.1, 2.8, 4.7, 1.2), 1),
    +      Row(Vectors.dense(6.4, 2.9, 4.3, 1.3), 1),
    +      Row(Vectors.dense(6.6, 3.0, 4.4, 1.4), 1),
    +      Row(Vectors.dense(6.8, 2.8, 4.8, 1.4), 1),
    +      Row(Vectors.dense(6.7, 3.0, 5.0, 1.7), 1),
    +      Row(Vectors.dense(6.0, 2.9, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.7, 2.6, 3.5, 1.0), 1),
    +      Row(Vectors.dense(5.5, 2.4, 3.8, 1.1), 1),
    +      Row(Vectors.dense(5.5, 2.4, 3.7, 1.0), 1),
    +      Row(Vectors.dense(5.8, 2.7, 3.9, 1.2), 1),
    +      Row(Vectors.dense(6.0, 2.7, 5.1, 1.6), 1),
    +      Row(Vectors.dense(5.4, 3.0, 4.5, 1.5), 1),
    +      Row(Vectors.dense(6.0, 3.4, 4.5, 1.6), 1),
    +      Row(Vectors.dense(6.7, 3.1, 4.7, 1.5), 1),
    +      Row(Vectors.dense(6.3, 2.3, 4.4, 1.3), 1),
    +      Row(Vectors.dense(5.6, 3.0, 4.1, 1.3), 1),
    +      Row(Vectors.dense(5.5, 2.5, 4.0, 1.3), 1),
    +      Row(Vectors.dense(5.5, 2.6, 4.4, 1.2), 1),
    +      Row(Vectors.dense(6.1, 3.0, 4.6, 1.4), 1),
    +      Row(Vectors.dense(5.8, 2.6, 4.0, 1.2), 1),
    +      Row(Vectors.dense(5.0, 2.3, 3.3, 1.0), 1),
    +      Row(Vectors.dense(5.6, 2.7, 4.2, 1.3), 1),
    +      Row(Vectors.dense(5.7, 3.0, 4.2, 1.2), 1),
    +      Row(Vectors.dense(5.7, 2.9, 4.2, 1.3), 1),
    +      Row(Vectors.dense(6.2, 2.9, 4.3, 1.3), 1),
    +      Row(Vectors.dense(5.1, 2.5, 3.0, 1.1), 1),
    +      Row(Vectors.dense(5.7, 2.8, 4.1, 1.3), 1),
    +      Row(Vectors.dense(6.3, 3.3, 6.0, 2.5), 2),
    +      Row(Vectors.dense(5.8, 2.7, 5.1, 1.9), 2),
    +      Row(Vectors.dense(7.1, 3.0, 5.9, 2.1), 2),
    +      Row(Vectors.dense(6.3, 2.9, 5.6, 1.8), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.8, 2.2), 2),
    +      Row(Vectors.dense(7.6, 3.0, 6.6, 2.1), 2),
    +      Row(Vectors.dense(4.9, 2.5, 4.5, 1.7), 2),
    +      Row(Vectors.dense(7.3, 2.9, 6.3, 1.8), 2),
    +      Row(Vectors.dense(6.7, 2.5, 5.8, 1.8), 2),
    +      Row(Vectors.dense(7.2, 3.6, 6.1, 2.5), 2),
    +      Row(Vectors.dense(6.5, 3.2, 5.1, 2.0), 2),
    +      Row(Vectors.dense(6.4, 2.7, 5.3, 1.9), 2),
    +      Row(Vectors.dense(6.8, 3.0, 5.5, 2.1), 2),
    +      Row(Vectors.dense(5.7, 2.5, 5.0, 2.0), 2),
    +      Row(Vectors.dense(5.8, 2.8, 5.1, 2.4), 2),
    +      Row(Vectors.dense(6.4, 3.2, 5.3, 2.3), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.5, 1.8), 2),
    +      Row(Vectors.dense(7.7, 3.8, 6.7, 2.2), 2),
    +      Row(Vectors.dense(7.7, 2.6, 6.9, 2.3), 2),
    +      Row(Vectors.dense(6.0, 2.2, 5.0, 1.5), 2),
    +      Row(Vectors.dense(6.9, 3.2, 5.7, 2.3), 2),
    +      Row(Vectors.dense(5.6, 2.8, 4.9, 2.0), 2),
    +      Row(Vectors.dense(7.7, 2.8, 6.7, 2.0), 2),
    +      Row(Vectors.dense(6.3, 2.7, 4.9, 1.8), 2),
    +      Row(Vectors.dense(6.7, 3.3, 5.7, 2.1), 2),
    +      Row(Vectors.dense(7.2, 3.2, 6.0, 1.8), 2),
    +      Row(Vectors.dense(6.2, 2.8, 4.8, 1.8), 2),
    +      Row(Vectors.dense(6.1, 3.0, 4.9, 1.8), 2),
    +      Row(Vectors.dense(6.4, 2.8, 5.6, 2.1), 2),
    +      Row(Vectors.dense(7.2, 3.0, 5.8, 1.6), 2),
    +      Row(Vectors.dense(7.4, 2.8, 6.1, 1.9), 2),
    +      Row(Vectors.dense(7.9, 3.8, 6.4, 2.0), 2),
    +      Row(Vectors.dense(6.4, 2.8, 5.6, 2.2), 2),
    +      Row(Vectors.dense(6.3, 2.8, 5.1, 1.5), 2),
    +      Row(Vectors.dense(6.1, 2.6, 5.6, 1.4), 2),
    +      Row(Vectors.dense(7.7, 3.0, 6.1, 2.3), 2),
    +      Row(Vectors.dense(6.3, 3.4, 5.6, 2.4), 2),
    +      Row(Vectors.dense(6.4, 3.1, 5.5, 1.8), 2),
    +      Row(Vectors.dense(6.0, 3.0, 4.8, 1.8), 2),
    +      Row(Vectors.dense(6.9, 3.1, 5.4, 2.1), 2),
    +      Row(Vectors.dense(6.7, 3.1, 5.6, 2.4), 2),
    +      Row(Vectors.dense(6.9, 3.1, 5.1, 2.3), 2),
    +      Row(Vectors.dense(5.8, 2.7, 5.1, 1.9), 2),
    +      Row(Vectors.dense(6.8, 3.2, 5.9, 2.3), 2),
    +      Row(Vectors.dense(6.7, 3.3, 5.7, 2.5), 2),
    +      Row(Vectors.dense(6.7, 3.0, 5.2, 2.3), 2),
    +      Row(Vectors.dense(6.3, 2.5, 5.0, 1.9), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.2, 2.0), 2),
    +      Row(Vectors.dense(6.2, 3.4, 5.4, 2.3), 2),
    +      Row(Vectors.dense(5.9, 3.0, 5.1, 1.8), 2))
    +
    +  val dsStruct = StructType( Seq(
    +    StructField("point", new VectorUDT, nullable = false),
    +    StructField("label", IntegerType, nullable = false)
    +  ))
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new RegressionEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +      .setMetricName("cosineSilhouette")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  test("squared euclidean Silhouette") {
    --- End diff --
    
    Thanks for the reference, I have added it.



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81666/
    Test PASSed.



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133157575
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    +  }
    +
    +  def computeSquaredSilhouette(dataset: Dataset[_],
    --- End diff --
    
    Ditto, rename ```computeSquaredSilhouette``` to ```computeSilhouetteScore```, which makes it clearer to users that this is the _silhouette score_. Meanwhile, could you add a doc comment for this function like the following?
    ```
    /**
     * Compute the mean Silhouette Coefficient of all samples.
     */
    ```
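    
    So the signature might end up looking roughly like this (name as suggested above; this is only a sketch, the body would stay the same as the current ```computeSquaredSilhouette```):
    
    ```scala
    /**
     * Compute the mean Silhouette Coefficient of all samples.
     */
    def computeSilhouetteScore(
        dataset: Dataset[_],
        predictionCol: String,
        featuresCol: String): Double = {
      // ... body unchanged from computeSquaredSilhouette ...
    }
    ```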



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81287/
    Test PASSed.



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133158750
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    +  }
    +
    +  def computeSquaredSilhouette(dataset: Dataset[_],
    +    predictionCol: String,
    +    featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNorm = udf {
    +      features: Vector =>
    +        math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNorm(col(featuresCol)))
    +
    +    // compute aggregate values for clusters
    +    // needed by the algorithm
    +    val clustersStatsMap = SquaredEuclideanSilhouette
    +      .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
    +
    +    val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)
    +
    +    val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
    --- End diff --
    
    What do you think about renaming ```computeSilhouette``` to ```computeSilhouetteCoefficientUDF```?
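    A minimal sketch of how the renamed registration could look (the tail of the original call is cut off in the diff above, so the exact arguments here are an assumption):
    ```scala
    // Hypothetical: partially apply the coefficient function over the broadcast stats
    // and register it under the more descriptive name.
    val computeSilhouetteCoefficientUDF = dataset.sparkSession.udf.register(
      "computeSilhouetteCoefficientUDF",
      computeSquaredSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Int, _: Double)
    )

    val silhouetteDF = dfWithSquaredNorm.withColumn(
      "silhouetteCoefficient",
      computeSilhouetteCoefficientUDF(col(featuresCol), col(predictionCol), col("squaredNorm"))
    )
    ```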




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131891038
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.{col, count, sum}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(Y: Vector, psi: Double, count: Long)
    +
    +  def computeCsi(vector: Vector): Double = {
    +    var sumOfSquares = 0.0
    +    vector.foreachActive((_, v) => {
    +      sumOfSquares += v * v
    +    })
    +    sumOfSquares
    +  }
    +
    +  def computeYVectorPsiAndCount(
    +      df: DataFrame,
    +      predictionCol: String,
    +      featuresCol: String): DataFrame = {
    +    val Yudaf = new VectorElementWiseSum()
    +    df.groupBy(predictionCol)
    +      .agg(
    +        count("*").alias("count"),
    +        sum("csi").alias("psi"),
    +        Yudaf(col(featuresCol)).alias("y")
    --- End diff --
    
    Please rename ```csi``` to ```squaredNorm```, ```psi``` to ```squaredNormSum``` and ```y``` to ```featureSum```, if I haven't misunderstood. We should use more descriptive names.
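    For instance, the aggregation above would then read as follows (a sketch, assuming the input column currently called ```csi``` is also renamed to ```squaredNorm``` upstream):
    ```scala
    // Same aggregation as in the diff, just with descriptive aliases.
    val featureSumUdaf = new VectorElementWiseSum()
    df.groupBy(predictionCol)
      .agg(
        count("*").alias("count"),
        sum("squaredNorm").alias("squaredNormSum"),
        featureSumUdaf(col(featuresCol)).alias("featureSum")
      )
    ```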




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133176770
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    --- End diff --
    
    ```squaredSilhouette``` -> ```silhouette```? If we support other distances like cosine, the metric name should stay the same; the distance metric should be controlled by a separate param.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @mgaido91 I made another pass and left some comments, mainly about naming and annotations. This looks in good shape now. I'd suggest following the naming used in sklearn, which should be easy to understand for both developers and users. Thanks for this great contribution.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137239906
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    --- End diff --
    
    ```@Since("2.3.0")``` 




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138027427
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    --- End diff --
    
    I'd suggest naming the metric ```silhouette```, since we may add the silhouette for other distances later; we could then add another param like ```distance``` to control that. The ```metricName``` param should not be bound to any particular way of computing the distance. There are lots of other metrics for clustering algorithms, like [these](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics) in sklearn. We would not add all of them to MLlib, but we may add some of them in the future.
    cc @jkbradley @MLnick @WeichenXu123 
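    One possible shape for decoupling the metric from the distance, as a sketch only (the ```distanceMeasure``` param and its allowed values are hypothetical here, not part of this PR):
    ```scala
    val metricName: Param[String] = {
      val allowedParams = ParamValidators.inArray(Array("silhouette"))
      new Param(this, "metricName", "metric name in evaluation (silhouette)", allowedParams)
    }

    // Hypothetical companion param controlling which distance the silhouette uses.
    val distanceMeasure: Param[String] = {
      val allowedValues = ParamValidators.inArray(Array("squaredEuclidean", "cosine"))
      new Param(this, "distanceMeasure",
        "distance measure used in the silhouette computation (squaredEuclidean, cosine)",
        allowedValues)
    }

    setDefault(metricName -> "silhouette", distanceMeasure -> "squaredEuclidean")
    ```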




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133176352
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    --- End diff --
    
    ```SquaredEuclideanSilhouette``` -> ```cluEval```




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133360284
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    --- End diff --
    
    Yeah, I think we can add a new param for the distance metric in the future. As MLlib only supports the _squared Euclidean distance_ for now, we can leave this param out and add an annotation in the API docs to clarify it. You can check MLlib ```KMeans```: there is no param there to set the distance metric either. cc @jkbradley @MLnick @hhbyyh




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137250195
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    --- End diff --
    
    Yes, you are right, thanks.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133368243
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    --- End diff --
    
    It may be better to refer to the wiki and explain your method in `ml-clustering.md`.
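    Something along these lines could go into ```ml-clustering.md``` as a usage example (a sketch only; the data path and parameter values are illustrative):
    ```scala
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    // Cluster the sample data and evaluate the result with the silhouette measure.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    val kmeans = new KMeans().setK(2).setSeed(1L)
    val predictions = kmeans.fit(dataset).transform(dataset)

    // Values close to 1 mean compact and well separated clusters.
    val evaluator = new ClusteringEvaluator()
    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared Euclidean distance = $silhouette")
    ```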




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137226104
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * and the distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute the constant `$\xi_{X}$` for each point `X`
    + * and the constants `$\Psi_{\Gamma}$` and `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcasted variables,
    + * because we can assume that the clusters are limited in number and
    + * anyway they are much fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes the aggregated values
    +   * about a cluster which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `\Psi_{\Gamma}` and `Y_{\Gamma}` for a cluster).
    --- End diff --
    
    ```\Psi_{\Gamma}``` and ```Y_{\Gamma}``` should be surrounded with ```$``` to get the correct mathematical symbols in the generated doc.
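    i.e. something like:
    ```scala
       * @return A [[scala.collection.immutable.Map]] which associates each cluster id
       *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
       *         `$\Psi_{\Gamma}$` and `$Y_{\Gamma}$` for a cluster).
    ```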




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138255937
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,438 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"silhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("silhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (silhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "silhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    --- End diff --
    
    We should support all numeric types for the prediction column, not only IntegerType, e.g.:
    ```
    SchemaUtils.checkNumericType(schema, $(labelCol))
    ```
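    Building on that, a sketch of how the check could be adapted here (the cast is an assumption about how the numeric prediction column would then be consumed downstream):
    ```scala
    import org.apache.spark.sql.types.DoubleType

    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))

    // Normalize the prediction column to double before the per-cluster aggregation.
    val df = dataset.select(
      col($(predictionCol)).cast(DoubleType).as($(predictionCol)),
      col($(featuresCol))
    )
    ```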




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137178833
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
     + * of `i` to any other cluster, of which `i` is not a member.
     + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
     + * (the smaller the value, the better the assignment), while `$b_{i}$` is
     + * a measure of how well `i` has not been assigned to its "neighboring cluster",
     + * i.e. the nearest cluster to `i`.
    + *
     + * Unfortunately, the naive implementation of the algorithm requires computing
     + * the distance between each pair of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    --- End diff --
    
    Ditto, ```x_{i}c_{ij}``` -> ```x_{ij}c_{ij}```.
    BTW, could you also check this issue in the following description? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137178736
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
     + * of `i` to any other cluster, of which `i` is not a member.
     + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
     + * (the smaller the value, the better the assignment), while `$b_{i}$` is
     + * a measure of how well `i` has not been assigned to its "neighboring cluster",
     + * i.e. the nearest cluster to `i`.
    + *
     + * Unfortunately, the naive implementation of the algorithm requires computing
     + * the distance between each pair of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    --- End diff --
    
    ```x_{i}c_{ij}``` -> ```x_{ij}c_{ij}```? Since ```x_{i}``` is a vector and ```c_{ij}``` is a double, here we compute the dot product.
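
    For instance, a quick numeric check of the expansion (toy values, purely illustrative):

    ```scala
    // One point X and one cluster point C_i, with made-up coordinates.
    val x = Array(1.0, 2.0, 3.0)   // x_j, j = 1..D
    val c = Array(0.5, 1.5, 2.5)   // c_ij, j = 1..D

    // Naive squared Euclidean distance: sum_j (x_j - c_ij)^2
    val naive = x.zip(c).map { case (xj, cij) => (xj - cij) * (xj - cij) }.sum

    // Expanded form: sum_j x_j^2 + sum_j c_ij^2 - 2 * sum_j x_j * c_ij
    val xiX = x.map(v => v * v).sum                             // \xi_X
    val xiC = c.map(v => v * v).sum                             // \xi_{C_i}
    val dot = x.zip(c).map { case (xj, cij) => xj * cij }.sum   // the dot product term

    assert(math.abs(naive - (xiX + xiC - 2 * dot)) < 1e-12)
    ```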


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80281/testReport)** for PR 18538 at commit [`cfcb106`](https://github.com/apache/spark/commit/cfcb106788e5ea2b905767ff23825c4e5a9bc1e9).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137224923
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
     + * of `i` to any other cluster, of which `i` is not a member.
     + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
     + * (the smaller the value, the better the assignment), while `$b_{i}$` is
     + * a measure of how well `i` has not been assigned to its "neighboring cluster",
     + * i.e. the nearest cluster to `i`.
    + *
     + * Unfortunately, the naive implementation of the algorithm requires computing
     + * the distance between each pair of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
     + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=1, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * and the distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute the constant `$\xi_{X}$` for each point `X`
    + * and the constants `$\Psi_{\Gamma}$` and `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcasted variables,
    + * because we can assume that the clusters are limited in number and
    + * anyway they are much fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    --- End diff --
    
    Remove blank after ```!```.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137239566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    --- End diff --
    
    Could we add a param ```metricName``` like the other evaluators? It can only support ```silhouette``` currently, but we may add other metrics in the future.
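
    A rough sketch of what that could look like inside ```ClusteringEvaluator``` (assuming ```Param``` and ```ParamValidators``` are imported from ```org.apache.spark.ml.param```; the exact metric name string is open):

    ```scala
    /**
     * param for metric name in evaluation
     * (supports `"silhouette"` (default))
     * @group param
     */
    val metricName: Param[String] = {
      val allowedParams = ParamValidators.inArray(Array("silhouette"))
      new Param(this, "metricName", "metric name in evaluation (silhouette)", allowedParams)
    }

    /** @group getParam */
    def getMetricName: String = $(metricName)

    /** @group setParam */
    def setMetricName(value: String): this.type = set(metricName, value)

    setDefault(metricName -> "silhouette")
    ```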


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137242981
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +
    +    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
    +  }
    +
    --- End diff --
    
    It's better to add another corner case: a single cluster. We should guarantee it outputs a result consistent with sklearn. You can just select one cluster from the iris dataset and test it.
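
    A rough sketch of such a test, assuming the ```irisDataset``` helper used elsewhere in this suite and that the evaluator is expected to fail on a single cluster (sklearn raises an error in that case):

    ```scala
    test("number of clusters must be greater than one") {
      val iris = ClusteringEvaluatorSuite.irisDataset(spark)
      // keep only one of the clusters of the iris dataset
      val singleCluster = iris.filter("label = 0")
      val evaluator = new ClusteringEvaluator()
        .setFeaturesCol("features")
        .setPredictionCol("label")

      // sklearn's silhouette_score errors out on a single label, so evaluating
      // here should fail rather than silently return a score
      intercept[AssertionError] {
        evaluator.evaluate(singleCluster)
      }
    }
    ```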


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138023290
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
     +    // Silhouette is reasonable only when the number of clusters is greater than 1
    +    assert(dataset.select($(predictionCol)).distinct().count() > 1,
    +      "Number of clusters must be greater than one.")
    +
    +    $(metricName) match {
    +      case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore(
    +        dataset,
    +        $(predictionCol),
    +        $(featuresCol)
    +      )
    +    }
    +  }
    +}
    +
    +
    +@Since("2.3.0")
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  @Since("2.3.0")
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
     + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
     + * (the smaller the value, the better the assignment), while `$b_{i}$` is
     + * a measure of how well `i` has not been assigned to its "neighboring cluster",
     + * i.e. the nearest cluster to `i`.
    + *
     + * Unfortunately, the naive implementation of the algorithm requires computing
     + * the distance between each pair of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the total distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} ) =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{j}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
     + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=1, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the average distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    --- End diff --
    
    Like above, ```d(X, C_{i} )^2``` -> ```d(X, C_{i} )```; we reached consensus on this in the last round of discussion.
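
    With that change, the average-distance equation simply restates the terms already defined above:

    ```
    \frac{\sum\limits_{i=1}^N d(X, C_{i})}{N} =
    \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    \xi_{X} + \frac{\Psi_{\Gamma}}{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    ```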


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81639/testReport)** for PR 18538 at commit [`b0b7853`](https://github.com/apache/spark/commit/b0b7853d68c1c79bd49d6e290d3c96fe9e3af6ea).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81666/testReport)** for PR 18538 at commit [`a7c1481`](https://github.com/apache/spark/commit/a7c14818283467276a8f7eaa30b074a0f25237dc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18538


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @mgaido91 I opened [SPARK-21981](https://issues.apache.org/jira/browse/SPARK-21981) for Python API, would you like to work on it? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @yanboliang I addressed them. Thank you very much for your time, help and your great reviews.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80862 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80862/testReport)** for PR 18538 at commit [`a7db896`](https://github.com/apache/spark/commit/a7db8962745bd000da0737018eef4b1680425c90).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136573932
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,395 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    --- End diff --
    
    it is added on the line below, even though GitHub does not mark the comment as outdated. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133372968
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,225 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.0, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.7, 3.2, 1.3, 0.2), 0),
    +      Row(Vectors.dense(4.6, 3.1, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.6, 1.4, 0.2), 0),
    +      Row(Vectors.dense(5.4, 3.9, 1.7, 0.4), 0),
    +      Row(Vectors.dense(4.6, 3.4, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.0, 3.4, 1.5, 0.2), 0),
    +      Row(Vectors.dense(4.4, 2.9, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.4, 3.7, 1.5, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.4, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.0, 1.4, 0.1), 0),
    +      Row(Vectors.dense(4.3, 3.0, 1.1, 0.1), 0),
    +      Row(Vectors.dense(5.8, 4.0, 1.2, 0.2), 0),
    +      Row(Vectors.dense(5.7, 4.4, 1.5, 0.4), 0),
    +      Row(Vectors.dense(5.4, 3.9, 1.3, 0.4), 0),
    +      Row(Vectors.dense(5.1, 3.5, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.7, 3.8, 1.7, 0.3), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.5, 0.3), 0),
    +      Row(Vectors.dense(5.4, 3.4, 1.7, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.7, 1.5, 0.4), 0),
    +      Row(Vectors.dense(4.6, 3.6, 1.0, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.3, 1.7, 0.5), 0),
    +      Row(Vectors.dense(4.8, 3.4, 1.9, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.0, 1.6, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.4, 1.6, 0.4), 0),
    +      Row(Vectors.dense(5.2, 3.5, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.2, 3.4, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.7, 3.2, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.8, 3.1, 1.6, 0.2), 0),
    +      Row(Vectors.dense(5.4, 3.4, 1.5, 0.4), 0),
    +      Row(Vectors.dense(5.2, 4.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.5, 4.2, 1.4, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(5.0, 3.2, 1.2, 0.2), 0),
    +      Row(Vectors.dense(5.5, 3.5, 1.3, 0.2), 0),
    +      Row(Vectors.dense(4.9, 3.1, 1.5, 0.1), 0),
    +      Row(Vectors.dense(4.4, 3.0, 1.3, 0.2), 0),
    +      Row(Vectors.dense(5.1, 3.4, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.5, 1.3, 0.3), 0),
    +      Row(Vectors.dense(4.5, 2.3, 1.3, 0.3), 0),
    +      Row(Vectors.dense(4.4, 3.2, 1.3, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.5, 1.6, 0.6), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.9, 0.4), 0),
    +      Row(Vectors.dense(4.8, 3.0, 1.4, 0.3), 0),
    +      Row(Vectors.dense(5.1, 3.8, 1.6, 0.2), 0),
    +      Row(Vectors.dense(4.6, 3.2, 1.4, 0.2), 0),
    +      Row(Vectors.dense(5.3, 3.7, 1.5, 0.2), 0),
    +      Row(Vectors.dense(5.0, 3.3, 1.4, 0.2), 0),
    +      Row(Vectors.dense(7.0, 3.2, 4.7, 1.4), 1),
    +      Row(Vectors.dense(6.4, 3.2, 4.5, 1.5), 1),
    +      Row(Vectors.dense(6.9, 3.1, 4.9, 1.5), 1),
    +      Row(Vectors.dense(5.5, 2.3, 4.0, 1.3), 1),
    +      Row(Vectors.dense(6.5, 2.8, 4.6, 1.5), 1),
    +      Row(Vectors.dense(5.7, 2.8, 4.5, 1.3), 1),
    +      Row(Vectors.dense(6.3, 3.3, 4.7, 1.6), 1),
    +      Row(Vectors.dense(4.9, 2.4, 3.3, 1.0), 1),
    +      Row(Vectors.dense(6.6, 2.9, 4.6, 1.3), 1),
    +      Row(Vectors.dense(5.2, 2.7, 3.9, 1.4), 1),
    +      Row(Vectors.dense(5.0, 2.0, 3.5, 1.0), 1),
    +      Row(Vectors.dense(5.9, 3.0, 4.2, 1.5), 1),
    +      Row(Vectors.dense(6.0, 2.2, 4.0, 1.0), 1),
    +      Row(Vectors.dense(6.1, 2.9, 4.7, 1.4), 1),
    +      Row(Vectors.dense(5.6, 2.9, 3.6, 1.3), 1),
    +      Row(Vectors.dense(6.7, 3.1, 4.4, 1.4), 1),
    +      Row(Vectors.dense(5.6, 3.0, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.8, 2.7, 4.1, 1.0), 1),
    +      Row(Vectors.dense(6.2, 2.2, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.6, 2.5, 3.9, 1.1), 1),
    +      Row(Vectors.dense(5.9, 3.2, 4.8, 1.8), 1),
    +      Row(Vectors.dense(6.1, 2.8, 4.0, 1.3), 1),
    +      Row(Vectors.dense(6.3, 2.5, 4.9, 1.5), 1),
    +      Row(Vectors.dense(6.1, 2.8, 4.7, 1.2), 1),
    +      Row(Vectors.dense(6.4, 2.9, 4.3, 1.3), 1),
    +      Row(Vectors.dense(6.6, 3.0, 4.4, 1.4), 1),
    +      Row(Vectors.dense(6.8, 2.8, 4.8, 1.4), 1),
    +      Row(Vectors.dense(6.7, 3.0, 5.0, 1.7), 1),
    +      Row(Vectors.dense(6.0, 2.9, 4.5, 1.5), 1),
    +      Row(Vectors.dense(5.7, 2.6, 3.5, 1.0), 1),
    +      Row(Vectors.dense(5.5, 2.4, 3.8, 1.1), 1),
    +      Row(Vectors.dense(5.5, 2.4, 3.7, 1.0), 1),
    +      Row(Vectors.dense(5.8, 2.7, 3.9, 1.2), 1),
    +      Row(Vectors.dense(6.0, 2.7, 5.1, 1.6), 1),
    +      Row(Vectors.dense(5.4, 3.0, 4.5, 1.5), 1),
    +      Row(Vectors.dense(6.0, 3.4, 4.5, 1.6), 1),
    +      Row(Vectors.dense(6.7, 3.1, 4.7, 1.5), 1),
    +      Row(Vectors.dense(6.3, 2.3, 4.4, 1.3), 1),
    +      Row(Vectors.dense(5.6, 3.0, 4.1, 1.3), 1),
    +      Row(Vectors.dense(5.5, 2.5, 4.0, 1.3), 1),
    +      Row(Vectors.dense(5.5, 2.6, 4.4, 1.2), 1),
    +      Row(Vectors.dense(6.1, 3.0, 4.6, 1.4), 1),
    +      Row(Vectors.dense(5.8, 2.6, 4.0, 1.2), 1),
    +      Row(Vectors.dense(5.0, 2.3, 3.3, 1.0), 1),
    +      Row(Vectors.dense(5.6, 2.7, 4.2, 1.3), 1),
    +      Row(Vectors.dense(5.7, 3.0, 4.2, 1.2), 1),
    +      Row(Vectors.dense(5.7, 2.9, 4.2, 1.3), 1),
    +      Row(Vectors.dense(6.2, 2.9, 4.3, 1.3), 1),
    +      Row(Vectors.dense(5.1, 2.5, 3.0, 1.1), 1),
    +      Row(Vectors.dense(5.7, 2.8, 4.1, 1.3), 1),
    +      Row(Vectors.dense(6.3, 3.3, 6.0, 2.5), 2),
    +      Row(Vectors.dense(5.8, 2.7, 5.1, 1.9), 2),
    +      Row(Vectors.dense(7.1, 3.0, 5.9, 2.1), 2),
    +      Row(Vectors.dense(6.3, 2.9, 5.6, 1.8), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.8, 2.2), 2),
    +      Row(Vectors.dense(7.6, 3.0, 6.6, 2.1), 2),
    +      Row(Vectors.dense(4.9, 2.5, 4.5, 1.7), 2),
    +      Row(Vectors.dense(7.3, 2.9, 6.3, 1.8), 2),
    +      Row(Vectors.dense(6.7, 2.5, 5.8, 1.8), 2),
    +      Row(Vectors.dense(7.2, 3.6, 6.1, 2.5), 2),
    +      Row(Vectors.dense(6.5, 3.2, 5.1, 2.0), 2),
    +      Row(Vectors.dense(6.4, 2.7, 5.3, 1.9), 2),
    +      Row(Vectors.dense(6.8, 3.0, 5.5, 2.1), 2),
    +      Row(Vectors.dense(5.7, 2.5, 5.0, 2.0), 2),
    +      Row(Vectors.dense(5.8, 2.8, 5.1, 2.4), 2),
    +      Row(Vectors.dense(6.4, 3.2, 5.3, 2.3), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.5, 1.8), 2),
    +      Row(Vectors.dense(7.7, 3.8, 6.7, 2.2), 2),
    +      Row(Vectors.dense(7.7, 2.6, 6.9, 2.3), 2),
    +      Row(Vectors.dense(6.0, 2.2, 5.0, 1.5), 2),
    +      Row(Vectors.dense(6.9, 3.2, 5.7, 2.3), 2),
    +      Row(Vectors.dense(5.6, 2.8, 4.9, 2.0), 2),
    +      Row(Vectors.dense(7.7, 2.8, 6.7, 2.0), 2),
    +      Row(Vectors.dense(6.3, 2.7, 4.9, 1.8), 2),
    +      Row(Vectors.dense(6.7, 3.3, 5.7, 2.1), 2),
    +      Row(Vectors.dense(7.2, 3.2, 6.0, 1.8), 2),
    +      Row(Vectors.dense(6.2, 2.8, 4.8, 1.8), 2),
    +      Row(Vectors.dense(6.1, 3.0, 4.9, 1.8), 2),
    +      Row(Vectors.dense(6.4, 2.8, 5.6, 2.1), 2),
    +      Row(Vectors.dense(7.2, 3.0, 5.8, 1.6), 2),
    +      Row(Vectors.dense(7.4, 2.8, 6.1, 1.9), 2),
    +      Row(Vectors.dense(7.9, 3.8, 6.4, 2.0), 2),
    +      Row(Vectors.dense(6.4, 2.8, 5.6, 2.2), 2),
    +      Row(Vectors.dense(6.3, 2.8, 5.1, 1.5), 2),
    +      Row(Vectors.dense(6.1, 2.6, 5.6, 1.4), 2),
    +      Row(Vectors.dense(7.7, 3.0, 6.1, 2.3), 2),
    +      Row(Vectors.dense(6.3, 3.4, 5.6, 2.4), 2),
    +      Row(Vectors.dense(6.4, 3.1, 5.5, 1.8), 2),
    +      Row(Vectors.dense(6.0, 3.0, 4.8, 1.8), 2),
    +      Row(Vectors.dense(6.9, 3.1, 5.4, 2.1), 2),
    +      Row(Vectors.dense(6.7, 3.1, 5.6, 2.4), 2),
    +      Row(Vectors.dense(6.9, 3.1, 5.1, 2.3), 2),
    +      Row(Vectors.dense(5.8, 2.7, 5.1, 1.9), 2),
    +      Row(Vectors.dense(6.8, 3.2, 5.9, 2.3), 2),
    +      Row(Vectors.dense(6.7, 3.3, 5.7, 2.5), 2),
    +      Row(Vectors.dense(6.7, 3.0, 5.2, 2.3), 2),
    +      Row(Vectors.dense(6.3, 2.5, 5.0, 1.9), 2),
    +      Row(Vectors.dense(6.5, 3.0, 5.2, 2.0), 2),
    +      Row(Vectors.dense(6.2, 3.4, 5.4, 2.3), 2),
    +      Row(Vectors.dense(5.9, 3.0, 5.1, 1.8), 2))
    +
    +  val dsStruct = StructType( Seq(
    +    StructField("point", new VectorUDT, nullable = false),
    +    StructField("label", IntegerType, nullable = false)
    +  ))
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +      .setMetricName("squaredSilhouette")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +  Use the following python code to load the data and evaluate it using scikit-learn package.
    --- End diff --
    
    you should add the expected output of your python code, refer to [FPGrowthSuite.scala](https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/mllib/src/test/scala/org/apache/spark/mllib/fpm/FPGrowthSuite.scala#L71), and mind the indent
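
    For example, something along these lines (the expected value is taken from running the same sklearn snippet on the full iris dataset):

    ```scala
      /*
        Use the following python code to load the data and evaluate it using the scikit-learn package.

        from sklearn import datasets
        from sklearn.metrics import silhouette_score
        iris = datasets.load_iris()
        round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)

        0.6564679231
      */
    ```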


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136304803
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + * </blockquote>
    --- End diff --
    
    The LaTeX formula should be surrounded by `$$`; change it here and in the other places as follows:
    ```
    <blockquote>
        $$
        s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
        $$
    </blockquote>
    ```




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81666/testReport)** for PR 18538 at commit [`a7c1481`](https://github.com/apache/spark/commit/a7c14818283467276a8f7eaa30b074a0f25237dc).




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81369/
    Test PASSed.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @yanboliang yes, thank you very much.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @mgaido91 I left some minor comments; otherwise, this looks good. Thanks.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136306135
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   s_{i}=\left\{ \begin{tabular}{cc}
    + *   $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
    + *   $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
    + * </blockquote>
    + *
    + * where `a(i)` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `b(i)` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `a(i)` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `b(i)` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `C_{i}` belonging to the cluster `\Gamma` is:
    + *
    + * <blockquote>
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    + * </blockquote>
    + *
    + * where `x_{j}` is the `j`-th dimension of the point `X` and
    + * `c_{ij}` is the `j`-th dimension of the `i`-th point in cluster `\Gamma`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} ,
    + *   with \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + * </blockquote>
    + *
    + * where `\xi_{X}` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `\Psi_{\Gamma}`
    + *
    + * <blockquote>
    + *   sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    --- End diff --
    
    Ditto, there is a syntax error in this LaTeX formula.
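
    For reference, the corrected form of this formula (with the missing backslash restored
    and the $$ delimiters added) would be:
    ```
    <blockquote>
      $$
      \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
      \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
      $$
    </blockquote>
    ```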




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136536168
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val result = BigDecimal(0.6564679231)
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +    val actual = BigDecimal(evaluator.evaluate(iris))
    +      .setScale(10, BigDecimal.RoundingMode.HALF_UP)
    +
    +    assertResult(result)(actual)
    --- End diff --
    
    You can use `A ~== B relTol 1e-10`. No need for `BigDecimal`, I think.
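
    For example, with `import org.apache.spark.ml.util.TestingUtils._` in scope, the check
    could be written as:
    ```
    import org.apache.spark.ml.util.TestingUtils._

    // `evaluator` and `iris` as defined earlier in this test
    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
    ```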




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137310194
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +
    +    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
    +  }
    +
    --- End diff --
    
    yes, I agree. Thanks.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137239772
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    --- End diff --
    
    ```@Since("2.3.0")```
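
    That is, the annotated override would read:
    ```
    @Since("2.3.0")
    override def isLargerBetter: Boolean = true
    ```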




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137261873
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    --- End diff --
    
    @zhengruifeng I think you asked me to remove it; any concern if I add it back? Thanks.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137226738
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * and the distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute the constant `$\xi_{X}$` for each point `X`
    + * and the constants `$\Psi_{\Gamma}$` and `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcasted variables,
    + * because we can assume that the clusters are limited in number and
    + * anyway they are much fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes the aggregated values
    +   * about a cluster which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `\Psi_{\Gamma}` and `Y_{\Gamma}` for a cluster).
    +   */
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  /**
    +   * It computes the Silhouette coefficient for a point.
    +   *
    +   * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    +   * @param features The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    +   * @param clusterId The id of the cluster the current point belongs to.
    +   * @param squaredNorm The `\Xi_{X}` (which is the squared norm) precomputed for the point.
    +   * @return The Silhouette for the point.
    +   */
    +  def computeSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     features: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    // Here we compute the average dissimilarity of the
    +    // current point to any cluster of which the point
    +    // is not a member.
    +    // The cluster with the lowest average dissimilarity
    +    // - i.e. the nearest cluster to the current point -
    +    // is said to be the "neighboring cluster".
    +    var neighboringClusterDissimilarity = Double.MaxValue
    +    broadcastedClustersMap.value.keySet.foreach {
    +      c =>
    +        if (c != clusterId) {
    +          val dissimilarity = compute(squaredNorm, features, broadcastedClustersMap.value(c))
    +          if(dissimilarity < neighboringClusterDissimilarity) {
    +            neighboringClusterDissimilarity = dissimilarity
    +          }
    +        }
    +    }
    +    val currentCluster = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val currentClusterDissimilarity = if (currentCluster.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, features, currentCluster) * currentCluster.numOfPoints /
    +        (currentCluster.numOfPoints - 1)
    +    }
    +
    +    (currentClusterDissimilarity compare neighboringClusterDissimilarity).signum match {
    +      case -1 => 1 - (currentClusterDissimilarity / neighboringClusterDissimilarity)
    +      case 1 => (neighboringClusterDissimilarity / currentClusterDissimilarity) - 1
    +      case 0 => 0.0
    +    }
    +  }
    +
    +  /**
    +   * Compute the mean Silhouette values of all samples.
    +   *
    +   * @param dataset The input dataset (previously clustered) on which compute the Silhouette.
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return The average of the Silhouette values of the clustered data.
    +   */
    +  def computeSilhouetteScore(dataset: Dataset[_],
    --- End diff --
    
    Move ```dataset: Dataset[_],``` to the next line.
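
    That is, the wrapped signature would look roughly like:
    ```
    def computeSilhouetteScore(
        dataset: Dataset[_],
        predictionCol: String,
        featuresCol: String): Double = {
      // method body unchanged
    }
    ```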




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137239933
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    --- End diff --
    
    ```@Since("2.3.0")``` 
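
    That is, the annotated override would read:
    ```
    @Since("2.3.0")
    override def load(path: String): ClusteringEvaluator = super.load(path)
    ```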




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81463/testReport)** for PR 18538 at commit [`7b8149a`](https://github.com/apache/spark/commit/7b8149a3f5fab0f5667b342d76fe3ea1bfc6ce81).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137238642
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the average distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma}}{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute the constant `$\xi_{X}$` for each point `X`
    + * and, for each cluster `$\Gamma$`, the constants `$\Psi_{\Gamma}$` and `N`
    + * and the vector `$Y_{\Gamma}$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcast variables,
    + * because we can assume that the number of clusters is small and,
    + * in any case, much smaller than the number of points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes, for each cluster,
    +   * the aggregated values needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `\Psi_{\Gamma}` and `Y_{\Gamma}` for a cluster).
    +   */
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  /**
    +   * It computes the Silhouette coefficient for a point.
    +   *
    +   * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    +   * @param features The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    +   * @param clusterId The id of the cluster the current point belongs to.
    +   * @param squaredNorm `$\xi_{X}$` (i.e. the squared norm), precomputed for the point.
    +   * @return The Silhouette for the point.
    +   */
    +  def computeSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     features: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    // Here we compute the average dissimilarity of the
    +    // current point to any cluster of which the point
    +    // is not a member.
    +    // The cluster with the lowest average dissimilarity
    +    // - i.e. the nearest cluster to the current point -
    +    // is said to be the "neighboring cluster".
    +    var neighboringClusterDissimilarity = Double.MaxValue
    +    broadcastedClustersMap.value.keySet.foreach {
    +      c =>
    +        if (c != clusterId) {
    +          val dissimilarity = compute(squaredNorm, features, broadcastedClustersMap.value(c))
    +          if(dissimilarity < neighboringClusterDissimilarity) {
    +            neighboringClusterDissimilarity = dissimilarity
    +          }
    +        }
    +    }
    +    val currentCluster = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val currentClusterDissimilarity = if (currentCluster.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, features, currentCluster) * currentCluster.numOfPoints /
    +        (currentCluster.numOfPoints - 1)
    +    }
    +
    +    (currentClusterDissimilarity compare neighboringClusterDissimilarity).signum match {
    +      case -1 => 1 - (currentClusterDissimilarity / neighboringClusterDissimilarity)
    +      case 1 => (neighboringClusterDissimilarity / currentClusterDissimilarity) - 1
    +      case 0 => 0.0
    +    }
    +  }
    +
    +  /**
    +   * Compute the mean Silhouette values of all samples.
    +   *
    +   * @param dataset The input dataset (previously clustered) on which to compute the Silhouette.
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return The average of the Silhouette values of the clustered data.
    +   */
    +  def computeSilhouetteScore(dataset: Dataset[_],
    +      predictionCol: String,
    +      featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNormUDF = udf {
    +      features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNormUDF(col(featuresCol)))
    +
    +    // compute aggregate values for clusters
    +    // needed by the algorithm
    +    val clustersStatsMap = SquaredEuclideanSilhouette
    +      .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
    +
    +    val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)
    --- End diff --
    
    It's better to destroy this broadcast variable explicitly.
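
    A minimal sketch of that suggestion (the broadcast name follows the quoted diff; where exactly the call is placed is an assumption - it only has to happen after the action that materializes the score):

        // ... run the action that computes the mean Silhouette over all points ...

        // once the score has been materialized, the per-cluster statistics are no
        // longer needed, so free them on the driver and the executors explicitly
        // instead of waiting for garbage collection to reclaim them
        bClustersStatsMap.destroy()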




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133372318
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,225 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    --- End diff --
    
    @mgaido91 You can set a seed to control the randomly generated data.
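
    For instance, a sketch of what seed-controlled test data could look like (the centers, the noise and the column names here are made up for illustration; `testImplicits._` is assumed to be in scope as in the suite):

        import scala.util.Random
        import org.apache.spark.ml.linalg.Vectors

        val rng = new Random(42L)  // fixed seed, so the expected Silhouette value is reproducible
        val centers = Seq(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0), Vectors.dense(-10.0, 10.0))
        val data = (0 until 150).map { i =>
          val clusterId = i % centers.size
          val point = Vectors.dense(centers(clusterId).toArray.map(_ + rng.nextGaussian()))
          (point, clusterId)
        }
        val dataset = data.toDF("features", "label")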




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14).




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138255648
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,438 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    --- End diff --
    
    Usually we leave a blank line under ```:: Experimental ::```.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133157195
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    --- End diff --
    
    I'd suggest renaming ```computeSquaredSilhouetteCoefficient``` to ```computeSilhouetteCoefficient```: since this function is already inside ```SquaredEuclideanSilhouette```, it isn't necessary to highlight ```SquaredEuclidean``` in its name. What do you think?




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137275123
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +
    +    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
    +  }
    +
    --- End diff --
    
    Actually sklearn throws an exception in this case. Should we do the same? Thanks.
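
    If we did mirror that behaviour, a minimal sketch could be a precondition on the per-cluster statistics (the exact wording and the place of the check are assumptions, not the PR's code):

        // the Silhouette is undefined for a single cluster: fail fast like scikit-learn does
        require(clustersStatsMap.size > 1,
          s"Number of clusters must be greater than one, but was ${clustersStatsMap.size}.")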




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137180194
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, and `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires computing
    + * the distance between every pair of points in the dataset. Since computing
    + * the distance measure takes `D` operations - where `D` is the number of dimensions
    + * of each point - the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course, this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    --- End diff --
    
    ```the average of the distance of the point``` -> ```the total distance of the point```? Should it be the total distance rather than the average distance?
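
    For reference, restating the formulas that are already in the quoted Scaladoc: the displayed sum is the total, and the average only appears once it is divided by `N`:

        \sum_{i=1}^{N} d(X, C_{i})^2 = N\xi_{X} + \Psi_{\Gamma} - 2\sum_{j=1}^{D} Y_{\Gamma j} x_{j}

        \frac{1}{N}\sum_{i=1}^{N} d(X, C_{i})^2 = \xi_{X} + \frac{\Psi_{\Gamma}}{N} - \frac{2}{N}\sum_{j=1}^{D} Y_{\Gamma j} x_{j}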




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133385546
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,225 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    --- End diff --
    
    Sorry but I can't understand your point. Resources in the test scope are not included in the compiled jars. The same approach is used in the `sql` component for instance, where the test data is in the resources (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data).
    If I generate test data randomly, I first have to perform a clustering on those points, while with this dataset the clustering result is already available. I am not sure this is the best approach. But maybe I am missing something. Can you please clarify this for me?
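
    For example, if we went the resource-file route, a sketch could look like this (the path, the format and the parsing options are illustrative only, not what the PR actually does):

        // read a small, fixed dataset shipped with the test sources
        val irisPath = Thread.currentThread()
          .getContextClassLoader
          .getResource("test-data/iris.csv")
          .toString
        val irisRaw = spark.read
          .option("header", "false")
          .option("inferSchema", "true")
          .csv(irisPath)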




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133958240
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    --- End diff --
    
    Since I'd like to stick with the naming convention in the papers, I added a comment before the variable to explain the definition of "neighboring cluster"; this way I think it is very clear what this variable means and contains. Is that OK for you?
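
    As a toy illustration of that definition (the numbers are made up): for a point assigned to cluster 0, the neighboring cluster is the closest cluster the point does not belong to:

        // average dissimilarities of one point (assigned to cluster 0) to every cluster
        val avgDissimilarities = Map(0 -> 2.0, 1 -> 5.0, 2 -> 3.5)
        val neighboringClusterDissimilarity =
          avgDissimilarities.filter { case (id, _) => id != 0 }.values.min  // 3.5, i.e. cluster 2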




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133372183
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    +  }
    +
    +  def computeSquaredSilhouette(dataset: Dataset[_],
    +    predictionCol: String,
    +    featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNorm = udf {
    +      features: Vector =>
    +        math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNorm(col(featuresCol)))
    +
    +    // compute aggregate values for clusters
    +    // needed by the algorithm
    +    val clustersStatsMap = SquaredEuclideanSilhouette
    +      .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
    +
    +    val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)
    +
    +    val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
    +      computeSquaredSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Int, _: Double)
    +    )
    +
    +    val squaredSilhouetteDF = dfWithSquaredNorm
    --- End diff --
    
    Use `select(avg(computeSilhouette(...)))`
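
    In other words (a sketch only, reusing the names from the quoted diff), the intermediate column can be avoided by aggregating directly:

        val silhouette: Double = dfWithSquaredNorm
          .select(avg(computeSilhouette(col(featuresCol), col(predictionCol), col("squaredNorm"))))
          .first()
          .getDouble(0)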




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133185265
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,225 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    --- End diff --
    
    Unfortunately `KMeansSuite` and `GaussianMixtureSuite` use randomly generated data, so it is not possible to know in advance what the Silhouette value should be. What if I move the data to a resource file and read it from there?
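    
    For example, a minimal sketch of that idea (the resource file name, schema and column names here are assumptions, and `spark` comes from `MLlibTestSparkContext`):
    
    ```scala
    // Sketch only: read the labelled points from a test resource file
    // instead of hard-coding them in the suite.
    val irisPath = getClass.getResource("/iris.csv").toString
    val irisDataset = spark.read
      .option("inferSchema", "true")
      .csv(irisPath)
      .toDF("f1", "f2", "f3", "f4", "prediction")
    // (the four feature columns would then be assembled into a single vector column)
    ```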


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136311477
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   s_{i}=\left\{ \begin{tabular}{cc}
    + *   $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
    + *   $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
    --- End diff --
    
    Thanks @yanboliang, could you please tell me how to check the generated doc? Thank you!


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137261509
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    --- End diff --
    
    No problem at all, I just wanted to know which way to go. I'll add it back then, thanks.


---



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133371918
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, vector, clusterCurrentPoint) * clusterCurrentPoint.numOfPoints /
    +        (clusterCurrentPoint.numOfPoints - 1)
    +    }
    +
    +    var silhouetteCoeff = 0.0
    +    if (clusterSil < minOther) {
    +      silhouetteCoeff = 1 - (clusterSil / minOther)
    +    } else {
    +      if (clusterSil > minOther) {
    +        silhouetteCoeff = (minOther / clusterSil) - 1
    +      }
    +    }
    +    silhouetteCoeff
    +
    +  }
    +
    +  def computeSquaredSilhouette(dataset: Dataset[_],
    +    predictionCol: String,
    +    featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNorm = udf {
    +      features: Vector =>
    +        math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNorm(col(featuresCol)))
    +
    +    // compute aggregate values for clusters
    +    // needed by the algorithm
    +    val clustersStatsMap = SquaredEuclideanSilhouette
    +      .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
    +
    +    val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)
    +
    +    val computeSilhouette = dataset.sparkSession.udf.register("computeSilhouette",
    --- End diff --
    
    Why not follow the same approach used above to create the `squaredNorm` udf?
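    
    For reference, a minimal sketch of that alternative (assuming the same helper and broadcasted map shown in the diff above):
    
    ```scala
    // Sketch only: define the UDF inline, like `squaredNorm` above,
    // instead of registering it on the SparkSession.
    val computeSilhouette = udf { (features: Vector, clusterId: Int, squaredNorm: Double) =>
      computeSquaredSilhouetteCoefficient(bClustersStatsMap, features, clusterId, squaredNorm)
    }
    ```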


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @mgaido91 These are my last comments; the PR should be ready to merge once they are addressed. Thanks for your contribution.


---



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80281/testReport)** for PR 18538 at commit [`cfcb106`](https://github.com/apache/spark/commit/cfcb106788e5ea2b905767ff23825c4e5a9bc1e9).


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @mgaido91 Don't worry, I'll post a follow-up PR for discussion in a few days. Thanks. 


---



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138024385
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    // Silhouette is reasonable only when the number of clusters is grater then 1
    +    assert(dataset.select($(predictionCol)).distinct().count() > 1,
    +      "Number of clusters must be greater than one.")
    +
    +    $(metricName) match {
    +      case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore(
    +        dataset,
    +        $(predictionCol),
    +        $(featuresCol)
    +      )
    +    }
    +  }
    +}
    +
    +
    +@Since("2.3.0")
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  @Since("2.3.0")
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the total distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} ) =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{j}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the average distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute: the constant `$\xi_{X}$` for each point `X`; the
    + * constants `$\Psi_{\Gamma}$`, `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcasted variables,
    + * because we can assume that the clusters are limited in number and
    + * anyway they are much fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (!kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes the aggregated values
    +   * about a cluster which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    --- End diff --
    
    ```the cluster id``` -> ```the predicted cluster id```


---



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81316/
    Test FAILed.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137175816
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of to any other cluster, of which `i` is not a member.
    --- End diff --
    
    ```of to``` -> ```of `i` to```


---



[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Jenkins, test this please.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14).


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138021102
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    // Silhouette is reasonable only when the number of clusters is grater then 1
    +    assert(dataset.select($(predictionCol)).distinct().count() > 1,
    +      "Number of clusters must be greater than one.")
    +
    +    $(metricName) match {
    +      case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore(
    +        dataset,
    +        $(predictionCol),
    +        $(featuresCol)
    +      )
    +    }
    +  }
    +}
    +
    +
    +@Since("2.3.0")
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  @Since("2.3.0")
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    --- End diff --
    
    Remove duplicated ```as```.


---



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133571846
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    --- End diff --
    
    Yes, the idea often crosses my mind.
    Even though there's a claim that [K-Means is for Euclidean distances only](https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric), I often see that people need custom distance computation in practice, so I would like to see KMeans support it.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133961918
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    --- End diff --
    
    Sounds good!
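
    As a side note on the quoted `compute` helper, here is a tiny numeric sanity check of the
    identity it relies on (purely illustrative, not part of the PR; it only uses the public
    `Vectors` API and a hand-rolled dot product):

    ```scala
    import org.apache.spark.ml.linalg.Vectors

    // Cluster Gamma = {(1,0), (3,0)}, query point X = (0,0).
    val x = Vectors.dense(0.0, 0.0)
    val featureSum = Vectors.dense(4.0, 0.0)       // Y_Gamma = (1,0) + (3,0)
    val squaredNormSum = 10.0                      // Psi_Gamma = 1^2 + 3^2
    val numOfPoints = 2L

    val xiX = math.pow(Vectors.norm(x, 2.0), 2.0)  // squared norm of X = 0.0
    val dot = x.toArray.zip(featureSum.toArray).map { case (a, b) => a * b }.sum

    // xi_X + Psi_Gamma / N - 2 * (Y_Gamma . X) / N
    val avgSqDist = xiX + squaredNormSum / numOfPoints - 2 * dot / numOfPoints

    // Direct computation: (d(X,(1,0))^2 + d(X,(3,0))^2) / 2 = (1 + 9) / 2 = 5
    assert(avgSqDist == 5.0)
    ```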




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133747305
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    --- End diff --
    
    I'd go for `neighborClusterDissimilarity`, which is the terminology used in the papers and also on the wiki page. What do you think?
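
    Purely as an illustration of the proposed naming (the final choice is still open), the loop could read:

    ```scala
    // Average dissimilarity of the point to its nearest other cluster,
    // i.e. the "neighboring cluster" in the silhouette terminology.
    var neighborClusterDissimilarity = Double.MaxValue
    broadcastedClustersMap.value.keySet.foreach { c =>
      if (c != clusterId) {
        val dissimilarity = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
        if (dissimilarity < neighborClusterDissimilarity) {
          neighborClusterDissimilarity = dissimilarity
        }
      }
    }
    ```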




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138024573
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    // Silhouette is reasonable only when the number of clusters is greater than one
    +    assert(dataset.select($(predictionCol)).distinct().count() > 1,
    +      "Number of clusters must be greater than one.")
    +
    +    $(metricName) match {
    +      case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore(
    +        dataset,
    +        $(predictionCol),
    +        $(featuresCol)
    +      )
    +    }
    +  }
    +}
    +
    +
    +@Since("2.3.0")
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  @Since("2.3.0")
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires computing
    + * the distance between every pair of points in the dataset. Since computing the
    + * distance between two points takes `D` operations, where `D` is the number of
    + * dimensions of each point, the computational complexity of the algorithm is
    + * `O(N^2^*D)`, where `N` is the cardinality of the dataset. Of course this does
    + * not scale in `N`, which is the critical quantity in a Big Data context.
    + *
    + * The algorithm implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * Under this distance measure, the total distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} ) =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{j}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the average distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute: the constant `$\xi_{X}$` for each point `X`; the
    + * constants `$\Psi_{\Gamma}$`, `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcasted variables,
    + * because we can assume that the clusters are limited in number and,
    + * in any case, far fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (!kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes, for each cluster, the
    +   * aggregated values which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `$\Psi_{\Gamma}$` and `$Y_{\Gamma}$` for a cluster).
    +   */
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  /**
    +   * It computes the Silhouette coefficient for a point.
    +   *
    +   * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    +   * @param features The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    +   * @param clusterId The id of the cluster the current point belongs to.
    +   * @param squaredNorm The `$\xi_{X}$` (which is the squared norm) precomputed for the point.
    +   * @return The Silhouette for the point.
    +   */
    +  def computeSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     features: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    // Here we compute the average dissimilarity of the
    +    // current point to any cluster of which the point
    +    // is not a member.
    +    // The cluster with the lowest average dissimilarity
    +    // - i.e. the nearest cluster to the current point -
    +    // is said to be the "neighboring cluster".
    +    var neighboringClusterDissimilarity = Double.MaxValue
    +    broadcastedClustersMap.value.keySet.foreach {
    +      c =>
    +        if (c != clusterId) {
    +          val dissimilarity = compute(squaredNorm, features, broadcastedClustersMap.value(c))
    +          if (dissimilarity < neighboringClusterDissimilarity) {
    +            neighboringClusterDissimilarity = dissimilarity
    +          }
    +        }
    +    }
    +    val currentCluster = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the point itself from
    +    // the computation of the average dissimilarity
    +    val currentClusterDissimilarity = if (currentCluster.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, features, currentCluster) * currentCluster.numOfPoints /
    +        (currentCluster.numOfPoints - 1)
    +    }
    +
    +    (currentClusterDissimilarity compare neighboringClusterDissimilarity).signum match {
    +      case -1 => 1 - (currentClusterDissimilarity / neighboringClusterDissimilarity)
    +      case 1 => (neighboringClusterDissimilarity / currentClusterDissimilarity) - 1
    +      case 0 => 0.0
    +    }
    +  }
    +
    +  /**
    +   * Compute the mean Silhouette values of all samples.
    +   *
    +   * @param dataset The input dataset (previously clustered) on which to compute the Silhouette.
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    --- End diff --
    
    Ditto.
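
    For reference, since this diff now contains the full evaluator, a minimal usage sketch of the
    API as proposed here (the dataset, column names and KMeans settings below are illustrative only):

    ```scala
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    // `dataset` is assumed to have a "features" vector column.
    val model = new KMeans().setK(3).setSeed(1L).fit(dataset)
    val predictions = model.transform(dataset)   // adds an integer "prediction" column

    val evaluator = new ClusteringEvaluator()
      .setFeaturesCol("features")
      .setPredictionCol("prediction")

    val silhouette = evaluator.evaluate(predictions)   // in [-1, 1], larger is better
    ```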




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136333104
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    --- End diff --
    
    BTW, we have the necessary docs at ```object SquaredEuclideanSilhouette``` to explain the proposed algorithm, so we can remove this. Usually we only refer to public publications.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @gatorsmile Could you help trigger the test job? It seems I can't do it right now. Thanks.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131889868
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.{col, count, sum}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(Y: Vector, psi: Double, count: Long)
    +
    +  def computeCsi(vector: Vector): Double = {
    +    var sumOfSquares = 0.0
    +    vector.foreachActive((_, v) => {
    +      sumOfSquares += v * v
    +    })
    +    sumOfSquares
    +  }
    +
    +  def computeYVectorPsiAndCount(
    +      df: DataFrame,
    +      predictionCol: String,
    +      featuresCol: String): DataFrame = {
    +    val Yudaf = new VectorElementWiseSum()
    +    df.groupBy(predictionCol)
    +      .agg(
    +        count("*").alias("count"),
    +        sum("csi").alias("psi"),
    +        Yudaf(col(featuresCol)).alias("y")
    +      )
    --- End diff --
    
    Aggregate function performance is not ideal for columns of non-primitive types (like the vector type here), so we would still use an RDD-based aggregation. You can refactor this part of the code following [```NaiveBayes```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala#L161) like:
    ```
        import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors}
        import org.apache.spark.sql.functions._
        
        val numFeatures = ...
        val squaredNorm = udf { features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0) }
    
        df.select(col(predictionCol), col(featuresCol))
          .withColumn("squaredNorm", squaredNorm(col(featuresCol)))
          .rdd
          .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) }
          .aggregateByKey[(DenseVector, Double)]((Vectors.zeros(numFeatures).toDense, 0.0))(
          seqOp = {
            case ((featureSum: DenseVector, squaredNormSum: Double), (features, squaredNorm)) =>
              BLAS.axpy(1.0, features, featureSum)
              (featureSum, squaredNormSum + squaredNorm)
          },
          combOp = {
            case ((featureSum1, squaredNormSum1), (featureSum2, squaredNormSum2)) =>
              BLAS.axpy(1.0, featureSum2, featureSum1)
              (featureSum1, squaredNormSum1 + squaredNormSum2)
          }).collect()
    ```
    With this suggestion, you can compute ```csi``` and ```y``` in a single pass over the data, which should be more efficient.
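
    To complete the picture, a sketch on top of the snippet above (assuming the tuple is extended
    with a point count and an Int cluster id, as in the current diff; variable names are
    illustrative): the collected aggregates could then be turned into the broadcast map consumed
    when scoring each point.

    ```scala
    // clusterAggregates: Array[(Int, (DenseVector, Double, Long))] collected from aggregateByKey
    val clustersMap = clusterAggregates.map { case (clusterId, (featureSum, squaredNormSum, count)) =>
      clusterId -> SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, count)
    }.toMap
    val bClustersMap = df.sparkSession.sparkContext.broadcast(clustersMap)
    ```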




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137253446
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    --- End diff --
    
    Sorry if the original comment was left by me. Anyway, I think we should add it: since this class is ```ClusteringEvaluator``` rather than a silhouette-specific evaluator, users should know which metric they are using, and we will support more metrics in the future. Thanks.
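
    For example (sketch only, following the ```metricName``` param pattern already used by the
    other evaluators), callers would then select the metric explicitly:

    ```scala
    val evaluator = new ClusteringEvaluator()
      .setMetricName("squaredSilhouette")   // currently the only supported value
      .setFeaturesCol("features")
      .setPredictionCol("prediction")
    ```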




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r131890318
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/SquaredEuclideanSilhouette.scala ---
    @@ -0,0 +1,115 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{Vector, VectorElementWiseSum}
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions.{col, count, sum}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(Y: Vector, psi: Double, count: Long)
    +
    +  def computeCsi(vector: Vector): Double = {
    --- End diff --
    
    Can we use ```Vectors.norm(vector, 2.0)```? It should be more efficient for both dense and sparse vectors. Actually, we can remove this function entirely if you refactor the code as I suggested below.
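
    Something along these lines (sketch):

    ```scala
    import org.apache.spark.ml.linalg.{Vector, Vectors}

    // Squared L2 norm via the built-in norm, efficient for both dense and sparse vectors.
    def squaredNorm(v: Vector): Double = math.pow(Vectors.norm(v, 2.0), 2.0)
    ```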




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80285/testReport)** for PR 18538 at commit [`923418a`](https://github.com/apache/spark/commit/923418a7139e9cd038882499e7ac0aa544a14858).




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133175365
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    --- End diff --
    
    ```minOther``` -> ```nearestClusterDistance```?




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137240370
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +
    +    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
    --- End diff --
    
    Checking with a tolerance of 1e-5 is good enough.
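    
    For reference, a sketch of the relaxed assertion (same `~==` helper from
    `TestingUtils`, only the tolerance changes):
    
    ```
    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-5)
    ```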




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136305238
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   s_{i}=\left\{ \begin{tabular}{cc}
    + *   $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
    + *   $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
    + * </blockquote>
    + *
    + * where `a(i)` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `b(i)` is the lowest average dissimilarity
    + * of to any other cluster, of which `i` is not a member.
    + * `a(i)` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `b(i)` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `C_{i}` belonging to the cluster `\Gamma` is:
    --- End diff --
    
    `C_{i}` -> `$C_{i}$`, otherwise it can't generate the correct doc.
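    
    For reference, the later revision in this thread wraps these references in
    inline math so the generated doc renders them, e.g.:
    
    ```
    * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    ```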




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81287/testReport)** for PR 18538 at commit [`45d1380`](https://github.com/apache/spark/commit/45d1380574ece58ff63c34ff31af6243aff16c3c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137251472
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    --- End diff --
    
    As above, I am checking for this typo everywhere; thanks for pointing it out.
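    
    For reference, the corrected third term (with `$x_{j}$` in place of `$x_{i}$`),
    as it appears in the later revision in this thread:
    
    ```
    \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    ```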




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137180329
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * and the distance of a point to a cluster can be computed as
    --- End diff --
    
    I think here we should highlight that it's the average distance.
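    
    For reference, the averaged form that the later revision in this thread spells
    out as "the average distance of a point to a cluster":
    
    ```
    \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    ```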




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137180832
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the average of the distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{i}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{i}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{i} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}
    + *   $$
    + * </blockquote>
    + *
    + * and the distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{i}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute the constant `$\xi_{X}$` for each point `X`
    + * and the constants `$\Psi_{\Gamma}$` and `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    --- End diff --
    
    Too many ```and```.




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138025640
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    // Silhouette is reasonable only when the number of clusters is greater than 1
    +    assert(dataset.select($(predictionCol)).distinct().count() > 1,
    --- End diff --
    
    Move this check to L418, to avoid an unnecessary extra computation in the common case (cluster size > 1). See my comment at L418.




[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138025184
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,437 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    // Silhouette is reasonable only when the number of clusters is greater than 1
    +    assert(dataset.select($(predictionCol)).distinct().count() > 1,
    +      "Number of clusters must be greater than one.")
    +
    +    $(metricName) match {
    +      case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore(
    +        dataset,
    +        $(predictionCol),
    +        $(featuresCol)
    +      )
    +    }
    +  }
    +}
    +
    +
    +@Since("2.3.0")
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  @Since("2.3.0")
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster, of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * ie. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires to compute
    + * the distance of each couple of points in the dataset. Since the computation of
    + * the distance measure takes `D` operations - if `D` is the number of dimensions
    + * of each point, the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the total distance of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} ) =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{j}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third element becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=0, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the average distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute: the constant `$\xi_{X}$` for each point `X`; the
    + * constants `$\Psi_{\Gamma}$`, `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcasted variables,
    + * because we can assume that the clusters are limited in number and
    + * anyway they are much fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (!kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes the aggregated values
    +   * about a cluster which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `$\Psi_{\Gamma}$` and `$Y_{\Gamma}$` for a cluster).
    +   */
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  /**
    +   * It computes the Silhouette coefficient for a point.
    +   *
    +   * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    +   * @param features The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    +   * @param clusterId The id of the cluster the current point belongs to.
    +   * @param squaredNorm The `$\Xi_{X}$` (which is the squared norm) precomputed for the point.
    +   * @return The Silhouette for the point.
    +   */
    +  def computeSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     features: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    // Here we compute the average dissimilarity of the
    +    // current point to any cluster of which the point
    +    // is not a member.
    +    // The cluster with the lowest average dissimilarity
    +    // - i.e. the nearest cluster to the current point -
    +    // is said to be the "neighboring cluster".
    +    var neighboringClusterDissimilarity = Double.MaxValue
    +    broadcastedClustersMap.value.keySet.foreach {
    +      c =>
    +        if (c != clusterId) {
    +          val dissimilarity = compute(squaredNorm, features, broadcastedClustersMap.value(c))
    +          if(dissimilarity < neighboringClusterDissimilarity) {
    +            neighboringClusterDissimilarity = dissimilarity
    +          }
    +        }
    +    }
    +    val currentCluster = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val currentClusterDissimilarity = if (currentCluster.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, features, currentCluster) * currentCluster.numOfPoints /
    +        (currentCluster.numOfPoints - 1)
    +    }
    +
    +    (currentClusterDissimilarity compare neighboringClusterDissimilarity).signum match {
    +      case -1 => 1 - (currentClusterDissimilarity / neighboringClusterDissimilarity)
    +      case 1 => (neighboringClusterDissimilarity / currentClusterDissimilarity) - 1
    +      case 0 => 0.0
    +    }
    +  }
    +
    +  /**
    +   * Compute the mean Silhouette values of all samples.
    +   *
    +   * @param dataset The input dataset (previously clustered) on which compute the Silhouette.
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return The average of the Silhouette values of the clustered data.
    +   */
    +  def computeSilhouetteScore(
    +      dataset: Dataset[_],
    +      predictionCol: String,
    +      featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNormUDF = udf {
    +      features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNormUDF(col(featuresCol)))
    +
    +    // compute aggregate values for clusters needed by the algorithm
    +    val clustersStatsMap = SquaredEuclideanSilhouette
    +      .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol)
    --- End diff --
    
    We can check whether the number of clusters is greater than 1 here, to avoid unnecessary computation.
    ```
    assert(clustersStatsMap.size != 1, "...")
    ```
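
    A minimal sketch of how that guard could look right after the cluster stats are computed (whether to use `assert` or `require`, and the exact message, are open choices here):

    ```
    // sketch only: the silhouette is undefined when all points fall into a
    // single cluster, so fail fast before broadcasting the statistics
    require(clustersStatsMap.size > 1,
      "Number of clusters must be greater than one.")
    ```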


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136373078
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   s_{i}=\left\{ \begin{tabular}{cc}
    + *   $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
    + *   $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
    --- End diff --
    
    thank you! You're always nice. Just fixed everything, thanks.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133185455
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    --- End diff --
    
    I forgot to remove this line, I am doing it.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81463/testReport)** for PR 18538 at commit [`7b8149a`](https://github.com/apache/spark/commit/7b8149a3f5fab0f5667b342d76fe3ea1bfc6ce81).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133178531
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,225 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    --- End diff --
    
    It's good to have this to verify the correctness of your implementation, but usually we don't hard-code so much data for tests. Could you try to find existing data in ```KMeansSuite``` or ```GaussianMixtureSuite``` for testing? If hard-coded data is necessary, please try to use a small dataset.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133175278
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    +    for(c <- broadcastedClustersMap.value.keySet) {
    +      if (c != clusterId) {
    +        val sil = compute(squaredNorm, vector, broadcastedClustersMap.value(c))
    +        if(sil < minOther) {
    +          minOther = sil
    +        }
    +      }
    +    }
    +    val clusterCurrentPoint = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val clusterSil = if (clusterCurrentPoint.numOfPoints == 1) {
    --- End diff --
    
    ```clusterSil``` -> ```intraClusterDistance```?


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    I'm merging this into master, thanks all. If anyone has more comments, we can address them in follow-up PRs.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    @jkbradley @mgaido91 I just sent #19648 to move test data to data/mllib, please feel free to review it. Thanks.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80860/testReport)** for PR 18538 at commit [`a4ca3cd`](https://github.com/apache/spark/commit/a4ca3cd18852abc8076905a586c6b0f4b622cff6).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137249408
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    --- End diff --
    
    At the beginning it was like that, but in earlier comments I was asked to remove it, since it is useless at the moment.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133876325
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    +      case "squaredSilhouette" =>
    +        SquaredEuclideanSilhouette.computeSquaredSilhouette(
    +          dataset,
    +          $(predictionCol),
    +          $(featuresCol)
    +        )
    +    }
    +    metric
    +  }
    +
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  def computeSquaredSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     vector: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    var minOther = Double.MaxValue
    --- End diff --
    
    I'd suggest ```nearestNeighborClusterDissimilarity``` or ```nearestClusterDissimilarity```, as we should highlight that this is the _nearest_ one.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133360674
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,225 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    +
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  val dataset = Seq(Row(Vectors.dense(5.1, 3.5, 1.4, 0.2), 0),
    --- End diff --
    
    I think we can't put the test data in a resource file, as resource files will be packaged into the final jar. What about randomly generating some small data in Python and hard-coding it here, just like what we did in [```GaussianMixtureSuite```](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala#L195)?
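
    For illustration only, a hard-coded dataset could look roughly like the sketch below (the vectors and labels are placeholders, not values generated in Python; it assumes the suite's `import testImplicits._` is in scope):

    ```
    // placeholder values; a real test would hard-code data whose expected
    // silhouette score has been precomputed with scikit-learn
    val smallData = Seq(
      (Vectors.dense(0.0, 0.1), 0),
      (Vectors.dense(0.2, 0.0), 0),
      (Vectors.dense(9.8, 10.1), 1),
      (Vectors.dense(10.2, 9.9), 1)
    ).toDF("features", "prediction")
    ```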


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #81316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    test this please


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136532646
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,395 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    --- End diff --
    
    add `:: Experimental ::` for doc.


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r133368511
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * At the moment, the supported metrics are:
    + *  squaredSilhouette: silhouette measure using the squared Euclidean distance;
    + *  cosineSilhouette: silhouette measure using the cosine distance.
    + *  The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + *   in this document</a>.
    + */
    +@Experimental
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("SquaredEuclideanSilhouette"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"squaredSilhouette"` (default))
    +   * @group param
    +   */
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (squaredSilhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "squaredSilhouette")
    +
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    val metric: Double = $(metricName) match {
    --- End diff --
    
    If only Euclidean is supported for now, `val metric` and the `match` are not needed here; directly return `SquaredEuclideanSilhouette.computeSquaredSilhouette...`
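
    For example, something along these lines (just a sketch that keeps the existing schema checks):

    ```
    override def evaluate(dataset: Dataset[_]): Double = {
      SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
      SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)

      // no metricName dispatch needed while only the squared Euclidean
      // silhouette is supported
      SquaredEuclideanSilhouette.computeSquaredSilhouette(
        dataset, $(predictionCol), $(featuresCol))
    }
    ```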


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137242127
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala ---
    @@ -0,0 +1,89 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, SparkSession}
    +
    +
    +private[ml] case class ClusteringEvaluationTestData(features: Vector, label: Int)
    +
    +class ClusteringEvaluatorSuite
    +  extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new ClusteringEvaluator)
    +  }
    +
    +  test("read/write") {
    +    val evaluator = new ClusteringEvaluator()
    +      .setPredictionCol("myPrediction")
    +      .setFeaturesCol("myLabel")
    +    testDefaultReadWrite(evaluator)
    +  }
    +
    +  /*
    +    Use the following python code to load the data and evaluate it using scikit-learn package.
    +
    +    from sklearn import datasets
    +    from sklearn.metrics import silhouette_score
    +    iris = datasets.load_iris()
    +    round(silhouette_score(iris.data, iris.target, metric='sqeuclidean'), 10)
    +
    +    0.6564679231
    +  */
    +  test("squared euclidean Silhouette") {
    +    val iris = ClusteringEvaluatorSuite.irisDataset(spark)
    +    val evaluator = new ClusteringEvaluator()
    +        .setFeaturesCol("features")
    +        .setPredictionCol("label")
    +
    +    assert(evaluator.evaluate(iris) ~== 0.6564679231 relTol 1e-10)
    +  }
    +
    +}
    +
    +object ClusteringEvaluatorSuite {
    +  def irisDataset(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +
    +    val irisCsvPath = Thread.currentThread()
    +      .getContextClassLoader
    +      .getResource("test-data/iris.csv")
    +      .toString
    +
    +    spark.sparkContext
    +      .textFile(irisCsvPath)
    +      .map {
    +        line =>
    +          val splits = line.split(",")
    +          ClusteringEvaluationTestData(
    +            Vectors.dense(splits.take(splits.length-1).map(_.toDouble)),
    +            splits(splits.length-1).toInt
    +          )
    +      }
    +      .toDF()
    --- End diff --
    
    Can we store the test data in libsvm format rather than csv? Then we can use ```spark.read.format("libsvm").load(irisPath)``` to load it into a DataFrame with two columns: features and label.
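
    For example (the resource name below is just a placeholder for wherever the libsvm file would live):

    ```
    // hypothetical resource name; the real path depends on where the data is stored
    val irisPath = Thread.currentThread()
      .getContextClassLoader
      .getResource("test-data/iris_libsvm.txt")
      .toString
    // yields a DataFrame with the two columns "label" and "features"
    val irisDF = spark.read.format("libsvm").load(irisPath)
    ```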


---


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137226318
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, and `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires computing
    + * the distance between each pair of points in the dataset. Since the computation
    + * of the distance measure takes `D` operations - where `D` is the number of dimensions
    + * of each point - the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the sum of the squared distances of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{j}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third term becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=1, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute the constant `$\xi_{X}$` for each point `X`
    + * and the constants `$\Psi_{\Gamma}$` and `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcast variables,
    + * because we can assume that the clusters are limited in number and,
    + * in any case, far fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes the aggregated values
    +   * about a cluster which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `\Psi_{\Gamma}` and `Y_{\Gamma}` for a cluster).
    +   */
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  /**
    +   * It computes the Silhouette coefficient for a point.
    +   *
    +   * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    +   * @param features The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    +   * @param clusterId The id of the cluster the current point belongs to.
    +   * @param squaredNorm The `\Xi_{X}` (which is the squared norm) precomputed for the point.
    --- End diff --
    
    Ditto, ```\Xi_{X}``` should be surrounded by ```$```.
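
    For example, the fixed line would presumably read as follows (lowercase `$\xi_{X}$` would also match the symbol used in the derivation above):

    ```
    * @param squaredNorm The `$\Xi_{X}$` (which is the squared norm) precomputed for the point.
    ```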


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r137226969
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,396 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + *   $$
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   $$
    + *   s_{i}= \begin{cases}
    + *   1-\frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
    + *   \frac{b_{i}}{a_{i}}-1 & \text{if } a_{i} \gt b_{i} \end{cases}
    + *   $$
    + * </blockquote>
    + *
    + * where `$a_{i}$` is the average dissimilarity of `i` with all other data
    + * within the same cluster, and `$b_{i}$` is the lowest average dissimilarity
    + * of `i` to any other cluster of which `i` is not a member.
    + * `$a_{i}$` can be interpreted as how well `i` is assigned to its cluster
    + * (the smaller the value, the better the assignment), while `$b_{i}$` is
    + * a measure of how well `i` has not been assigned to its "neighboring cluster",
    + * i.e. the nearest cluster to `i`.
    + *
    + * Unfortunately, the naive implementation of the algorithm requires computing
    + * the distance between each pair of points in the dataset. Since the computation
    + * of the distance measure takes `D` operations - where `D` is the number of dimensions
    + * of each point - the computational complexity of the algorithm is `O(N^2^*D)`, where
    + * `N` is the cardinality of the dataset. Of course this is not scalable in `N`,
    + * which is the critical number in a Big Data context.
    + *
    + * The algorithm which is implemented in this object, instead, is an efficient
    + * and parallel implementation of the Silhouette using the squared Euclidean
    + * distance measure.
    + *
    + * With this assumption, the sum of the squared distances of the point `X`
    + * to the points `$C_{i}$` belonging to the cluster `$\Gamma$` is:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N d(X, C_{i} )^2 =
    + *   \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D (x_{j}-c_{ij})^2 \Big)
    + *   = \sum\limits_{i=1}^N \Big( \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{j=1}^D c_{ij}^2 -2\sum\limits_{j=1}^D x_{j}c_{ij} \Big)
    + *   = \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 +
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2
    + *   -2 \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij}
    + *   $$
    + * </blockquote>
    + *
    + * where `$x_{j}$` is the `j`-th dimension of the point `X` and
    + * `$c_{ij}$` is the `j`-th dimension of the `i`-th point in cluster `$\Gamma$`.
    + *
    + * Then, the first term of the equation can be rewritten as:
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}^2 = N \xi_{X} \text{ ,
    + *   with } \xi_{X} = \sum\limits_{j=1}^D x_{j}^2
    + *   $$
    + * </blockquote>
    + *
    + * where `$\xi_{X}$` is fixed for each point and it can be precomputed.
    + *
    + * Moreover, the second term is fixed for each cluster too,
    + * thus we can name it `$\Psi_{\Gamma}$`
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D c_{ij}^2 =
    + *   \sum\limits_{i=1}^N \xi_{C_{i}} = \Psi_{\Gamma}
    + *   $$
    + * </blockquote>
    + *
    + * Last, the third term becomes
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{i=1}^N \sum\limits_{j=1}^D x_{j}c_{ij} =
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * thus defining the vector
    + *
    + * <blockquote>
    + *   $$
    + *   Y_{\Gamma}:Y_{\Gamma j} = \sum\limits_{i=1}^N c_{ij} , j=1, ..., D
    + *   $$
    + * </blockquote>
    + *
    + * which is fixed for each cluster `$\Gamma$`, we have
    + *
    + * <blockquote>
    + *   $$
    + *   \sum\limits_{j=1}^D \Big(\sum\limits_{i=1}^N c_{ij} \Big) x_{j} =
    + *   \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * In this way, the previous equation becomes
    + *
    + * <blockquote>
    + *   $$
    + *   N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}
    + *   $$
    + * </blockquote>
    + *
    + * and the distance of a point to a cluster can be computed as
    + *
    + * <blockquote>
    + *   $$
    + *   \frac{\sum\limits_{i=1}^N d(X, C_{i} )^2}{N} =
    + *   \frac{N\xi_{X} + \Psi_{\Gamma} - 2 \sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N} =
    + *   \xi_{X} + \frac{\Psi_{\Gamma} }{N} - 2 \frac{\sum\limits_{j=1}^D Y_{\Gamma j} x_{j}}{N}
    + *   $$
    + * </blockquote>
    + *
    + * Thus, it is enough to precompute the constant `$\xi_{X}$` for each point `X`
    + * and the constants `$\Psi_{\Gamma}$` and `N` and the vector `$Y_{\Gamma}$` for
    + * each cluster `$\Gamma$`.
    + *
    + * In the implementation, the precomputed values for the clusters
    + * are distributed among the worker nodes via broadcast variables,
    + * because we can assume that the clusters are limited in number and,
    + * in any case, far fewer than the points.
    + *
    + * The main strengths of this algorithm are the low computational complexity
    + * and the intrinsic parallelism. The precomputed information for each point
    + * and for each cluster can be computed with a computational complexity
    + * which is `O(N/W)`, where `N` is the number of points in the dataset and
    + * `W` is the number of worker nodes. After that, every point can be
    + * analyzed independently of the others.
    + *
    + * For every point we need to compute the average distance to all the clusters.
    + * Since the formula above requires `O(D)` operations, this phase has a
    + * computational complexity which is `O(C*D*N/W)` where `C` is the number of
    + * clusters (which we assume quite low), `D` is the number of dimensions,
    + * `N` is the number of points in the dataset and `W` is the number
    + * of worker nodes.
    + */
    +private[evaluation] object SquaredEuclideanSilhouette {
    +
    +  private[this] var kryoRegistrationPerformed: Boolean = false
    +
    +  /**
    +   * This method registers the class
    +   * [[org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette.ClusterStats]]
    +   * for kryo serialization.
    +   *
    +   * @param sc `SparkContext` to be used
    +   */
    +  def registerKryoClasses(sc: SparkContext): Unit = {
    +    if (! kryoRegistrationPerformed) {
    +      sc.getConf.registerKryoClasses(
    +        Array(
    +          classOf[SquaredEuclideanSilhouette.ClusterStats]
    +        )
    +      )
    +      kryoRegistrationPerformed = true
    +    }
    +  }
    +
    +  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
    +
    +  /**
    +   * The method takes the input dataset and computes the aggregated values
    +   * about a cluster which are needed by the algorithm.
    +   *
    +   * @param df The DataFrame which contains the input data
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    +   *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    +   *         `\Psi_{\Gamma}` and `Y_{\Gamma}` for a cluster).
    +   */
    +  def computeClusterStats(
    +    df: DataFrame,
    +    predictionCol: String,
    +    featuresCol: String): Map[Int, ClusterStats] = {
    +    val numFeatures = df.select(col(featuresCol)).first().getAs[Vector](0).size
    +    val clustersStatsRDD = df.select(col(predictionCol), col(featuresCol), col("squaredNorm"))
    +      .rdd
    +      .map { row => (row.getInt(0), (row.getAs[Vector](1), row.getDouble(2))) }
    +      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
    +        seqOp = {
    +          case (
    +              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
    +              (features, squaredNorm)
    +            ) =>
    +            BLAS.axpy(1.0, features, featureSum)
    +            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
    +        },
    +        combOp = {
    +          case (
    +              (featureSum1, squaredNormSum1, numOfPoints1),
    +              (featureSum2, squaredNormSum2, numOfPoints2)
    +            ) =>
    +            BLAS.axpy(1.0, featureSum2, featureSum1)
    +            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
    +        }
    +      )
    +
    +    clustersStatsRDD
    +      .collectAsMap()
    +      .mapValues {
    +        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
    +          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
    +      }
    +      .toMap
    +  }
    +
    +  /**
    +   * It computes the Silhouette coefficient for a point.
    +   *
    +   * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    +   * @param features The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    +   * @param clusterId The id of the cluster the current point belongs to.
    +   * @param squaredNorm The `\Xi_{X}` (which is the squared norm) precomputed for the point.
    +   * @return The Silhouette for the point.
    +   */
    +  def computeSilhouetteCoefficient(
    +     broadcastedClustersMap: Broadcast[Map[Int, ClusterStats]],
    +     features: Vector,
    +     clusterId: Int,
    +     squaredNorm: Double): Double = {
    +
    +    def compute(squaredNorm: Double, point: Vector, clusterStats: ClusterStats): Double = {
    +      val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
    +
    +      squaredNorm +
    +        clusterStats.squaredNormSum / clusterStats.numOfPoints -
    +        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
    +    }
    +
    +    // Here we compute the average dissimilarity of the
    +    // current point to any cluster of which the point
    +    // is not a member.
    +    // The cluster with the lowest average dissimilarity
    +    // - i.e. the nearest cluster to the current point -
    +    // is said to be the "neighboring cluster".
    +    var neighboringClusterDissimilarity = Double.MaxValue
    +    broadcastedClustersMap.value.keySet.foreach {
    +      c =>
    +        if (c != clusterId) {
    +          val dissimilarity = compute(squaredNorm, features, broadcastedClustersMap.value(c))
    +          if(dissimilarity < neighboringClusterDissimilarity) {
    +            neighboringClusterDissimilarity = dissimilarity
    +          }
    +        }
    +    }
    +    val currentCluster = broadcastedClustersMap.value(clusterId)
    +    // adjustment for excluding the node itself from
    +    // the computation of the average dissimilarity
    +    val currentClusterDissimilarity = if (currentCluster.numOfPoints == 1) {
    +      0
    +    } else {
    +      compute(squaredNorm, features, currentCluster) * currentCluster.numOfPoints /
    +        (currentCluster.numOfPoints - 1)
    +    }
    +
    +    (currentClusterDissimilarity compare neighboringClusterDissimilarity).signum match {
    +      case -1 => 1 - (currentClusterDissimilarity / neighboringClusterDissimilarity)
    +      case 1 => (neighboringClusterDissimilarity / currentClusterDissimilarity) - 1
    +      case 0 => 0.0
    +    }
    +  }
    +
    +  /**
    +   * Compute the mean Silhouette values of all samples.
    +   *
    +   * @param dataset The input dataset (previously clustered) on which compute the Silhouette.
    +   * @param predictionCol The name of the column which contains the cluster id for the point.
    +   * @param featuresCol The name of the column which contains the feature vector of the point.
    +   * @return The average of the Silhouette values of the clustered data.
    +   */
    +  def computeSilhouetteScore(dataset: Dataset[_],
    +      predictionCol: String,
    +      featuresCol: String): Double = {
    +    SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext)
    +
    +    val squaredNormUDF = udf {
    +      features: Vector => math.pow(Vectors.norm(features, 2.0), 2.0)
    +    }
    +    val dfWithSquaredNorm = dataset.withColumn("squaredNorm", squaredNormUDF(col(featuresCol)))
    +
    +    // compute aggregate values for clusters
    +    // needed by the algorithm
    --- End diff --
    
    Merge them to a single line.
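
    i.e. something like:

    ```
    // compute aggregate values for clusters needed by the algorithm
    ```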


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136332399
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   s_{i}=\left\{ \begin{tabular}{cc}
    + *   $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
    + *   $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
    --- End diff --
    
    1. Remove ```private[evaluation]``` from ```object SquaredEuclideanSilhouette```. We only generate docs for public APIs; the docs of private APIs are just there to help developers understand the code.
    2. ```cd docs```
    3. Run ```jekyll build```
    4. You will then find the API docs under ```docs/_site/api/scala/index.html```; try searching for ```SquaredEuclideanSilhouette```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r136305819
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,379 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + *
    + * The implementation follows the proposal explained
    + * <a href="https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view">
    + * in this document</a>.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator (val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    SquaredEuclideanSilhouette.computeSilhouetteScore(
    +      dataset,
    +      $(predictionCol),
    +      $(featuresCol)
    +    )
    +  }
    +}
    +
    +
    +object ClusteringEvaluator
    +  extends DefaultParamsReadable[ClusteringEvaluator] {
    +
    +  override def load(path: String): ClusteringEvaluator = super.load(path)
    +
    +}
    +
    +
    +/**
    + * SquaredEuclideanSilhouette computes the average of the
    + * Silhouette over all the data of the dataset, which is
    + * a measure of how appropriately the data have been clustered.
    + *
    + * The Silhouette for each point `i` is defined as:
    + *
    + * <blockquote>
    + *   s_{i} = \frac{b_{i}-a_{i}}{max\{a_{i},b_{i}\}}
    + * </blockquote>
    + *
    + * which can be rewritten as
    + *
    + * <blockquote>
    + *   s_{i}=\left\{ \begin{tabular}{cc}
    + *   $1-\frac{a_{i}}{b_{i}}$ & if $a_{i} \leq b_{i}$ \\
    + *   $\frac{b_{i}}{a_{i}}-1$ & if $a_{i} \gt b_{i}$
    --- End diff --
    
    There is a syntax error in this LaTeX formula; I checked the generated doc and found it doesn't render correctly. You can also paste the formula into http://www.hostmath.com/ to check.
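
    One way to write the piecewise definition so that MathJax renders it (and essentially what a later revision of this PR adopts) is the `\begin{cases}` form:

    ```
    $$
    s_{i} = \begin{cases}
      1 - \frac{a_{i}}{b_{i}} & \text{if } a_{i} \leq b_{i} \\
      \frac{b_{i}}{a_{i}} - 1 & \text{if } a_{i} > b_{i}
    \end{cases}
    $$
    ```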


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138255474
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,438 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"silhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("silhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (silhouette)",
    +      allowedParams
    +    )
    +  }
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getMetricName: String = $(metricName)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setMetricName(value: String): this.type = set(metricName, value)
    +
    +  setDefault(metricName -> "silhouette")
    +
    +  @Since("2.3.0")
    +  override def evaluate(dataset: Dataset[_]): Double = {
    +    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
    +    SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType)
    +
    +    $(metricName) match {
    +      case "silhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore(
    +        dataset,
    +        $(predictionCol),
    +        $(featuresCol)
    +      )
    --- End diff --
    
    Reorg as:
    ```
    $(metricName) match {
          case "squaredSilhouette" =>
            SquaredEuclideanSilhouette.computeSilhouetteScore(
              dataset, $(predictionCol), $(featuresCol))
    }
    ```
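
    For context, a minimal usage sketch of the evaluator as it stands in this PR (the KMeans setup is illustrative; "features" and "prediction" are the default column names):

    ```
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    // `dataset` is assumed to be a DataFrame with a "features" vector column.
    val model = new KMeans().setK(3).setSeed(1L).fit(dataset)
    val predictions = model.transform(dataset)  // adds an integer "prediction" column

    val evaluator = new ClusteringEvaluator()
      .setFeaturesCol("features")
      .setPredictionCol("prediction")
    val silhouette = evaluator.evaluate(predictions)
    ```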


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    **[Test build #80453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80453/testReport)** for PR 18538 at commit [`ffc17f9`](https://github.com/apache/spark/commit/ffc17f929dd86d1e7e73931eac5663bc08b6ba7a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18538#discussion_r138256035
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -0,0 +1,438 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.evaluation
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT}
    +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions.{avg, col, udf}
    +import org.apache.spark.sql.types.IntegerType
    +
    +/**
    + * :: Experimental ::
    + * Evaluator for clustering results.
    + * The metric computes the Silhouette measure
    + * using the squared Euclidean distance.
    + *
    + * The Silhouette is a measure for the validation
    + * of the consistency within clusters. It ranges
    + * between 1 and -1, where a value close to 1
    + * means that the points in a cluster are close
    + * to the other points in the same cluster and
    + * far from the points of the other clusters.
    + */
    +@Experimental
    +@Since("2.3.0")
    +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    +  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("cluEval"))
    +
    +  @Since("2.3.0")
    +  override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap)
    +
    +  @Since("2.3.0")
    +  override def isLargerBetter: Boolean = true
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setPredictionCol(value: String): this.type = set(predictionCol, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +
    +  /**
    +   * param for metric name in evaluation
    +   * (supports `"silhouette"` (default))
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val metricName: Param[String] = {
    +    val allowedParams = ParamValidators.inArray(Array("silhouette"))
    +    new Param(
    +      this,
    +      "metricName",
    +      "metric name in evaluation (silhouette)",
    +      allowedParams
    +    )
    --- End diff --
    
    Reorg as:
    ```
    val metricName: Param[String] = {
        val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
        new Param(
          this, "metricName", "metric name in evaluation (squaredSilhouette)", allowedParams)
    }
    ```
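
    With the validator in place, a caller would then select the metric explicitly, e.g. (sketch, using the metric name suggested here):

    ```
    // "squaredSilhouette" is the only allowed value in this suggestion;
    // any other value would be rejected by the ParamValidators.inArray check.
    val evaluator = new ClusteringEvaluator().setMetricName("squaredSilhouette")
    ```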


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18538
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org