You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/02/16 16:54:41 UTC

[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/20629

    [SPARK-23451][ML] Deprecate KMeans.computeCost

    ## What changes were proposed in this pull request?
    
    Deprecate `KMeans.computeCost` which was introduced as a temp fix and now it is not needed anymore, since we introduced `ClusteringEvaluator`.
    
    ## How was this patch tested?
    
    manual test (deprecation warning displayed)
    Scala
    ```
    ...
    scala> model.computeCost(dataset)
    warning: there was one deprecation warning; re-run with -deprecation for details
    res1: Double = 0.0
    ```
    
    Python
    ```
    >>> import warnings
    >>> warnings.simplefilter('always', DeprecationWarning)
    ...
    >>> model.computeCost(df)
    /Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
      " instead.", DeprecationWarning)
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23451

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20629
    
----
commit 2f79bb2d5c7e29e85a4a7abe63254d392a49fe53
Author: Marco Gaido <ma...@...>
Date:   2018-02-16T16:03:09Z

    [SPARK-23451][ML] Deprecate KMeans.computeCost

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r181471237
  
    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -322,7 +323,11 @@ def computeCost(self, dataset):
             """
             Return the K-means cost (sum of squared distances of points to their nearest center)
             for this model on the given data.
    +
    +        ..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
             """
    +        warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
    --- End diff --
    
    If we do go this path we need to file a follow up JIRA to update Python ClusteringEvaluator.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #92476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92476/testReport)** for PR 20629 at commit [`5a5c31f`](https://github.com/apache/spark/commit/5a5c31fb43b809d2324d27cad1796f8aae8549ce).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #87510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87510/testReport)** for PR 20629 at commit [`2f79bb2`](https://github.com/apache/spark/commit/2f79bb2d5c7e29e85a4a7abe63254d392a49fe53).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/585/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    yes, I agree with you @MLnick. I'd like to ping also @jkbradley and @hhbyyh who drove the PR which introduce `computeCost`.
    
    In order to move the same metric to `ClusteringEvaluator`, I think it is very useful what is done in this ongoing PR #20600, where `DistanceMeasure` is moved to its own file and the `cost` method is added to it. So maybe it can be added there after that PR. What do you think?
    
    PS if you agree, may you please then help reviewing that PR?
    
    Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/937/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #89205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89205/testReport)** for PR 20629 at commit [`ca8c2ec`](https://github.com/apache/spark/commit/ca8c2ece6a82b0ab805dacf3606333f3baa41b32).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    +1 for @mgaido91's plan


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    @holdenk @sethah any more comments on this? As 2.4 release is approaching I think it would be great to have this in, so we can remove the deprecated APIs in 3.0... Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r195153337
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
     
       /**
        * param for metric name in evaluation
    -   * (supports `"silhouette"` (default))
    +   * (supports `"silhouette"` (default), `"kmeansCost"`)
    --- End diff --
    
    Generally speaking I think it would make sense to maintain the fall-back metric until at least Spark 3.0 at which point I think it would make sense to ask on the user and dev lists and see if anyone is hard dependencies on it or if it is safe to remove.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #88307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88307/testReport)** for PR 20629 at commit [`05680ea`](https://github.com/apache/spark/commit/05680ea59495a8a2c5ba15f18f77559b0cc81b98).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88307/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #93016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93016/testReport)** for PR 20629 at commit [`926c353`](https://github.com/apache/spark/commit/926c35309e39b9137f6637a79f64bd22f6da84e0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged to master :) Thank you :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89207/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1562/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    So when you say "second pass over the data" - from looking at this it seems like it would could do this with just a second map to look up the predictions in the already computed cluster centers, not a stage boundary, so that probably wouldn't be all that expensive given how Spark does pipe-lining unless I'm mussing something.
    
    This would mean that we'd have to have people set the cluster centers from their model when they wanted to do that evaluation type but given that the evaluate wouldn't be able to recover the cluster centers from a test that differed from the training set I think that would be reasonable.
    
    That being said its been awhile since I've looked at the evaluator code so I could be coming out of left field.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20629


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #89207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89207/testReport)** for PR 20629 at commit [`9c8cc67`](https://github.com/apache/spark/commit/9c8cc67d03e7f21b15125ad409872f794893924f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    thanks for taking a look at this @MLnick. No, it doesn't, in the sense that it returns a different result: this is the sum of the squared euclidean distance between a point and the centroid of the cluster it is assigned to, while the silhouette metric is the average of the silhouette coefficient. So they are completely different formulas.
    
    The semantic is a bit different too. Silhouette measures both cohesion and separation of the clusters, while `computeCost` as it is measures only cohesion.
    
    Nonetheless, of course both them can be used to evaluate the result of a clustering algorithm, even though the silhouette is much better for this purpose.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Thanks @holdenk @sethah, then I'll update this PR accordingly. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    kindly ping @MLnick  @jkbradley @hhbyyh 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r202425169
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
    @@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
      */
     @Since("0.8.0")
     class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
    -  @Since("2.4.0") val distanceMeasure: String)
    +  @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
    --- End diff --
    
    This constructor was introduced by a previous PR for 2.4, so this was never out (therefore I think we don't need to keep it).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    kindly ping @MLnick @jkbradley @hhbyyh


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    cc @sethah 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2222/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Just want to check - does `computeCost` do the same thing as the silhouette metric?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r181547119
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
     
       /**
        * param for metric name in evaluation
    -   * (supports `"silhouette"` (default))
    +   * (supports `"silhouette"` (default), `"kmeansCost"`)
    --- End diff --
    
    but does it make sense to introduce something which is already considered legacy when introduced?
    
    I think this brigs up again the question: shall we maintain a metric which was introduced only temporary as a fallback due to the lack of better metrics?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    After I merged your other PR this has some minor conflicts so needs an update, but I'd be happy to try and get this in end of next week during my next review session :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #88307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88307/testReport)** for PR 20629 at commit [`05680ea`](https://github.com/apache/spark/commit/05680ea59495a8a2c5ba15f18f77559b0cc81b98).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    kindly ping @holdenk @MLnick


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    ping @sethah? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92476/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r195340437
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
     
       /**
        * param for metric name in evaluation
    -   * (supports `"silhouette"` (default))
    +   * (supports `"silhouette"` (default), `"kmeansCost"`)
    --- End diff --
    
    ok, I agree. Let's go on this way then, thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Sorry I mean putting the metric in evaluator and then also deprecating
    computCost
    On Sun, 18 Feb 2018 at 20:41, Nick Pentreath <ni...@gmail.com>
    wrote:
    
    > Right - so while it’s perhaps a lower quality metric it is different. So I
    > wonder if deprecation is the right approach (vs say putting the within
    > cluster sum squares into ClusteringEvaluator).
    >
    > On Sun, 18 Feb 2018 at 20:35, Marco Gaido <no...@github.com>
    > wrote:
    >
    >> thanks for taking a look at this @MLnick <https://github.com/mlnick>.
    >> No, it doesn't, in the sense that it returns a different result: this is
    >> the sum of the squared euclidean distance between a point and the centroid
    >> of the cluster it is assigned to, while the silhouette metric is the
    >> average of the silhouette coefficient. So they are completely different
    >> formulas.
    >>
    >> The semantic is a bit different too. Silhouette measures both cohesion
    >> and separation of the clusters, while computeCost as it is measures only
    >> cohesion.
    >>
    >> Nonetheless, of course both them can be used to evaluate the result of a
    >> clustering algorithm, even though the silhouette is much better for this
    >> purpose.
    >>
    >> —
    >> You are receiving this because you were mentioned.
    >> Reply to this email directly, view it on GitHub
    >> <https://github.com/apache/spark/pull/20629#issuecomment-366536722>, or mute
    >> the thread
    >> <https://github.com/notifications/unsubscribe-auth/AA_SB3VPksj5f9QN4Zo4v16_15YCsQdsks5tWG2MgaJpZM4SIn8J>
    >> .
    >>
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r181547107
  
    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -322,7 +323,11 @@ def computeCost(self, dataset):
             """
             Return the K-means cost (sum of squared distances of points to their nearest center)
             for this model on the given data.
    +
    +        ..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
             """
    +        warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
    --- End diff --
    
    yes, or I can also update it here once we establish for sure what the new API has to look like, as you prefer.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92475/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Thank you for reviewing @holdenk ☺


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/587/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87510/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #92475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92475/testReport)** for PR 20629 at commit [`be954c6`](https://github.com/apache/spark/commit/be954c64658f311622a16f9aa2c9f717ec9e7f21).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #92477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92477/testReport)** for PR 20629 at commit [`701b98a`](https://github.com/apache/spark/commit/701b98a0e6c4e6f3867bb03c5285ec12f6c08513).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #92476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92476/testReport)** for PR 20629 at commit [`5a5c31f`](https://github.com/apache/spark/commit/5a5c31fb43b809d2324d27cad1796f8aae8549ce).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    @holdenk I am not sure I got 100% what you meant, so I'll try to answer but let me know if I missed something.
    
    The problem of doing 2 passes is related to cluster centers. The API of `ClusteringEvaluator` (as of any `Evaluator`) is very simple: it is has a method which gets a `Dataset` and returns a value. So, unlike the method here - which is part of the `KMeansModel` and it can get the cluster centers from it -, there is no clue about the cluster centers: computing them is easy but it requires a pass on the dataset (this is the extra pass I mentioned).
    
    An alternative to this is adding a `setClusterCenters` method on the `ClusteringEvaluator`, but I am not sure whether this is worth since they are needed only for this metric, while for the others so far (the Silhouette measure) they are useless. Moreover, this metric was introduced explicitly as a temp fix because we were missing any other (better) evaluation metric and it was supposed to be dismissed once a better evaluation metric would have been introduced (please see the related JIRA and PR). So I am not sure that introducing a new method specifically for this metric is a good idea.
    
    What do you think? Were you suggesting this second option?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    kindly ping @holdenk @MLnick 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r202423675
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
    @@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
      */
     @Since("0.8.0")
     class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
    -  @Since("2.4.0") val distanceMeasure: String)
    +  @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
    --- End diff --
    
    Since we changed the constructor here, and since it is not private, we should provide a similar (and deprecated) constructor without training cost which calls this with the default value.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93016/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    ping @sethah :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #92475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92475/testReport)** for PR 20629 at commit [`be954c6`](https://github.com/apache/spark/commit/be954c64658f311622a16f9aa2c9f717ec9e7f21).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Right - so while it’s perhaps a lower quality metric it is different. So I
    wonder if deprecation is the right approach (vs say putting the within
    cluster sum squares into ClusteringEvaluator).
    
    On Sun, 18 Feb 2018 at 20:35, Marco Gaido <no...@github.com> wrote:
    
    > thanks for taking a look at this @MLnick <https://github.com/mlnick>. No,
    > it doesn't, in the sense that it returns a different result: this is the
    > sum of the squared euclidean distance between a point and the centroid of
    > the cluster it is assigned to, while the silhouette metric is the average
    > of the silhouette coefficient. So they are completely different formulas.
    >
    > The semantic is a bit different too. Silhouette measures both cohesion and
    > separation of the clusters, while computeCost as it is measures only
    > cohesion.
    >
    > Nonetheless, of course both them can be used to evaluate the result of a
    > clustering algorithm, even though the silhouette is much better for this
    > purpose.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/20629#issuecomment-366536722>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/AA_SB3VPksj5f9QN4Zo4v16_15YCsQdsks5tWG2MgaJpZM4SIn8J>
    > .
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r181470189
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -64,12 +65,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
     
       /**
        * param for metric name in evaluation
    -   * (supports `"silhouette"` (default))
    +   * (supports `"silhouette"` (default), `"kmeansCost"`)
    --- End diff --
    
    If we want to consider kmeansCost a legacy function lets call it out as such so new people don't start adding a hard dependency to it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #89205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89205/testReport)** for PR 20629 at commit [`ca8c2ec`](https://github.com/apache/spark/commit/ca8c2ece6a82b0ab805dacf3606333f3baa41b32).
     * This patch passes all tests.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89205/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #93016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93016/testReport)** for PR 20629 at commit [`926c353`](https://github.com/apache/spark/commit/926c35309e39b9137f6637a79f64bd22f6da84e0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    sorry @MLnick, what do you think about my previous comment? Any thoughts? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/962/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #89207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89207/testReport)** for PR 20629 at commit [`9c8cc67`](https://github.com/apache/spark/commit/9c8cc67d03e7f21b15125ad409872f794893924f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    @holdenk I am not sure about requiring or not cluster centers for this metric. On one side, since the `ClusteringEvaluator` should be a general interface for all clustering models and some of them don't provide cluster centers, it may be a good idea to compute them if necessary. On the other, does this metric make sense for any model other than KMeans? And computing the centers of the test dataset would lead to different results than the old API we are replacing. So I am not sure it is the right thing to do.
    
    Honestly, the more we go on the more my feeling is that we don't really need to move that metric here. We can just deprecate it saying that there are better metrics for evaluating a clustering available in the `ClusteringEvaluator` (namely the silhouette). In these way people can move away from using this metric.
    
    Moreover, sklearn - which is one of the most widespread tool - doesn't offer the ability of computing such a cost (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). The only thing sklearn offers is what it calls `inertia` (https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265), ie. the cost computed on the training set.
    
    So, I think the best option would be to follow what sklearn does:
    
     1 - Introducing in the `KMeansSummary` (or `KMeansModel` if you prefer) the cost attribute on the training set
     2 - deprecate this method redirecting to `ClusteringEvaluator` for better metrics and/or to the cost attribute introduced
    
    What do you think?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #92477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92477/testReport)** for PR 20629 at commit [`701b98a`](https://github.com/apache/spark/commit/701b98a0e6c4e6f3867bb03c5285ec12f6c08513).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    **[Test build #87510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87510/testReport)** for PR 20629 at commit [`2f79bb2`](https://github.com/apache/spark/commit/2f79bb2d5c7e29e85a4a7abe63254d392a49fe53).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r202436972
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
    @@ -37,7 +37,7 @@ import org.apache.spark.sql.{Row, SparkSession}
      */
     @Since("0.8.0")
     class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
    -  @Since("2.4.0") val distanceMeasure: String)
    +  @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
    --- End diff --
    
    That's awesome, lets try and get that fixed up before 2.4 goes so we don't have to any workarounds :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20629#discussion_r181470948
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala ---
    @@ -84,12 +85,12 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
     
       /**
        * param for distance measure to be used in evaluation
    -   * (supports `"squaredEuclidean"` (default), `"cosine"`)
    +   * (supports `"squaredEuclidean"` (default), `"cosine"`, `"euclidean"`)
    --- End diff --
    
    If some models only support some ditance measures we should make that clear in the docs.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92477/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    @MLnick I checked and adding the `computeCosts` to `ClusteringEvaluator` has a small drawback: we have to compute the centers for each cluster and then we can compute the costs, which involved two passes on the dataset.
    
    Since this and that this evaluation looks not very useful in practice, is it worth according to you to add it nonetheless?
    
    Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2220/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20629
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org