Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2017/11/06 15:52:58 UTC

[GitHub] spark pull request #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluato...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/19676

    [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to examples

    ## What changes were proposed in this pull request?
    
    SPARK-14516 introduced ClusteringEvaluator, but the documentation never referenced it, and the examples still relied on the sum of squared errors to show how to evaluate a clustering model.
    
    This PR adds ClusteringEvaluator to the examples.
    
    ## How was this patch tested?
    
    Manual runs of the examples.

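For readers unfamiliar with the metric the examples now print, here is a minimal plain-Java sketch of the silhouette coefficient with squared Euclidean distance. It does not use Spark and is not how ClusteringEvaluator is implemented; the class name SilhouetteSketch, its method names, and the toy data are assumptions made for illustration only. Giving a singleton cluster a score of 0 is a common convention, also assumed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Spark.
public class SilhouetteSketch {

  // Squared Euclidean distance between two points of equal dimension.
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return sum;
  }

  // Mean silhouette over all points, by brute force (O(n^2) distance computations).
  static double silhouette(double[][] points, int[] labels) {
    int n = points.length;
    Map<Integer, Integer> sizes = new HashMap<>();
    for (int label : labels) {
      sizes.merge(label, 1, Integer::sum);
    }
    if (sizes.size() < 2) {
      throw new IllegalArgumentException("Silhouette needs at least two clusters");
    }

    double total = 0.0;
    for (int i = 0; i < n; i++) {
      int own = labels[i];
      int ownSize = sizes.get(own);
      if (ownSize == 1) {
        continue; // assumed convention: a singleton cluster contributes 0
      }

      // Sum of squared distances from point i to the other points, per cluster.
      Map<Integer, Double> sums = new HashMap<>();
      for (int j = 0; j < n; j++) {
        if (j != i) {
          sums.merge(labels[j], squaredDistance(points[i], points[j]), Double::sum);
        }
      }

      // a: mean distance to the other points of the point's own cluster.
      double a = sums.getOrDefault(own, 0.0) / (ownSize - 1);

      // b: smallest mean distance to the points of any other cluster.
      double b = Double.POSITIVE_INFINITY;
      for (Map.Entry<Integer, Double> e : sums.entrySet()) {
        int cluster = e.getKey();
        if (cluster != own) {
          b = Math.min(b, e.getValue() / sizes.get(cluster));
        }
      }

      double max = Math.max(a, b);
      total += (max == 0.0) ? 0.0 : (b - a) / max;
    }
    return total / n;
  }

  public static void main(String[] args) {
    // Toy data: two well-separated clusters of two points each.
    double[][] points = {{0, 0}, {0, 1}, {10, 0}, {10, 1}};
    int[] labels = {0, 0, 1, 1};
    System.out.println("Silhouette with squared Euclidean distance = "
        + silhouette(points, labels));
  }
}
```

Values close to 1 indicate compact, well-separated clusters, and values near 0 or below indicate overlapping or misassigned points. Unlike the sum of squared errors, the score is bounded in [-1, 1], so it can be compared across different values of k.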

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-14516_examples

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19676.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19676
    
----
commit 4c4f83e97d9bd2d8771452498581bf9ce43bd28d
Author: Marco Gaido <mg...@hortonworks.com>
Date:   2017-11-06T15:49:17Z

    [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to examples

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    **[Test build #83500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83500/testReport)** for PR 19676 at commit [`4c4f83e`](https://github.com/apache/spark/commit/4c4f83e97d9bd2d8771452498581bf9ce43bd28d).


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    Sorry for pinging you, @yanboliang: what do you think about adding `ClusteringEvaluator` to the examples? Thanks.


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluato...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19676#discussion_r155928871
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java ---
    @@ -51,9 +52,14 @@ public static void main(String[] args) {
         KMeans kmeans = new KMeans().setK(2).setSeed(1L);
         KMeansModel model = kmeans.fit(dataset);
     
    -    // Evaluate clustering by computing Within Set Sum of Squared Errors.
    -    double WSSSE = model.computeCost(dataset);
    -    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
    +    // Make predictions
    +    Dataset<Row> predictions = model.transform(dataset);
    +
    +    // Evaluate clustering by computing Silhouette score
    +    ClusteringEvaluator evaluator = new ClusteringEvaluator();
    +
    +    double silhouette = evaluator.evaluate(predictions);
    +    System.out.println("Silhouette with squared euclidean distance = " + silhouette);
    --- End diff --
    
    euclidean -> Euclidean, but not important to change unless you're touching the code again anyway


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    **[Test build #84681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84681/testReport)** for PR 19676 at commit [`feb619d`](https://github.com/apache/spark/commit/feb619d657f6ff66dec240ee4619e6f53208ac18).


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    Merged to master


---



[GitHub] spark pull request #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluato...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19676


---



[GitHub] spark pull request #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluato...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19676#discussion_r155929522
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java ---
    @@ -51,9 +52,14 @@ public static void main(String[] args) {
         KMeans kmeans = new KMeans().setK(2).setSeed(1L);
         KMeansModel model = kmeans.fit(dataset);
     
    -    // Evaluate clustering by computing Within Set Sum of Squared Errors.
    -    double WSSSE = model.computeCost(dataset);
    -    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
    +    // Make predictions
    +    Dataset<Row> predictions = model.transform(dataset);
    +
    +    // Evaluate clustering by computing Silhouette score
    +    ClusteringEvaluator evaluator = new ClusteringEvaluator();
    +
    +    double silhouette = evaluator.evaluate(predictions);
    +    System.out.println("Silhouette with squared euclidean distance = " + silhouette);
    --- End diff --
    
    Thanks. I don't think I will be changing the code again, but I can fix the capitalization if you want.


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84681/
    Test PASSed.


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    It's good to have this. Sorry for the late response; I will make a pass tomorrow. Thanks.


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluato...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19676#discussion_r155913190
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaKMeansExample.java ---
    @@ -51,9 +52,17 @@ public static void main(String[] args) {
         KMeans kmeans = new KMeans().setK(2).setSeed(1L);
         KMeansModel model = kmeans.fit(dataset);
     
    -    // Evaluate clustering by computing Within Set Sum of Squared Errors.
    -    double WSSSE = model.computeCost(dataset);
    -    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
    +    // Make predictions
    +    Dataset<Row> predictions = model.transform(dataset);
    +
    +    // Evaluate clustering by computing Silhouette score
    +    ClusteringEvaluator evaluator = new ClusteringEvaluator()
    +      .setFeaturesCol("features")
    +      .setPredictionCol("prediction")
    --- End diff --
    
    We use default values here, so it's not necessary to set them explicitly. We should keep examples as simple as possible. Thanks.


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    **[Test build #83500 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83500/testReport)** for PR 19676 at commit [`4c4f83e`](https://github.com/apache/spark/commit/4c4f83e97d9bd2d8771452498581bf9ce43bd28d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83500/
    Test PASSed.


---



[GitHub] spark issue #19676: [SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19676
  
    **[Test build #84681 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84681/testReport)** for PR 19676 at commit [`feb619d`](https://github.com/apache/spark/commit/feb619d657f6ff66dec240ee4619e6f53208ac18).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
