You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/02/16 09:27:11 UTC

[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/20627

    [SPARK-23217][ML][PYTHON] Add distanceMeasure param to ClusteringEvaluator Python API

    ## What changes were proposed in this pull request?
    
    The PR adds the `distanceMeasure` param to ClusteringEvaluator in the Python API. This allows the user to specify `cosine` as distance measure in addition to the default `squaredEuclidean`.
    
    ## How was this patch tested?
    
    added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23217_python

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20627.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20627
    
----
commit 8fe8efaaf0202f804e80b36ec11b43d5aa34d511
Author: Marco Gaido <ma...@...>
Date:   2018-02-16T09:24:45Z

    [SPARK-23217][ML][PYTHON] Add distanceMeasure param to ClusteringEvaluator Python API

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/966/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20627#discussion_r169090021
  
    --- Diff: python/pyspark/ml/evaluation.py ---
    @@ -394,15 +397,30 @@ def getMetricName(self):
         @keyword_only
         @since("2.3.0")
         def setParams(self, predictionCol="prediction", featuresCol="features",
    -                  metricName="silhouette"):
    +                  metricName="silhouette", distanceMeasure="squaredEuclidean"):
    --- End diff --
    
    Just so I'm clear, this doesn't cause any API incompatibility in Python in the way it would for Scala right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87553/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20627#discussion_r169156415
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -541,6 +541,16 @@ def test_java_params(self):
             self.assertEqual(evaluator._java_obj.getMetricName(), "r2")
             self.assertEqual(evaluatorCopy._java_obj.getMetricName(), "mae")
     
    +    def test_clustering_evaluator_with_cosine_distance(self):
    +        featureAndPredictions = map(lambda x: (Vectors.dense(x[0]), x[1]),
    +                                    [([1.0, 1.0], 1.0), ([10.0, 10.0], 1.0), ([1.0, 0.5], 2.0),
    +                                     ([10.0, 4.4], 2.0), ([-1.0, 1.0], 3.0), ([-100.0, 90.0], 3.0)])
    +        dataset = self.spark.createDataFrame(featureAndPredictions, ["features", "prediction"])
    +        evaluator = ClusteringEvaluator(predictionCol="prediction", distanceMeasure="cosine")
    +        self.assertEqual(evaluator.getDistanceMeasure(), "cosine")
    +        self.assertEqual(round(evaluator.evaluate(dataset), 5),  0.99267)
    --- End diff --
    
    it would be better to use `np.allclose` rather than rounding.  Check out some other tests here for the usage


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20627


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/935/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20627#discussion_r169116845
  
    --- Diff: python/pyspark/ml/evaluation.py ---
    @@ -394,15 +397,30 @@ def getMetricName(self):
         @keyword_only
         @since("2.3.0")
         def setParams(self, predictionCol="prediction", featuresCol="features",
    -                  metricName="silhouette"):
    +                  metricName="silhouette", distanceMeasure="squaredEuclidean"):
    --- End diff --
    
    I am not expert enough in Python to answer this. A normal python script is interpreted at runtime, so source code compatibility should be enough, but a `.pyc` file might have the same issue in theory.
    I'd love to hear the answer from someone with more experience on Python.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20627#discussion_r169158234
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -541,6 +541,16 @@ def test_java_params(self):
             self.assertEqual(evaluator._java_obj.getMetricName(), "r2")
             self.assertEqual(evaluatorCopy._java_obj.getMetricName(), "mae")
     
    +    def test_clustering_evaluator_with_cosine_distance(self):
    +        featureAndPredictions = map(lambda x: (Vectors.dense(x[0]), x[1]),
    +                                    [([1.0, 1.0], 1.0), ([10.0, 10.0], 1.0), ([1.0, 0.5], 2.0),
    +                                     ([10.0, 4.4], 2.0), ([-1.0, 1.0], 3.0), ([-100.0, 90.0], 3.0)])
    +        dataset = self.spark.createDataFrame(featureAndPredictions, ["features", "prediction"])
    +        evaluator = ClusteringEvaluator(predictionCol="prediction", distanceMeasure="cosine")
    +        self.assertEqual(evaluator.getDistanceMeasure(), "cosine")
    +        self.assertEqual(round(evaluator.evaluate(dataset), 5),  0.99267)
    +        self.assertEqual(evaluator._java_obj.getDistanceMeasure(), "cosine")
    --- End diff --
    
    I don't think it is great to check the parameter in the java obj too.  If this was done for every param, it would be a ton of check and make it impossible to possibly change this plumbing in the future.
    
    There is an existing test that runs some basic param checks that should be sufficient, but it doesn't look like the evaluation module is part of it, can you add it?  It's in `DefaultValuesTests.test_java_params`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20627#discussion_r169156894
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -52,7 +52,7 @@
     from pyspark.ml.clustering import *
     from pyspark.ml.common import _java2py, _py2java
     from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
    -    MulticlassClassificationEvaluator, RegressionEvaluator
    +    MulticlassClassificationEvaluator, RegressionEvaluator, ClusteringEvaluator
    --- End diff --
    
    fix import ordering


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    LGTM, thanks for making the addition to java param tests @mgaido91 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    **[Test build #87507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87507/testReport)** for PR 20627 at commit [`8fe8efa`](https://github.com/apache/spark/commit/8fe8efaaf0202f804e80b36ec11b43d5aa34d511).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    **[Test build #87507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87507/testReport)** for PR 20627 at commit [`8fe8efa`](https://github.com/apache/spark/commit/8fe8efaaf0202f804e80b36ec11b43d5aa34d511).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    OK by you @BryanCutler ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    **[Test build #87553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87553/testReport)** for PR 20627 at commit [`6b379ef`](https://github.com/apache/spark/commit/6b379eff4136226ca7a7fd9e5820c20f475cd828).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    **[Test build #87553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87553/testReport)** for PR 20627 at commit [`6b379ef`](https://github.com/apache/spark/commit/6b379eff4136226ca7a7fd9e5820c20f475cd828).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure par...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20627#discussion_r169156086
  
    --- Diff: python/pyspark/ml/evaluation.py ---
    @@ -394,15 +397,30 @@ def getMetricName(self):
         @keyword_only
         @since("2.3.0")
         def setParams(self, predictionCol="prediction", featuresCol="features",
    -                  metricName="silhouette"):
    +                  metricName="silhouette", distanceMeasure="squaredEuclidean"):
    --- End diff --
    
    Yeah these will not cause any incompatibility issues, the interpreter would automatically just pass the additional keyword arg.  These methods are actually all wrapped with `keyword_only` which make it look externally like `setParams(**kwargs)`, anyway.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    cc @srowen @BryanCutler 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87507/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20627: [SPARK-23217][ML][PYTHON] Add distanceMeasure param to C...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20627
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org