Posted to reviews@spark.apache.org by ajtulloch <gi...@git.apache.org> on 2014/05/11 01:31:44 UTC

[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

GitHub user ajtulloch opened a pull request:

    https://github.com/apache/spark/pull/725

    SPARK-1791 - SVM implementation does not use threshold parameter

    Summary:
    https://issues.apache.org/jira/browse/SPARK-1791
    
    Simple fix, and backward compatible, since
    
    - anyone who set the threshold was getting completely wrong answers.
    - anyone who did not set the threshold had the default 0.0 value for the threshold anyway.
    
    Test Plan:
    Unit test added that is verified to fail under the old implementation,
    and pass under the new implementation.
    
    Reviewers:
    
    CC:
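
For readers of this thread, here is a minimal sketch of the behaviour the summary describes. It is not the code from the patch: `LinearModelSketch`, `margin`, `predictIgnoringThreshold`, and `predict` are hypothetical names, and plain arrays stand in for MLlib vectors. It assumes a prediction is made by comparing the linear margin against a cut-off, and that the reported bug amounted to always using 0.0 as that cut-off.

```scala
// Hypothetical stand-in for a linear SVM model; this is NOT Spark's SVMModel.
final case class LinearModelSketch(
    weights: Array[Double],
    intercept: Double,
    threshold: Double = 0.0) {

  // Raw margin: dot product of weights and features, plus the intercept.
  def margin(features: Array[Double]): Double = {
    require(features.length == weights.length, "dimension mismatch")
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  // Assumed shape of the reported bug: the configured threshold is ignored
  // and the cut-off is always 0.0.
  def predictIgnoringThreshold(features: Array[Double]): Double =
    if (margin(features) < 0.0) 0.0 else 1.0

  // Assumed intended behaviour: compare the margin with the configured threshold.
  def predict(features: Array[Double]): Double =
    if (margin(features) < threshold) 0.0 else 1.0
}
```

With the default threshold of 0.0 the two methods agree on every input, which is why the change is described as backward compatible. They diverge only when a caller sets a non-zero threshold and the margin falls between 0.0 and that threshold.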

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajtulloch/spark SPARK-1791-SVM

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #725
    
----
commit 6f7075a75927b08c3c1211642632e03c2246f5cd
Author: Andrew Tulloch <an...@tullo.ch>
Date:   2014-05-10T23:22:02Z

    SPARK-1791 - SVM implementation does not use threshold parameter
    
    Summary:
    https://issues.apache.org/jira/browse/SPARK-1791
    
    Simple fix, and backward compatible, since
    
    - anyone who set the threshold was getting completely wrong answers.
    - anyone who did not set the threshold had the default 0.0 value for the threshold anyway.
    
    Test Plan:
    Unit test added that is verified to fail under the old implementation,
    and pass under the new implementation.
    
    Reviewers:
    
    CC:

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-42758415
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14879/



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/725#discussion_r12594467
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/SVMSuite.scala ---
    @@ -69,6 +69,43 @@ class SVMSuite extends FunSuite with LocalSparkContext {
         assert(numOffPredictions < input.length / 5)
       }
     
    +  test("SVM with threshold") {
    +    val nPoints = 10000
    +
    +    // NOTE: Intercept should be small for generating equal 0s and 1s
    +    val A = 0.01
    +    val B = -1.5
    +    val C = 1.0
    +
    +    val testData = SVMSuite.generateSVMInput(A, Array[Double](B,C), nPoints, 42)
    +
    +    val testRDD = sc.parallelize(testData, 2)
    +    testRDD.cache()
    +
    +    val svm = new SVMWithSGD().setIntercept(true)
    +    svm.optimizer.setStepSize(1.0).setRegParam(1.0).setNumIterations(100)
    +
    +    val model = svm.run(testRDD)
    +
    +    val validationData = SVMSuite.generateSVMInput(A, Array[Double](B,C), nPoints, 17)
    --- End diff --
    
    Ditto.



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by ajtulloch <gi...@git.apache.org>.
Github user ajtulloch commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-43020578
  
    @mengxr - thanks for the comments, I've updated with the fixes.  Please also have a look at https://github.com/apache/spark/pull/726 which cleans up formatting, code duplication, etc in the `SVMSuite.scala` file.



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-43029884
  
    LGTM. Thanks! I will take a look at #726 today.



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-43020506
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/725#discussion_r12594596
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/SVMSuite.scala ---
    @@ -69,6 +69,43 @@ class SVMSuite extends FunSuite with LocalSparkContext {
         assert(numOffPredictions < input.length / 5)
       }
     
    +  test("SVM with threshold") {
    +    val nPoints = 10000
    +
    +    // NOTE: Intercept should be small for generating equal 0s and 1s
    +    val A = 0.01
    +    val B = -1.5
    +    val C = 1.0
    +
    +    val testData = SVMSuite.generateSVMInput(A, Array[Double](B,C), nPoints, 42)
    +
    +    val testRDD = sc.parallelize(testData, 2)
    +    testRDD.cache()
    +
    +    val svm = new SVMWithSGD().setIntercept(true)
    +    svm.optimizer.setStepSize(1.0).setRegParam(1.0).setNumIterations(100)
    +
    +    val model = svm.run(testRDD)
    +
    +    val validationData = SVMSuite.generateSVMInput(A, Array[Double](B,C), nPoints, 17)
    +    val validationRDD  = sc.parallelize(validationData, 2)
    +
    +    // Test prediction on RDD.
    +
    +    var predictions = model.predict(validationRDD.map(_.features)).collect()
    +    assert(predictions.count {_ == 0.0 } != predictions.length)
    --- End diff --
    
    Change to `count(_ == 0.0)`, which is more common in Spark.
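
As a small illustration of this style point, the two spellings call the same `count` method with the same predicate and differ only in how the argument is written. The object name `CountStyles` and the sample values are made up for the example.

```scala
// Hypothetical example data; not taken from the test suite under review.
object CountStyles {
  val predictions: Array[Double] = Array(0.0, 1.0, 0.0, 1.0, 1.0)

  // Brace form, as written in the diff.
  val withBraces: Int = predictions.count { _ == 0.0 }

  // Parenthesis form, as requested in the review.
  val withParens: Int = predictions.count(_ == 0.0)
}
```

Both values are equal, so the requested change affects readability only and does not alter what the assertion checks.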



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-42981423
  
    ah, I made this ...



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-42757678
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-43020526
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-42757676
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-43029934
  
    Thanks. Merged this into master & branch-1.0.



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/725#discussion_r12594375
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/classification/SVMSuite.scala ---
    @@ -69,6 +69,43 @@ class SVMSuite extends FunSuite with LocalSparkContext {
         assert(numOffPredictions < input.length / 5)
       }
     
    +  test("SVM with threshold") {
    +    val nPoints = 10000
    +
    +    // NOTE: Intercept should be small for generating equal 0s and 1s
    +    val A = 0.01
    +    val B = -1.5
    +    val C = 1.0
    +
    +    val testData = SVMSuite.generateSVMInput(A, Array[Double](B,C), nPoints, 42)
    --- End diff --
    
    Please put a `,` after `B`.



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-43023787
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14949/



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-42758411
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/725



[GitHub] spark pull request: SPARK-1791 - SVM implementation does not use t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/725#issuecomment-43023786
  
    Merged build finished. All automated tests passed.

