You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by miccagiann <gi...@git.apache.org> on 2014/07/29 01:49:34 UTC

[GitHub] spark pull request: Added support for regularizer and intercection...

GitHub user miccagiann opened a pull request:

    https://github.com/apache/spark/pull/1624

    Added support for regularizer and intercection parameters for linear reg...

    ...ression method.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/miccagiann/spark new-branch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1624.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1624
    
----
commit 3ac8874c351a228b596dfac018eb1578876aebef
Author: Michael Giannakopoulos <mi...@gmail.com>
Date:   2014-07-28T02:23:06Z

    Added support for regularizer and intercection parameters for linear regression method.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949959
  
    Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1624


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621154
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    Not using enumerations for `regType` parameter anymore. Switched to string values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627440
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
           dataBytesJRDD: JavaRDD[Array[Byte]],
           numIterations: Int,
           stepSize: Double,
    +      regParam: Double,
    +      regType: String,
    +      intercept: Boolean,
           miniBatchFraction: Double,
           initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    +    val lrAlg = new LinearRegressionWithSGD()
    +    lrAlg.setIntercept(intercept)
    +    lrAlg.optimizer
    +      .setNumIterations(numIterations)
    +      .setRegParam(regParam)
    +      .setStepSize(stepSize)
    +    if (regType == "l2")
    +      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
    +    else if (regType == "l1")
    +      lrAlg.optimizer.setUpdater(new L1Updater)
    --- End diff --
    
    It is safer to add
    ~~~
        else if (regType != "none")
          throw IllegalArgumentException("...")
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50953756
  
    I re-opened the JIRA. Please use the same JIRA number for your new PR. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50943683
  
    I did not know. I will make the changes right now!
    
    
    On Fri, Aug 1, 2014 at 5:43 PM, Xiangrui Meng <no...@github.com>
    wrote:
    
    > @miccagiann <https://github.com/miccagiann> In case you didn't know,
    > today is the deadline for merging new features into v1.1. MLlib is less
    > strict than core but we tried to meet the deadline if possible. This PR is
    > nice to have in v1.1 . Do you have time to work on it today? Or I can make
    > the changes and send you an PR to merge? Please let me know which works
    > better for you. Thanks!
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1624#issuecomment-50938240>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952097
  
    QA tests have started for PR 1624. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50946926
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949076
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621165
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
           dataBytesJRDD: JavaRDD[Array[Byte]],
           numIterations: Int,
           stepSize: Double,
    +      regParam: Double,
    +      regType: String,
    +      intercept: Boolean,
           miniBatchFraction: Double,
           initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    +    val lrAlg = new LinearRegressionWithSGD()
    +    lrAlg.setIntercept(intercept)
    +    lrAlg.optimizer.
    +      setNumIterations(numIterations).
    +      setRegParam(regParam).
    +      setStepSize(stepSize)
    +    if (regType == "SquaredUpdater")
    +      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
    +    else if (regType == "L1Updater")
    +      lrAlg.optimizer.setUpdater(new L1Updater)
    --- End diff --
    
    Not using enumerations for `regType` parameter anymore. Switched to string values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627451
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    --- End diff --
    
    To be safe, we shouldn't change the order of arguments, we can append new arguments `regParam`, `regType`, and `intercept` at the end. So if user used `train(data, 100, 1.0, 1.0, np.zeros(n))`, it still works in the new version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15566835
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
       private val DENSE_MATRIX_MAGIC: Byte = 3
       private val LABELED_POINT_MAGIC: Byte = 4
     
    +  /**
    +   * Enumeration used to define the type of Regularizer
    +   * used for linear methods.
    +   */
    +  object RegularizerType extends Serializable {
    +    val L2 : Int = 0
    +    val L1 : Int = 1
    +    val NONE : Int = 2
    +  }
    --- End diff --
    
    Using strings with a clear doc should be sufficient. Then you can map the string to `L1Updater` or `SquaredUpdater` inside `PythonMLLibAPI`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15723595
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
           dataBytesJRDD: JavaRDD[Array[Byte]],
           numIterations: Int,
           stepSize: Double,
    +      regParam: Double,
    +      regType: String,
    +      intercept: Boolean,
           miniBatchFraction: Double,
           initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    +    val lrAlg = new LinearRegressionWithSGD()
    +    lrAlg.setIntercept(intercept)
    +    lrAlg.optimizer
    +      .setNumIterations(numIterations)
    +      .setRegParam(regParam)
    +      .setStepSize(stepSize)
    +    if (regType == "l2")
    +      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
    +    else if (regType == "l1")
    +      lrAlg.optimizer.setUpdater(new L1Updater)
    --- End diff --
    
    By adding the exception to the scala code, I am going to remove the `ValueError` exception used in the python code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949963
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15559527
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -120,6 +120,23 @@ def train(cls, data, iterations=100, step=1.0,
                 d._jrdd, iterations, step, miniBatchFraction, i)
             return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights)
     
    +    @classmethod
    +    def trainL2Opt(cls, data, iterations=100, step=1.0, regParam=1.0,
    --- End diff --
    
    Ok! I am working in this issue!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50705527
  
    Btw, you can use `@param` for the doc for an argument. An example can be found at https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg.py#L41


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50946978
  
    @miccagiann Thanks for updating the PR! LGTM and waiting for Jenkins ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949270
  
    QA results for PR 1624:<br>- This patch FAILED unit tests.<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17725/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621881
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    Yes! I fixed it in the `regression.py` file where I was calling the same function again and again. As far as `PythonMLLibAPI().trainLinearRegressionModelWithSGD` I implement there the logic as well... I am building right now and I will commit instantly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50953197
  
    QA results for PR 1624:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952102
  
    LGTM. Waiting for Jenkins ....


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15623446
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,27 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    --- End diff --
    
    In Python, the line width for docs should be less than 80 (or 78 to be safe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50435652
  
    QA tests have started for PR 1624. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50950106
  
    QA tests have started for PR 1624. This patch DID NOT merge cleanly! <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17729/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50951638
  
    Yes, you need to merge the latest master and resolve conflicts first.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949232
  
    QA tests have started for PR 1624. This patch DID NOT merge cleanly! <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17725/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15623329
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,27 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater" or regType == "L1Updater" or regType == "NONE":
    --- End diff --
    
    You can change it to `l2`, `l1`, and `none` (simpler and lowercases)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621439
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    @miccagiann It may be easier if you send the string directly to `PythonMLLibAPI().trainLinearRegressionModelWithSGD` and implement the logic there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50708428
  
    I have applied the suggested changes! Please notify me if any more modifications should be performed. Thanks for all your help Xiangrui.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.

Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952504
  
    OK!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50438015
  
    QA results for PR 1624:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627450
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
    --- End diff --
    
    We use two empty lines to separate methods in pyspark. (I don't know the exact reason ...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---