You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by miccagiann <gi...@git.apache.org> on 2014/07/29 01:49:34 UTC

[GitHub] spark pull request: Added support for regularizer and intercection...

GitHub user miccagiann opened a pull request:

    https://github.com/apache/spark/pull/1624

    Added support for regularizer and intercection parameters for linear reg...

    ...ression method.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/miccagiann/spark new-branch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1624.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1624
    
----
commit 3ac8874c351a228b596dfac018eb1578876aebef
Author: Michael Giannakopoulos <mi...@gmail.com>
Date:   2014-07-28T02:23:06Z

    Added support for regularizer and intercection parameters for linear regression method.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949959
  
    Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1624


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621154
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    Not using enumerations for `regType` parameter anymore. Switched to string values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627440
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
           dataBytesJRDD: JavaRDD[Array[Byte]],
           numIterations: Int,
           stepSize: Double,
    +      regParam: Double,
    +      regType: String,
    +      intercept: Boolean,
           miniBatchFraction: Double,
           initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    +    val lrAlg = new LinearRegressionWithSGD()
    +    lrAlg.setIntercept(intercept)
    +    lrAlg.optimizer
    +      .setNumIterations(numIterations)
    +      .setRegParam(regParam)
    +      .setStepSize(stepSize)
    +    if (regType == "l2")
    +      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
    +    else if (regType == "l1")
    +      lrAlg.optimizer.setUpdater(new L1Updater)
    --- End diff --
    
    It is safer to add
    ~~~
        else if (regType != "none")
          throw IllegalArgumentException("...")
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50953756
  
    I re-opened the JIRA. Please use the same JIRA number for your new PR. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50943683
  
    I did not know. I will make the changes right now!
    
    
    On Fri, Aug 1, 2014 at 5:43 PM, Xiangrui Meng <no...@github.com>
    wrote:
    
    > @miccagiann <https://github.com/miccagiann> In case you didn't know,
    > today is the deadline for merging new features into v1.1. MLlib is less
    > strict than core but we tried to meet the deadline if possible. This PR is
    > nice to have in v1.1 . Do you have time to work on it today? Or I can make
    > the changes and send you an PR to merge? Please let me know which works
    > better for you. Thanks!
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1624#issuecomment-50938240>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952097
  
    QA tests have started for PR 1624. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50946926
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949076
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621165
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
           dataBytesJRDD: JavaRDD[Array[Byte]],
           numIterations: Int,
           stepSize: Double,
    +      regParam: Double,
    +      regType: String,
    +      intercept: Boolean,
           miniBatchFraction: Double,
           initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    +    val lrAlg = new LinearRegressionWithSGD()
    +    lrAlg.setIntercept(intercept)
    +    lrAlg.optimizer.
    +      setNumIterations(numIterations).
    +      setRegParam(regParam).
    +      setStepSize(stepSize)
    +    if (regType == "SquaredUpdater")
    +      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
    +    else if (regType == "L1Updater")
    +      lrAlg.optimizer.setUpdater(new L1Updater)
    --- End diff --
    
    Not using enumerations for `regType` parameter anymore. Switched to string values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627451
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    --- End diff --
    
    To be safe, we shouldn't change the order of arguments, we can append new arguments `regParam`, `regType`, and `intercept` at the end. So if user used `train(data, 100, 1.0, 1.0, np.zeros(n))`, it still works in the new version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15566835
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
       private val DENSE_MATRIX_MAGIC: Byte = 3
       private val LABELED_POINT_MAGIC: Byte = 4
     
    +  /**
    +   * Enumeration used to define the type of Regularizer
    +   * used for linear methods.
    +   */
    +  object RegularizerType extends Serializable {
    +    val L2 : Int = 0
    +    val L1 : Int = 1
    +    val NONE : Int = 2
    +  }
    --- End diff --
    
    Using strings with a clear doc should be sufficient. Then you can map the string to `L1Updater` or `SquaredUpdater` inside `PythonMLLibAPI`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15723595
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
           dataBytesJRDD: JavaRDD[Array[Byte]],
           numIterations: Int,
           stepSize: Double,
    +      regParam: Double,
    +      regType: String,
    +      intercept: Boolean,
           miniBatchFraction: Double,
           initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    +    val lrAlg = new LinearRegressionWithSGD()
    +    lrAlg.setIntercept(intercept)
    +    lrAlg.optimizer
    +      .setNumIterations(numIterations)
    +      .setRegParam(regParam)
    +      .setStepSize(stepSize)
    +    if (regType == "l2")
    +      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
    +    else if (regType == "l1")
    +      lrAlg.optimizer.setUpdater(new L1Updater)
    --- End diff --
    
    By adding the exception to the scala code, I am going to remove the `ValueError` exception used in the python code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949963
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15559527
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -120,6 +120,23 @@ def train(cls, data, iterations=100, step=1.0,
                 d._jrdd, iterations, step, miniBatchFraction, i)
             return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights)
     
    +    @classmethod
    +    def trainL2Opt(cls, data, iterations=100, step=1.0, regParam=1.0,
    --- End diff --
    
    Ok! I am working in this issue!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50705527
  
    Btw, you can use `@param` for the doc for an argument. An example can be found at https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg.py#L41


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50946978
  
    @miccagiann Thanks for updating the PR! LGTM and waiting for Jenkins ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949270
  
    QA results for PR 1624:<br>- This patch FAILED unit tests.<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17725/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621881
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    Yes! I fixed it in the `regression.py` file where I was calling the same function again and again. As far as `PythonMLLibAPI().trainLinearRegressionModelWithSGD` I implement there the logic as well... I am building right now and I will commit instantly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50953197
  
    QA results for PR 1624:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952102
  
    LGTM. Waiting for Jenkins ....


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15623446
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,27 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    --- End diff --
    
    In Python, the line width for docs should be less than 80 (or 78 to be safe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50435652
  
    QA tests have started for PR 1624. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50950106
  
    QA tests have started for PR 1624. This patch DID NOT merge cleanly! <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17729/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50951638
  
    Yes, you need to merge the latest master and resolve conflicts first.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949232
  
    QA tests have started for PR 1624. This patch DID NOT merge cleanly! <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17725/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15623329
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,27 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater" or regType == "L1Updater" or regType == "NONE":
    --- End diff --
    
    You can change it to `l2`, `l1`, and `none` (simpler and lowercases)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621439
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    @miccagiann It may be easier if you send the string directly to `PythonMLLibAPI().trainLinearRegressionModelWithSGD` and implement the logic there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50708428
  
    I have applied the suggested changes! Please notify me if any more modifications should be performed. Thanks for all your help Xiangrui.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952504
  
    OK!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50438015
  
    QA results for PR 1624:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627450
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
    --- End diff --
    
    We use two empty lines to separate methods in pyspark. (I don't know the exact reason ...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Added support for regularizer and intercection...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50418096
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15621484
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    In the current version, all branches in the if-else block are essentially the same.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952492
  
    It should be part of the same JIRA. But let's do that in a separate PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627468
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """
    +        Train a linear regression model on the given data.
    +
    +        @param data:              The training data.
    +        @param iterations:        The number of iterations (default: 100).
    +        @param step:              The step parameter used in SGD
    +                                  (default: 1.0).
    +        @param regParam:          The regularizer parameter (default: 1.0).
    +        @param regType:           The type of regularizer used for training
    +                                  our model.
    +                                  Allowed values: "l1" for using L1Updater,
    +                                                  "l2" for using
    +                                                       SquaredL2Updater,
    +                                                  "none" for no regularizer.
    +                                  (default: None)
    --- End diff --
    
    It may be better to set the default to `"none"` and map `None` to `"none"` in the implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50953229
  
    Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50704428
  
    Not at all! I am going to change them! Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627458
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """
    +        Train a linear regression model on the given data.
    +
    +        @param data:              The training data.
    --- End diff --
    
    It is not necessary to align the doc, especially when limited by 78 characters.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952343
  
    Great! Do you mind adding regularization type and intercept to other linear methods? For example, `LogisticRegressionWithSGD` and  `SVMWithSGD`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50574026
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952545
  
    QA results for PR 1624:<br>- This patch PASSES unit tests.<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17729/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50953670
  
    Alright, I was fixing my branches so as my new commits to be included correctly in the new PR I am going to create.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50945142
  
    I have applied the changes. Please notify me for further issues. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952046
  
    I have done it. Thanks for all your help! Now, I suppose that I need to call Jenkins again, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952132
  
    I added you to the whitelist. Jenkins should be triggered automatically for changes from you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952429
  
    Yes! I can do this. Is there an issue created in JIRA or it would be part of the same PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15506490
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -120,6 +120,23 @@ def train(cls, data, iterations=100, step=1.0,
                 d._jrdd, iterations, step, miniBatchFraction, i)
             return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights)
     
    +    @classmethod
    +    def trainL2Opt(cls, data, iterations=100, step=1.0, regParam=1.0,
    --- End diff --
    
    Instead of adding new methods, we can add optional parameters to the original train method. For example, `regType` and `regParam`. User can set `regType` to `l1`, `l2`, or `none` (default).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15623308
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -247,16 +248,24 @@ class PythonMLLibAPI extends Serializable {
           dataBytesJRDD: JavaRDD[Array[Byte]],
           numIterations: Int,
           stepSize: Double,
    +      regParam: Double,
    +      regType: String,
    +      intercept: Boolean,
           miniBatchFraction: Double,
           initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    +    val lrAlg = new LinearRegressionWithSGD()
    +    lrAlg.setIntercept(intercept)
    +    lrAlg.optimizer.
    --- End diff --
    
    We usually put `.` at the beginning of the line:
    
    ~~~
    lrAlg.optimizer
      .setNumIterations(numIterations)
      .setRegParam(regParam)
      .setStepSize(stepSize)
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627480
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """
    +        Train a linear regression model on the given data.
    +
    +        @param data:              The training data.
    +        @param iterations:        The number of iterations (default: 100).
    +        @param step:              The step parameter used in SGD
    +                                  (default: 1.0).
    +        @param regParam:          The regularizer parameter (default: 1.0).
    +        @param regType:           The type of regularizer used for training
    +                                  our model.
    +                                  Allowed values: "l1" for using L1Updater,
    +                                                  "l2" for using
    +                                                       SquaredL2Updater,
    +                                                  "none" for no regularizer.
    +                                  (default: None)
    +        @param intercept:         Boolean parameter which indicates the use
    +                                  or not of the augmented representation for
    +                                  training data (i.e. whether bias features
    +                                  are activated or not).
    +        @param miniBatchFraction: Fraction of data to be used for each SGD
    +                                  iteration.
    +        @param initialWeights:    The initial weights (default: None).
    +        """
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "none", intercept, miniBatchFraction, i)
    +        elif regType == "l2" or regType == "l1" or regType == "none":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values: [l1, l2, none].")
             return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights)
     
    -
    --- End diff --
    
    ditto: please keep this empty line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50435465
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15622039
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,33 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """Train a linear regression model on the given data. The 'regType' parameter can take
    +           one from the following string values: "L1Updater" for invoking the lasso regularizer,
    +           "SquaredUpdater" for invoking the ridge regularizer or "NONE" for not using a
    +           regularizer at all. The user can determine the regularizer parameter by setting the
    +           appropriate value to variable 'regParam' (by default is set to 1.0)."""
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, "NONE", intercept, miniBatchFraction, i)
    +        elif regType == "SquaredUpdater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "L1Updater":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        elif regType == "NONE":
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    +                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    +        else:
    +            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
    +                             "using the following string values [L1Updater, SquaredUpdater, NONE].")
    --- End diff --
    
    What about now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15627471
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,45 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    -
     class LinearRegressionWithSGD(object):
         @classmethod
    -    def train(cls, data, iterations=100, step=1.0,
    -              miniBatchFraction=1.0, initialWeights=None):
    -        """Train a linear regression model on the given data."""
    +    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
    +              intercept=False, miniBatchFraction=1.0, initialWeights=None):
    +        """
    +        Train a linear regression model on the given data.
    +
    +        @param data:              The training data.
    +        @param iterations:        The number of iterations (default: 100).
    +        @param step:              The step parameter used in SGD
    +                                  (default: 1.0).
    +        @param regParam:          The regularizer parameter (default: 1.0).
    +        @param regType:           The type of regularizer used for training
    +                                  our model.
    +                                  Allowed values: "l1" for using L1Updater,
    +                                                  "l2" for using
    +                                                       SquaredL2Updater,
    +                                                  "none" for no regularizer.
    +                                  (default: None)
    +        @param intercept:         Boolean parameter which indicates the use
    +                                  or not of the augmented representation for
    +                                  training data (i.e. whether bias features
    +                                  are activated or not).
    +        @param miniBatchFraction: Fraction of data to be used for each SGD
    +                                  iteration.
    +        @param initialWeights:    The initial weights (default: None).
    +        """
             sc = data.context
    -        train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    -            d._jrdd, iterations, step, miniBatchFraction, i)
    +        if regType is None:
    +            train_f = lambda d, i: sc._jvm.PythonMLLibAPI().trainLinearRegressionModelWithSGD(
    --- End diff --
    
    To avoid having the long command twice, you can use
    
    ~~~
    if regType is None:
      regType = "none"
    if regType in {"l2", "l1", "none"}:
        train_f = ...
    else:
        raise ...
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949450
  
    I need to put braces in `if` and `else if` statements in scala...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15566654
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
       private val DENSE_MATRIX_MAGIC: Byte = 3
       private val LABELED_POINT_MAGIC: Byte = 4
     
    +  /**
    +   * Enumeration used to define the type of Regularizer
    +   * used for linear methods.
    +   */
    +  object RegularizerType extends Serializable {
    +    val L2 : Int = 0
    +    val L1 : Int = 1
    +    val NONE : Int = 2
    +  }
    --- End diff --
    
    I used a type of Enumeration in order to separate between the different types of Update Methods [Regularizers] with which the user wants to create the model from training data. I tried to extend this object from Enumeration but from what I have seen it uses reflection heavily and it does not work well with serialized objects and with py4j...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50704107
  
    @miccagiann  For `regType`, I think people are more familiar with `l1` and `l2` than `L1Updater` and `SquaredUpdater`. Do you mind changing them? Lowercase names are preferred because we use lowercase method names in other places.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15566894
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
       private val DENSE_MATRIX_MAGIC: Byte = 3
       private val LABELED_POINT_MAGIC: Byte = 4
     
    +  /**
    +   * Enumeration used to define the type of Regularizer
    +   * used for linear methods.
    +   */
    +  object RegularizerType extends Serializable {
    +    val L2 : Int = 0
    +    val L1 : Int = 1
    +    val NONE : Int = 2
    +  }
    --- End diff --
    
    Ok! I will do it with strings both in python and in scala. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50951256
  
    Xiangrui,
    
    After the tests are finished, should I merge my local branch with the upstream/master so as to make this patch merging smoothly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50938240
  
    @miccagiann In case you didn't know, today is the deadline for merging new features into v1.1. MLlib is less strict than core but we tried to meet the deadline if possible. This PR is nice to have in v1.1 . Do you have time to work on it today? Or I can make the changes and send you an PR to merge? Please let me know which works better for you. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50953699
  
    Xiangrui,
    
    I see that the JIRA issue is closed. Should we create a new one for the `LogisticRegressionWithSGD` and for `SVMWithSGD`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50949582
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1624#discussion_r15566690
  
    --- Diff: python/pyspark/mllib/regression.py ---
    @@ -109,18 +109,35 @@ class LinearRegressionModel(LinearRegressionModelBase):
         True
         """
     
    +class RegularizerType(object):
    +    L2 = 0
    +    L1 = 1
    +    NONE = 2
    --- End diff --
    
    The same enumeration is provided and through the `regression.py` file purely in python so as users to use it directly and avoid providing integer values to parameter `regType` in function `train` of `LinearRegressionWithSGD`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

Posted by miccagiann <gi...@git.apache.org>.
Github user miccagiann commented on the pull request:

    https://github.com/apache/spark/pull/1624#issuecomment-50952158
  
    Nice! Thanks for everything! Tomorrow I am going to search for new issues
    under your supervision. You have helped me a lot!
    
    
    On Fri, Aug 1, 2014 at 10:55 PM, Xiangrui Meng <no...@github.com>
    wrote:
    
    > I added you to the whitelist. Jenkins should be triggered automatically
    > for changes from you.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1624#issuecomment-50952132>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---