Posted to reviews@spark.apache.org by vectorijk <gi...@git.apache.org> on 2016/03/02 13:51:50 UTC

[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

GitHub user vectorijk opened a pull request:

    https://github.com/apache/spark/pull/11468

    [SPARK-13597][PySpark][ML] Python API for GeneralizedLinearRegression

    ## What changes were proposed in this pull request?
    
    Python API for GeneralizedLinearRegression
    JIRA: https://issues.apache.org/jira/browse/SPARK-13597
    
    ## How was this patch tested?
    
    The patch is tested with Python doctest.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vectorijk/spark spark-13597

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11468
    
----
commit af5f7ebc17e5e64b3bf3a7c05d3c02c26dbca801
Author: Kai Jiang <ji...@gmail.com>
Date:   2016-03-02T12:03:30Z

    [SPARK-13597][PySpark][ML] Python API for GeneralizedLinearRegression

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-207636548
  
    **[Test build #55402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55402/consoleFull)** for PR 11468 at commit [`f60e48c`](https://github.com/apache/spark/commit/f60e48c9fd5bf87e080a833dc4feed5f51427b96).



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-207639631
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r54833559
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -857,6 +858,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    are listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (17.05224, Vectors.dense(3.55954, 11.19528)),
    +    ...     (13.46161, Vectors.dense(2.34561, 9.65407)),
    +    ...     (17.13384, Vectors.dense(3.37980, 12.03069)),
    +    ...     (13.84938, Vectors.dense(2.51969, 9.64902)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression()
    +    >>> model = glr.setFamily("gaussian").setLink("identity").fit(df)
    +    >>> model.transform(df).show()
    +    +--------+------------------+------------------+
    +    |   label|          features|        prediction|
    +    +--------+------------------+------------------+
    +    |17.05224|[3.55954,11.19528]|17.052776698886376|
    +    |13.46161| [2.34561,9.65407]|13.463078911930246|
    +    |17.13384| [3.3798,12.03069]| 17.13348844246882|
    +    |13.84938| [2.51969,9.64902]|13.847725946714558|
    +    +--------+------------------+------------------+
    +    ...
    +    >>> model.coefficients
    +    DenseVector([2.2263, 0.5756])
    +    >>> model.intercept
    +    2.6841196897757795
    --- End diff --
    
    The test may be unstable; it's better to use an ellipsis when testing a double result, such as ```2.68...```.
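
As an illustration of the ellipsis suggestion, here is a minimal sketch that uses only the standard `doctest` module. The `DOCSTRING` text and the name `ellipsis_demo` are made up for this example; the literal value is the intercept printed in the quoted diff. With the `+ELLIPSIS` directive, an expected output of `2.68...` matches any actual output that starts with `2.68`, so differences in the trailing digits should not fail the test.

```python
import doctest

# Hypothetical doctest text: the second example compares only the leading
# digits of the float and lets "..." absorb the rest.
DOCSTRING = """
>>> intercept = 2.6841196897757795
>>> intercept  # doctest: +ELLIPSIS
2.68...
"""

# Parse the text into a DocTest object and run it in an empty namespace.
parser = doctest.DocTestParser()
test = parser.get_doctest(DOCSTRING, {}, "ellipsis_demo", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# results is a (failed, attempted) pair.
print(results)
```

Without the directive, the expected text would have to match the printed float digit for digit.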



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r54834807
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -857,6 +858,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    are listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (17.05224, Vectors.dense(3.55954, 11.19528)),
    +    ...     (13.46161, Vectors.dense(2.34561, 9.65407)),
    +    ...     (17.13384, Vectors.dense(3.37980, 12.03069)),
    +    ...     (13.84938, Vectors.dense(2.51969, 9.64902)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression()
    +    >>> model = glr.setFamily("gaussian").setLink("identity").fit(df)
    +    >>> model.transform(df).show()
    +    +--------+------------------+------------------+
    +    |   label|          features|        prediction|
    +    +--------+------------------+------------------+
    +    |17.05224|[3.55954,11.19528]|17.052776698886376|
    +    |13.46161| [2.34561,9.65407]|13.463078911930246|
    +    |17.13384| [3.3798,12.03069]| 17.13348844246882|
    +    |13.84938| [2.51969,9.64902]|13.847725946714558|
    +    +--------+------------------+------------------+
    +    ...
    +    >>> model.coefficients
    +    DenseVector([2.2263, 0.5756])
    +    >>> model.intercept
    +    2.6841196897757795
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    family = Param(Params._dummy(), "family", "The name of family which is a description of " +
    +                   "the error distribution to be used in the model. Supported options: " +
    +                   "gaussian(default), binomial, poisson and gamma.")
    +    link = Param(Params._dummy(), "link", "The name of link function which provides the " +
    +                 "relationship between the linear predictor and the mean of the distribution " +
    +                 "function. Supported options: identity, log, inverse, logit, probit, cloglog " +
    +                 "and sqrt.")
    +
    +    @keyword_only
    +    def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None,
    +                 solver="irls"):
    +        """
    +        __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, \
    +                 solver="irls")
    +        """
    +        super(GeneralizedLinearRegression, self).__init__()
    +        self._java_obj = self._new_java_obj(
    +            "org.apache.spark.ml.regression.GeneralizedLinearRegression", self.uid)
    +        self._setDefault(family="gaussian", link="identity")
    --- End diff --
    
    @vectorijk For example, when a user sets ```family=binomial``` without setting ```link```, the ```link``` will be set to ```logit``` for training and prediction, because ```logit``` is the default link function for the ```binomial``` family. Users can still set ```link=probit``` or ```link=cloglog```.
    We do not set a default value for ```link``` here because each ```family``` has its own default link function.
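
The defaulting rule described in this comment can be sketched in plain Python. The table is copied from the docstring in the quoted diff, where the first link listed for each family is its default. `resolve_link` is a hypothetical helper written for illustration only; it is not part of the PySpark API, and it is not a claim about how Spark resolves the link internally.

```python
# Valid links per family, copied from the docstring in the quoted diff.
# The first entry of each list is that family's default link.
SUPPORTED_LINKS = {
    "gaussian": ["identity", "log", "inverse"],
    "binomial": ["logit", "probit", "cloglog"],
    "poisson": ["log", "identity", "sqrt"],
    "gamma": ["inverse", "identity", "log"],
}


def resolve_link(family, link=None):
    """Return the link to use for a family (hypothetical helper).

    If link is None, fall back to the family's default; otherwise check
    that the requested link is valid for the family.
    """
    if family not in SUPPORTED_LINKS:
        raise ValueError("Unsupported family: %s" % family)
    valid = SUPPORTED_LINKS[family]
    if link is None:
        return valid[0]
    if link not in valid:
        raise ValueError("Link %s is not valid for family %s" % (link, family))
    return link


print(resolve_link("binomial"))            # link left unset
print(resolve_link("binomial", "probit"))  # link set explicitly
```

Under this reading, a fixed default such as `link="identity"` on the Python side would override the per-family default whenever the user changes only `family`.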



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205730049
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r54834301
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -857,6 +858,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    are listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (17.05224, Vectors.dense(3.55954, 11.19528)),
    +    ...     (13.46161, Vectors.dense(2.34561, 9.65407)),
    +    ...     (17.13384, Vectors.dense(3.37980, 12.03069)),
    +    ...     (13.84938, Vectors.dense(2.51969, 9.64902)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression()
    +    >>> model = glr.setFamily("gaussian").setLink("identity").fit(df)
    +    >>> model.transform(df).show()
    +    +--------+------------------+------------------+
    +    |   label|          features|        prediction|
    +    +--------+------------------+------------------+
    +    |17.05224|[3.55954,11.19528]|17.052776698886376|
    +    |13.46161| [2.34561,9.65407]|13.463078911930246|
    +    |17.13384| [3.3798,12.03069]| 17.13348844246882|
    +    |13.84938| [2.51969,9.64902]|13.847725946714558|
    +    +--------+------------------+------------------+
    +    ...
    +    >>> model.coefficients
    +    DenseVector([2.2263, 0.5756])
    +    >>> model.intercept
    +    2.6841196897757795
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    family = Param(Params._dummy(), "family", "The name of family which is a description of " +
    +                   "the error distribution to be used in the model. Supported options: " +
    +                   "gaussian(default), binomial, poisson and gamma.")
    +    link = Param(Params._dummy(), "link", "The name of link function which provides the " +
    +                 "relationship between the linear predictor and the mean of the distribution " +
    +                 "function. Supported options: identity, log, inverse, logit, probit, cloglog " +
    +                 "and sqrt.")
    +
    +    @keyword_only
    +    def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None,
    +                 solver="irls"):
    +        """
    +        __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, \
    +                 solver="irls")
    +        """
    +        super(GeneralizedLinearRegression, self).__init__()
    +        self._java_obj = self._new_java_obj(
    +            "org.apache.spark.ml.regression.GeneralizedLinearRegression", self.uid)
    +        self._setDefault(family="gaussian", link="identity")
    --- End diff --
    
    You mean the link would be set once the family is set, right? If so, could we just leave `link` empty and not pass anything for it?



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-191228337
  
    cc @mengxr @yanboliang 



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205731070
  
    **[Test build #54977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54977/consoleFull)** for PR 11468 at commit [`822c844`](https://github.com/apache/spark/commit/822c8444a21316fd56831448dd4dd6b594a2672c).



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-207194961
  
    @vectorijk This PR looks good overall, please address my last comments. Thanks!



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r54774746
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -857,6 +858,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    are listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (17.05224, Vectors.dense(3.55954, 11.19528)),
    +    ...     (13.46161, Vectors.dense(2.34561, 9.65407)),
    +    ...     (17.13384, Vectors.dense(3.37980, 12.03069)),
    +    ...     (13.84938, Vectors.dense(2.51969, 9.64902)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression()
    --- End diff --
    
    `glr = GeneralizedLinearRegression(family="gaussian", link="identity")`
    
    Keyword args are preferred in Python.
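
To make the contrast concrete, here is a small self-contained sketch of the two construction styles. `ToyGLR` is a hypothetical class holding only the two params under discussion, and the `keyword_only` function below is a simplified stand-in for the decorator of the same name used in the diff; PySpark's actual implementation may differ.

```python
import functools


def keyword_only(func):
    """Reject positional arguments other than self (simplified stand-in)."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(self, **kwargs)
    return wrapper


class ToyGLR(object):
    """Hypothetical estimator with only the family and link params."""

    @keyword_only
    def __init__(self, family="gaussian", link=None):
        self.family = family
        self.link = link

    def setFamily(self, value):
        self.family = value
        return self

    def setLink(self, value):
        self.link = value
        return self


# Setter chaining, as in the original doctest.
chained = ToyGLR().setFamily("gaussian").setLink("identity")

# Keyword arguments, as suggested in the review.
keyword = ToyGLR(family="gaussian", link="identity")

print((chained.family, chained.link) == (keyword.family, keyword.link))
```

Both forms configure the object identically; the keyword form does it in a single call and names every param at the call site.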



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r59099605
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -934,6 +935,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver, JavaMLWritable, JavaMLReadable):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    are listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (1.0, Vectors.dense(1.0, 0.0)),
    +    ...     (1.0, Vectors.dense(1.0, 2.0)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression(family="gaussian", link="identity")
    +    >>> model = glr.fit(df)
    +    >>> abs(model.transform(df).head().prediction - 1.0) < 0.001
    +    True
    +    >>> model.coefficients
    +    DenseVector([0.0, 0.0])
    +    >>> abs(model.intercept - 1.0) < 0.001
    +    True
    +    >>> glr_path = temp_path + "/glr"
    +    >>> glr.save(glr_path)
    +    >>> glr2 = GeneralizedLinearRegression.load(glr_path)
    +    >>> glr.getFamily() == glr2.getFamily()
    +    True
    +    >>> model_path = temp_path + "/glr_model"
    +    >>> model.save(model_path)
    +    >>> model2 = GeneralizedLinearRegressionModel.load(model_path)
    +    >>> abs(model.intercept - model2.intercept) < 0.001
    +    True
    --- End diff --
    
    @yanboliang I have addressed your comments. I also modified the test data so that the coefficients are not zero.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-191228299
  
    **[Test build #52317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52317/consoleFull)** for PR 11468 at commit [`af5f7eb`](https://github.com/apache/spark/commit/af5f7ebc17e5e64b3bf3a7c05d3c02c26dbca801).



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205728077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54974/
    Test FAILed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205733822
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-204450309
  
    @vectorijk Do you have time to update this PR? If not, I can help.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-208555285
  
    **[Test build #55534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55534/consoleFull)** for PR 11468 at commit [`4199c93`](https://github.com/apache/spark/commit/4199c93997b4b97cab03e1adb9019789f9a56673).



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-207639633
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55402/
    Test PASSed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205733824
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54977/
    Test PASSed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-207639555
  
    **[Test build #55402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55402/consoleFull)** for PR 11468 at commit [`f60e48c`](https://github.com/apache/spark/commit/f60e48c9fd5bf87e080a833dc4feed5f51427b96).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-208559614
  
    @jkbradley I have addressed all the comments. Could you review this again?



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-191231224
  
    **[Test build #52317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52317/consoleFull)** for PR 11468 at commit [`af5f7eb`](https://github.com/apache/spark/commit/af5f7ebc17e5e64b3bf3a7c05d3c02c26dbca801).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,`
      * `class GeneralizedLinearRegressionModel(JavaModel):`



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-208354853
  
    LGTM except the last minor issue.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205733698
  
    **[Test build #54977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54977/consoleFull)** for PR 11468 at commit [`822c844`](https://github.com/apache/spark/commit/822c8444a21316fd56831448dd4dd6b594a2672c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class GeneralizedLinearRegressionModel(JavaModel, JavaMLWritable, JavaMLReadable):`



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-191231638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52317/



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r59206534
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -934,6 +935,150 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver, JavaMLWritable, JavaMLReadable):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    is listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (1.0, Vectors.dense(0.0, 0.0)),
    +    ...     (1.0, Vectors.dense(1.0, 2.0)),
    +    ...     (2.0, Vectors.dense(0.0, 0.0)),
    +    ...     (2.0, Vectors.dense(1.0, 1.0)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression(family="gaussian", link="identity")
    +    >>> model = glr.fit(df)
    +    >>> abs(model.transform(df).head().prediction - 1.5) < 0.001
    +    True
    +    >>> model.coefficients
    +    DenseVector([1.5, -1.0])
    --- End diff --
    
    Using ```DenseVector([1.5..., -1.0...])``` as the expected output would be more robust.
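
A minimal sketch of why the ellipsis form is sturdier, assuming the doctest runs with ELLIPSIS matching enabled (here via an inline directive). The `DenseVector` class below is a hypothetical stand-in written for this illustration, not pyspark's class, and `noisy_coefficients` and `count_failures` are made-up helper names. The sketch runs the same call against an exact expected output and against the ellipsis form.

```python
import doctest


class DenseVector(object):
    """Hypothetical stand-in for pyspark's DenseVector (illustration only)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __repr__(self):
        return "DenseVector([%s])" % ", ".join(repr(v) for v in self.values)


def noisy_coefficients():
    # Simulates a solver result that differs from (1.5, -1.0) in the last digits.
    return DenseVector([1.5000000001, -1.0000000002])


# Expected output written exactly, as in the diff under review.
EXACT_DOC = """
>>> noisy_coefficients()
DenseVector([1.5, -1.0])
"""

# Expected output written with ellipses, as suggested in the comment.
ELLIPSIS_DOC = """
>>> noisy_coefficients()  # doctest: +ELLIPSIS
DenseVector([1.5..., -1.0...])
"""


def count_failures(docstring):
    """Run one docstring as a doctest and return the number of failed examples."""
    globs = {"noisy_coefficients": noisy_coefficients}
    test = doctest.DocTestParser().get_doctest(docstring, globs, "sketch", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None).failed


if __name__ == "__main__":
    print(count_failures(EXACT_DOC))     # exact repr comparison
    print(count_failures(ELLIPSIS_DOC))  # tolerant of trailing digits
```

Note that the ellipsis only absorbs extra trailing digits. A result such as 1.4999999999 would still fail, which is why the docstring also uses `abs(x - expected) < 0.001` checks for scalar values.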



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205723699
  
    **[Test build #54974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54974/consoleFull)** for PR 11468 at commit [`822c844`](https://github.com/apache/spark/commit/822c8444a21316fd56831448dd4dd6b594a2672c).



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r54834037
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -857,6 +858,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    is listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (17.05224, Vectors.dense(3.55954, 11.19528)),
    +    ...     (13.46161, Vectors.dense(2.34561, 9.65407)),
    +    ...     (17.13384, Vectors.dense(3.37980, 12.03069)),
    +    ...     (13.84938, Vectors.dense(2.51969, 9.64902)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression()
    +    >>> model = glr.setFamily("gaussian").setLink("identity").fit(df)
    +    >>> model.transform(df).show()
    +    +--------+------------------+------------------+
    +    |   label|          features|        prediction|
    +    +--------+------------------+------------------+
    +    |17.05224|[3.55954,11.19528]|17.052776698886376|
    +    |13.46161| [2.34561,9.65407]|13.463078911930246|
    +    |17.13384| [3.3798,12.03069]| 17.13348844246882|
    +    |13.84938| [2.51969,9.64902]|13.847725946714558|
    +    +--------+------------------+------------------+
    +    ...
    +    >>> model.coefficients
    +    DenseVector([2.2263, 0.5756])
    +    >>> model.intercept
    +    2.6841196897757795
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    family = Param(Params._dummy(), "family", "The name of family which is a description of " +
    +                   "the error distribution to be used in the model. Supported options: " +
    +                   "gaussian(default), binomial, poisson and gamma.")
    +    link = Param(Params._dummy(), "link", "The name of link function which provides the " +
    +                 "relationship between the linear predictor and the mean of the distribution " +
    +                 "function. Supported options: identity, log, inverse, logit, probit, cloglog " +
    +                 "and sqrt.")
    +
    +    @keyword_only
    +    def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None,
    +                 solver="irls"):
    +        """
    +        __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, \
    +                 solver="irls")
    +        """
    +        super(GeneralizedLinearRegression, self).__init__()
    +        self._java_obj = self._new_java_obj(
    +            "org.apache.spark.ml.regression.GeneralizedLinearRegression", self.uid)
    +        self._setDefault(family="gaussian", link="identity")
    --- End diff --
    
    We did not set a default value for ```link``` on the Scala side because it is determined by ```family```. For example, if users set ```family="binomial"``` and do not set ```link```, the link is set to ```logit``` for both training and prediction.
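
The behaviour described above can be sketched in plain Python. This is only an illustration, not the Scala implementation: `resolve_link` is a hypothetical helper, and rejecting a link that is not listed for the family is an assumption of the sketch. The family-to-link table is copied from the docstring in the diff, where the first link listed for each family is the default.

```python
# Links per family as listed in the docstring under review; the first
# entry of each list is the default for that family.
SUPPORTED_LINKS = {
    "gaussian": ["identity", "log", "inverse"],
    "binomial": ["logit", "probit", "cloglog"],
    "poisson": ["log", "identity", "sqrt"],
    "gamma": ["inverse", "identity", "log"],
}


def resolve_link(family, link=None):
    """Hypothetical helper: return the link to use for a given family.

    When no link is given, fall back to the family's default instead of
    a single global default such as "identity".
    """
    if family not in SUPPORTED_LINKS:
        raise ValueError("unsupported family: %s" % family)
    links = SUPPORTED_LINKS[family]
    if link is None:
        return links[0]
    if link not in links:
        raise ValueError("link %s is not valid for family %s" % (link, family))
    return link


if __name__ == "__main__":
    print(resolve_link("binomial"))      # link left unset by the user
    print(resolve_link("gamma", "log"))  # link set explicitly
```

This is why a Python-side `_setDefault(link="identity")` would be misleading: it would pin `identity` even when the chosen family has a different default.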



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-204452107
  
    Sorry about the late response. Yes, I will pick this up today.
    
    On Fri, Apr 1, 2016, 09:02 Yanbo Liang <no...@github.com> wrote:
    
    > @vectorijk <https://github.com/vectorijk> Do you have time to update this
    > PR? If not, I can help.
    >




[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11468



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r58511272
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -934,6 +935,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver, JavaMLWritable, JavaMLReadable):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    is listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (1.0, Vectors.dense(1.0, 0.0)),
    +    ...     (1.0, Vectors.dense(1.0, 2.0)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression(family="gaussian", link="identity")
    +    >>> model = glr.fit(df)
    +    >>> abs(model.transform(df).head().prediction - 1.0) < 0.001
    +    True
    +    >>> model.coefficients
    +    DenseVector([0.0, 0.0])
    +    >>> abs(model.intercept - 1.0) < 0.001
    +    True
    +    >>> glr_path = temp_path + "/glr"
    +    >>> glr.save(glr_path)
    +    >>> glr2 = GeneralizedLinearRegression.load(glr_path)
    +    >>> glr.getFamily() == glr2.getFamily()
    +    True
    +    >>> model_path = temp_path + "/glr_model"
    +    >>> model.save(model_path)
    +    >>> model2 = GeneralizedLinearRegressionModel.load(model_path)
    +    >>> abs(model.intercept - model2.intercept) < 0.001
    +    True
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    family = Param(Params._dummy(), "family", "The name of family which is a description of " +
    +                   "the error distribution to be used in the model. Supported options: " +
    +                   "gaussian(default), binomial, poisson and gamma.")
    +    link = Param(Params._dummy(), "link", "The name of link function which provides the " +
    +                 "relationship between the linear predictor and the mean of the distribution " +
    +                 "function. Supported options: identity, log, inverse, logit, probit, cloglog " +
    +                 "and sqrt.")
    +
    +    @keyword_only
    +    def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
    +                 family="gaussian", link="identity", fitIntercept=True, maxIter=25, tol=1e-6,
    +                 regParam=0.0, weightCol=None, solver="irls"):
    +        """
    +        __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \
    +                 family="gaussian", link="identity", fitIntercept=True, maxIter=25, tol=1e-6, \
    +                 regParam=0.0, weightCol=None, solver="irls")
    +        """
    +        super(GeneralizedLinearRegression, self).__init__()
    +        self._java_obj = self._new_java_obj(
    +            "org.apache.spark.ml.regression.GeneralizedLinearRegression", self.uid)
    +        self._setDefault(family="gaussian", maxIter=25, tol=1e-6, regParam=0.0, solver="irls")
    --- End diff --
    
    @yanboliang I have addressed what you mentioned. Could you review this again? Thanks!



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-209043366
  
    Merged into master. Thanks!



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205728072
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-208558167
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by vectorijk <gi...@git.apache.org>.

Github user vectorijk commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-207689752
  
    @yanboliang Thanks! I have addressed your comments.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r54774938
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -857,6 +858,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    is listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (17.05224, Vectors.dense(3.55954, 11.19528)),
    --- End diff --
    
    Shall we use a simple example? This appears in the API doc.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-208558168
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55534/



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-191231630
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-208558037
  
    **[Test build #55534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55534/consoleFull)** for PR 11468 at commit [`4199c93`](https://github.com/apache/spark/commit/4199c93997b4b97cab03e1adb9019789f9a56673).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11468#issuecomment-205728037
  
    **[Test build #54974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54974/consoleFull)** for PR 11468 at commit [`822c844`](https://github.com/apache/spark/commit/822c8444a21316fd56831448dd4dd6b594a2672c).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class GeneralizedLinearRegressionModel(JavaModel, JavaMLWritable, JavaMLReadable):`



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r54834511
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -857,6 +858,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    is listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (17.05224, Vectors.dense(3.55954, 11.19528)),
    +    ...     (13.46161, Vectors.dense(2.34561, 9.65407)),
    +    ...     (17.13384, Vectors.dense(3.37980, 12.03069)),
    +    ...     (13.84938, Vectors.dense(2.51969, 9.64902)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression()
    +    >>> model = glr.setFamily("gaussian").setLink("identity").fit(df)
    +    >>> model.transform(df).show()
    +    +--------+------------------+------------------+
    +    |   label|          features|        prediction|
    +    +--------+------------------+------------------+
    +    |17.05224|[3.55954,11.19528]|17.052776698886376|
    +    |13.46161| [2.34561,9.65407]|13.463078911930246|
    +    |17.13384| [3.3798,12.03069]| 17.13348844246882|
    +    |13.84938| [2.51969,9.64902]|13.847725946714558|
    +    +--------+------------------+------------------+
    +    ...
    +    >>> model.coefficients
    +    DenseVector([2.2263, 0.5756])
    +    >>> model.intercept
    +    2.6841196897757795
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    family = Param(Params._dummy(), "family", "The name of family which is a description of " +
    +                   "the error distribution to be used in the model. Supported options: " +
    +                   "gaussian(default), binomial, poisson and gamma.")
    +    link = Param(Params._dummy(), "link", "The name of link function which provides the " +
    +                 "relationship between the linear predictor and the mean of the distribution " +
    +                 "function. Supported options: identity, log, inverse, logit, probit, cloglog " +
    +                 "and sqrt.")
    +
    +    @keyword_only
    +    def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None,
    +                 solver="irls"):
    +        """
    +        __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \
    +                 fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, \
    +                 solver="irls")
    +        """
    +        super(GeneralizedLinearRegression, self).__init__()
    +        self._java_obj = self._new_java_obj(
    +            "org.apache.spark.ml.regression.GeneralizedLinearRegression", self.uid)
    +        self._setDefault(family="gaussian", link="identity")
    --- End diff --
    
    Here we should also set defaults for ```maxIter```, ```regParam```, ```tol```, ```weightCol``` and ```solver```. PySpark stores the default value of each param in ```_defaultParamMap``` and the user-specified value in ```_paramMap```.
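
The split between default values and user-specified values can be illustrated with a minimal sketch. `ToyParams` below is a hypothetical stand-in, not the real `pyspark.ml.param.Params` class: it borrows the attribute and method names mentioned in the comment above but keys the maps by plain strings instead of `Param` objects, and the default values are copied from the `__init__` signature in the diff.

```python
class ToyParams:
    """Hypothetical stand-in for a PySpark Params object (not the real class)."""

    def __init__(self):
        self._defaultParamMap = {}  # defaults registered by the estimator
        self._paramMap = {}         # values explicitly set by the user

    def _setDefault(self, **kwargs):
        self._defaultParamMap.update(kwargs)
        return self

    def _set(self, **kwargs):
        self._paramMap.update(kwargs)
        return self

    def isSet(self, name):
        # True only for values the user supplied explicitly.
        return name in self._paramMap

    def getOrDefault(self, name):
        # A user-supplied value wins; otherwise fall back to the default.
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]


glr = ToyParams()
# Register every default up front, as the review comment asks.
glr._setDefault(family="gaussian", maxIter=25, tol=1e-6, regParam=0.0, solver="irls")
# Only the value the caller actually overrides lands in _paramMap.
glr._set(maxIter=50)
```

With this layout, a param that was never registered as a default and never set by the user has no value to fall back on, which is presumably why the comment asks for all of the listed defaults to be set.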



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r58976307
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -934,6 +935,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver, JavaMLWritable, JavaMLReadable):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    is listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (1.0, Vectors.dense(1.0, 0.0)),
    +    ...     (1.0, Vectors.dense(1.0, 2.0)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression(family="gaussian", link="identity")
    +    >>> model = glr.fit(df)
    +    >>> abs(model.transform(df).head().prediction - 1.0) < 0.001
    +    True
    +    >>> model.coefficients
    +    DenseVector([0.0, 0.0])
    +    >>> abs(model.intercept - 1.0) < 0.001
    +    True
    +    >>> glr_path = temp_path + "/glr"
    +    >>> glr.save(glr_path)
    +    >>> glr2 = GeneralizedLinearRegression.load(glr_path)
    +    >>> glr.getFamily() == glr2.getFamily()
    +    True
    +    >>> model_path = temp_path + "/glr_model"
    +    >>> model.save(model_path)
    +    >>> model2 = GeneralizedLinearRegressionModel.load(model_path)
    +    >>> abs(model.intercept - model2.intercept) < 0.001
    +    True
    --- End diff --
    
    This should be ```model.intercept == model2.intercept```, because the two values are identical right after save/load.
    Maybe we should also check ```model.coefficients[0] == model2.coefficients[0]```, but the coefficients in your test case are all zeros. I think it's better to change the test data so that it produces non-zero coefficients.
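
A small sketch of why exact equality is the stronger check after a save/load round trip. It does not use Spark: `roundtrip` is a hypothetical helper that packs a float into an 8-byte IEEE 754 double and reads it back, standing in for a lossless binary persistence format. The intercept value is taken from the earlier doctest in this thread.

```python
import struct


def roundtrip(value):
    """Serialize a float as an 8-byte IEEE 754 double and read it back."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


intercept = 2.6841196897757795
restored = roundtrip(intercept)

# A lossless round trip returns the identical double, so == is safe here.
exact_match = restored == intercept

# A tolerance check would also accept a value that was silently altered.
corrupted = intercept + 1e-4
tolerance_accepts_corrupted = abs(intercept - corrupted) < 0.001
exact_rejects_corrupted = corrupted != intercept

# All-zero coefficients cannot distinguish a correct load from an empty one.
saved_zero_coefficients = [0.0, 0.0]
freshly_initialized = [0.0, 0.0]
zero_test_is_uninformative = saved_zero_coefficients == freshly_initialized
```

The last comparison shows the second point of the comment: when every coefficient is zero, a loader that returned a freshly initialized vector would pass the same check, so non-zero test data gives a more meaningful assertion.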



[GitHub] spark pull request: [SPARK-13597][PySpark][ML] Python API for Gene...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11468#discussion_r58975761
  
    --- Diff: python/pyspark/ml/regression.py ---
    @@ -934,6 +935,146 @@ def predict(self, features):
             return self._call_java("predict", features)
     
     
    +@inherit_doc
    +class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, HasPredictionCol,
    +                                  HasFitIntercept, HasMaxIter, HasTol, HasRegParam, HasWeightCol,
    +                                  HasSolver, JavaMLWritable, JavaMLReadable):
    +    """
    +    Generalized Linear Regression.
    +
    +    Fit a Generalized Linear Model specified by giving a symbolic description of the linear
    +    predictor (link function) and a description of the error distribution (family). It supports
    +    "gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
    +    is listed below. The first link function of each family is the default one.
    +    - "gaussian" -> "identity", "log", "inverse"
    +    - "binomial" -> "logit", "probit", "cloglog"
    +    - "poisson"  -> "log", "identity", "sqrt"
    +    - "gamma"    -> "inverse", "identity", "log"
    +
    +    .. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_
    +
    +    >>> from pyspark.mllib.linalg import Vectors
    +    >>> df = sqlContext.createDataFrame([
    +    ...     (1.0, Vectors.dense(1.0, 0.0)),
    +    ...     (1.0, Vectors.dense(1.0, 2.0)),], ["label", "features"])
    +    >>> glr = GeneralizedLinearRegression(family="gaussian", link="identity")
    +    >>> model = glr.fit(df)
    +    >>> abs(model.transform(df).head().prediction - 1.0) < 0.001
    +    True
    +    >>> model.coefficients
    +    DenseVector([0.0, 0.0])
    +    >>> abs(model.intercept - 1.0) < 0.001
    +    True
    +    >>> glr_path = temp_path + "/glr"
    +    >>> glr.save(glr_path)
    +    >>> glr2 = GeneralizedLinearRegression.load(glr_path)
    +    >>> glr.getFamily() == glr2.getFamily()
    +    True
    +    >>> model_path = temp_path + "/glr_model"
    +    >>> model.save(model_path)
    +    >>> model2 = GeneralizedLinearRegressionModel.load(model_path)
    +    >>> abs(model.intercept - model2.intercept) < 0.001
    +    True
    +
    +    .. versionadded:: 2.0.0
    +    """
    +
    +    family = Param(Params._dummy(), "family", "The name of family which is a description of " +
    +                   "the error distribution to be used in the model. Supported options: " +
    +                   "gaussian(default), binomial, poisson and gamma.")
    +    link = Param(Params._dummy(), "link", "The name of link function which provides the " +
    +                 "relationship between the linear predictor and the mean of the distribution " +
    +                 "function. Supported options: identity, log, inverse, logit, probit, cloglog " +
    +                 "and sqrt.")
    +
    +    @keyword_only
    +    def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
    +                 family="gaussian", link="identity", fitIntercept=True, maxIter=25, tol=1e-6,
    --- End diff --
    
    Use ```link=None``` as the default here, and update the doc and ```setParams``` accordingly.
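
One plausible reading of this request, given the docstring's statement that the first link function of each family is the default one, is that a hard-coded `link="identity"` would be wrong for every family except gaussian. The sketch below is an assumption about the intent, not code from the patch: `resolve_link` and `VALID_LINKS` are hypothetical names, and the table is copied from the docstring in the diff.

```python
# Valid links per family, first entry is the default (from the docstring above).
VALID_LINKS = {
    "gaussian": ["identity", "log", "inverse"],
    "binomial": ["logit", "probit", "cloglog"],
    "poisson": ["log", "identity", "sqrt"],
    "gamma": ["inverse", "identity", "log"],
}


def resolve_link(family, link=None):
    """Return the link to use, falling back to the family's default when link is None."""
    if family not in VALID_LINKS:
        raise ValueError("Unsupported family: %s" % family)
    if link is None:
        # No explicit link: pick the first (default) link for this family.
        return VALID_LINKS[family][0]
    if link not in VALID_LINKS[family]:
        raise ValueError("Link %s is not valid for family %s" % (link, family))
    return link
```

For example, `resolve_link("poisson")` picks `"log"`, while `resolve_link("gaussian", "log")` keeps the explicit choice. In the actual patch the fallback may well happen on the Scala side, with `None` on the Python side simply meaning that the param is left unset.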

