Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/03/16 14:22:25 UTC

[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/17316

    [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

    Add Python wrapper for `Imputer` feature transformer.
    
    ## How was this patch tested?
    
    New doctests and a tweak to PySpark ML `tests.py`.
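
The docstring quoted in the review comments of this thread describes the transformer as replacing missing values with the mean (or median) of the non-missing values in the same column. As a rough illustration only, the mean strategy can be sketched in plain Python. The helper names `fit_mean_surrogates` and `impute` are hypothetical and are not part of PySpark; the sketch assumes NaN and None mark missing values and does not cover the median strategy or a custom `missingValue`.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or math.isnan(value)


def fit_mean_surrogates(rows, columns):
    """Return {column name: mean of that column's non-missing values}."""
    surrogates = {}
    for index, name in enumerate(columns):
        present = [row[index] for row in rows if not is_missing(row[index])]
        # A column with no observed values would raise ZeroDivisionError here.
        surrogates[name] = sum(present) / len(present)
    return surrogates


def impute(rows, columns, surrogates):
    """Replace each missing entry with the surrogate of its column."""
    return [
        tuple(surrogates[name] if is_missing(value) else value
              for name, value in zip(columns, row))
        for row in rows
    ]


nan = float("nan")
# Same data as the doctest in the PR's docstring.
rows = [(1.0, nan), (2.0, nan), (nan, 3.0), (4.0, 4.0), (5.0, 5.0)]
columns = ["a", "b"]

surrogates = fit_mean_surrogates(rows, columns)
filled = impute(rows, columns, surrogates)
```

With the doctest data, the surrogates come out as 3.0 for `a` and 4.0 for `b`, which agrees with the `surrogateDF` output shown in the docstring.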


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark SPARK-15040-pyspark-imputer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17316.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17316
    
----
commit 5efe88953577fcb155f9e1c787e42d0e79841159
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-03-16T14:16:56Z

    Pyspark Imputer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17316


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74672/testReport)** for PR 17316 at commit [`325c9cf`](https://github.com/apache/spark/commit/325c9cf5e82f4c01fd5bd2ea93ba7ab24b2af2c2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74667/testReport)** for PR 17316 at commit [`5efe889`](https://github.com/apache/spark/commit/5efe88953577fcb155f9e1c787e42d0e79841159).


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74669/testReport)** for PR 17316 at commit [`5e53e05`](https://github.com/apache/spark/commit/5e53e0562682a9a8a4259f21db5e2bd17372ce9d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    cc @hhbyyh 


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #75012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75012/testReport)** for PR 17316 at commit [`7fd17dd`](https://github.com/apache/spark/commit/7fd17dd43441b2c7212f964efd921e8c2d429a9b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17316#discussion_r106490851
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -871,6 +872,164 @@ def idf(self):
     
     
     @inherit_doc
    +class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Imputation estimator for completing missing values, either using the mean or the median
    +    of the column in which the missing values are located. The input column should be of
    +    DoubleType or FloatType. Currently Imputer does not support categorical features and
    +    possibly creates incorrect values for a categorical feature.
    +
    +    Note that the mean/median value is computed after filtering out missing values.
    +    All Null values in the input column are treated as missing, and so are also imputed. For
    +    computing median, :py:meth:`approxQuantile` is used with a relative error of 0.001.
    +
    +    >>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
    +    ...                             (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
    +    >>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
    +    >>> model = imputer.fit(df)
    +    >>> model.surrogateDF.show()
    +    +---+---+
    +    |  a|  b|
    +    +---+---+
    +    |3.0|4.0|
    +    +---+---+
    +    ...
    +    >>> model.transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  1.0|  4.0|
    +    |2.0|NaN|  2.0|  4.0|
    +    |NaN|3.0|  3.0|  3.0|
    +    ...
    +    >>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  4.0|  NaN|
    +    ...
    +    >>> imputerPath = temp_path + "/imputer"
    +    >>> imputer.save(imputerPath)
    +    >>> loadedImputer = Imputer.load(imputerPath)
    +    >>> loadedImputer.getStrategy() == imputer.getStrategy()
    +    True
    +    >>> loadedImputer.getMissingValue()
    +    1.0
    +    >>> modelPath = temp_path + "/imputer-model"
    +    >>> model.save(modelPath)
    +    >>> loadedModel = ImputerModel.load(modelPath)
    +    >>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a
    +    True
    +
    +    .. versionadded:: 2.2.0
    +    """
    +
    +    outputCols = Param(Params._dummy(), "outputCols",
    +                       "output column names.", typeConverter=TypeConverters.toListString)
    +
    +    strategy = Param(Params._dummy(), "strategy",
    +                     "strategy for imputation. If mean, then replace missing values using the mean "
    +                     "value of the feature. If median, then replace missing values using the "
    +                     "median value of the feature.",
    +                     typeConverter=TypeConverters.toString)
    +
    +    missingValue = Param(Params._dummy(), "missingValue",
    +                         "The placeholder for the missing values. All occurrences of missingValue "
    +                         "will be imputed.", typeConverter=TypeConverters.toFloat)
    +
    +    @keyword_only
    +    def __init__(self, strategy="mean", missingValue=float("nan"), inputCols=None,
    +                 outputCols=None):
    +        """
    +        __init__(self, strategy="mean", missingValue=float("nan"), inputCols=None, \
    +                 outputCols=None):
    +        """
    +        super(Imputer, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.Imputer", self.uid)
    +        self._setDefault(strategy="mean", missingValue=float("nan"))
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.2.0")
    +    def setParams(self, strategy="mean", missingValue=float("nan"), inputCols=None,
    +                  outputCols=None):
    +        """
    +        setParams(self, strategy="mean", missingValue=float("nan"), inputCols=None, \
    +                  outputCols=None)
    +        Sets params for this Imputer.
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.2.0")
    +    def setOutputCols(self, value):
    +        """
    +        Sets the value of :py:attr:`outputCols`.
    +        """
    +        return self._set(outputCols=value)
    +
    +    @since("2.2.0")
    +    def getOutputCols(self):
    +        """
    +        Gets the value of :py:attr:`outputCols` or its default value.
    +        """
    +        return self.getOrDefault(self.outputCols)
    --- End diff --
    
    This reminds me that we should add
    ```
        require(get(inputCols).isDefined, "Input cols must be defined first.")
        require(get(outputCols).isDefined, "Output cols must be defined first.")
    ```
    to `transformSchema`.
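
For illustration, the kind of guard suggested above could look roughly like this in plain Python. This is a sketch under assumptions, not the Scala `transformSchema` change itself: `validate_imputer_columns` is a hypothetical helper, and raising `ValueError` is a stand-in for Scala's `require`.

```python
def validate_imputer_columns(input_cols, output_cols):
    """Fail fast when either column list has not been set.

    Hypothetical helper: mirrors the two `require` checks proposed in the
    review comment, using ValueError in place of Scala's `require`.
    """
    if not input_cols:
        raise ValueError("Input cols must be defined first.")
    if not output_cols:
        raise ValueError("Output cols must be defined first.")


# Both lists set: the check passes silently.
validate_imputer_columns(["a", "b"], ["out_a", "out_b"])

# Output columns missing: the check reports which parameter is unset.
try:
    validate_imputer_columns(["a", "b"], None)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Checking the parameters up front would give users a clear message instead of a failure deeper in the fit or transform path.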


[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17316#discussion_r106489373
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -871,6 +872,164 @@ def idf(self):
     
     
     @inherit_doc
    +class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Imputation estimator for completing missing values, either using the mean or the median
    +    of the column in which the missing values are located. The input column should be of
    +    DoubleType or FloatType. Currently Imputer does not support categorical features and
    +    possibly creates incorrect values for a categorical feature.
    +
    +    Note that the mean/median value is computed after filtering out missing values.
    +    All Null values in the input column are treated as missing, and so are also imputed. For
    +    computing median, :py:meth:`approxQuantile` is used with a relative error of 0.001.
    +
    +    >>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
    +    ...                             (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
    +    >>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
    +    >>> model = imputer.fit(df)
    +    >>> model.surrogateDF.show()
    +    +---+---+
    +    |  a|  b|
    +    +---+---+
    +    |3.0|4.0|
    +    +---+---+
    +    ...
    +    >>> model.transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  1.0|  4.0|
    +    |2.0|NaN|  2.0|  4.0|
    +    |NaN|3.0|  3.0|  3.0|
    +    ...
    +    >>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  4.0|  NaN|
    +    ...
    +    >>> imputerPath = temp_path + "/imputer"
    +    >>> imputer.save(imputerPath)
    +    >>> loadedImputer = Imputer.load(imputerPath)
    +    >>> loadedImputer.getStrategy() == imputer.getStrategy()
    +    True
    +    >>> loadedImputer.getMissingValue()
    +    1.0
    +    >>> modelPath = temp_path + "/imputer-model"
    +    >>> model.save(modelPath)
    +    >>> loadedModel = ImputerModel.load(modelPath)
    +    >>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a
    +    True
    +
    +    .. versionadded:: 2.2.0
    +    """
    +
    +    outputCols = Param(Params._dummy(), "outputCols",
    +                       "output column names.", typeConverter=TypeConverters.toListString)
    +
    +    strategy = Param(Params._dummy(), "strategy",
    +                     "strategy for imputation. If mean, then replace missing values using the mean "
    +                     "value of the feature. If median, then replace missing values using the "
    +                     "median value of the feature.",
    +                     typeConverter=TypeConverters.toString)
    +
    +    missingValue = Param(Params._dummy(), "missingValue",
    +                         "The placeholder for the missing values. All occurrences of missingValue "
    --- End diff --
    
    values -> value.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74874/testReport)** for PR 17316 at commit [`5c272b5`](https://github.com/apache/spark/commit/5c272b5b7fb0988a8344e56ffc4e124128112879).


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #75012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75012/testReport)** for PR 17316 at commit [`7fd17dd`](https://github.com/apache/spark/commit/7fd17dd43441b2c7212f964efd921e8c2d429a9b).


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74672/testReport)** for PR 17316 at commit [`325c9cf`](https://github.com/apache/spark/commit/325c9cf5e82f4c01fd5bd2ea93ba7ab24b2af2c2).


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74669/testReport)** for PR 17316 at commit [`5e53e05`](https://github.com/apache/spark/commit/5e53e0562682a9a8a4259f21db5e2bd17372ce9d).


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74874/testReport)** for PR 17316 at commit [`5c272b5`](https://github.com/apache/spark/commit/5c272b5b7fb0988a8344e56ffc4e124128112879).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Merged to master.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74667/
    Test FAILed.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74672/
    Test PASSed.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75012/
    Test PASSed.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    **[Test build #74667 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74667/testReport)** for PR 17316 at commit [`5efe889`](https://github.com/apache/spark/commit/5efe88953577fcb155f9e1c787e42d0e79841159).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):`
      * `class ImputerModel(JavaModel, JavaMLReadable, JavaMLWritable):`


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74874/
    Test PASSed.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74669/
    Test PASSed.


[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17316#discussion_r106615221
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -871,6 +872,164 @@ def idf(self):
     
     
     @inherit_doc
    +class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Imputation estimator for completing missing values, either using the mean or the median
    +    of the column in which the missing values are located. The input column should be of
    +    DoubleType or FloatType. Currently Imputer does not support categorical features and
    +    possibly creates incorrect values for a categorical feature.
    +
    +    Note that the mean/median value is computed after filtering out missing values.
    +    All Null values in the input column are treated as missing, and so are also imputed. For
    +    computing median, :py:meth:`approxQuantile` is used with a relative error of 0.001.
    +
    +    >>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
    +    ...                             (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
    +    >>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
    +    >>> model = imputer.fit(df)
    +    >>> model.surrogateDF.show()
    +    +---+---+
    +    |  a|  b|
    +    +---+---+
    +    |3.0|4.0|
    +    +---+---+
    +    ...
    +    >>> model.transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  1.0|  4.0|
    +    |2.0|NaN|  2.0|  4.0|
    +    |NaN|3.0|  3.0|  3.0|
    +    ...
    +    >>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  4.0|  NaN|
    +    ...
    +    >>> imputerPath = temp_path + "/imputer"
    +    >>> imputer.save(imputerPath)
    +    >>> loadedImputer = Imputer.load(imputerPath)
    +    >>> loadedImputer.getStrategy() == imputer.getStrategy()
    +    True
    +    >>> loadedImputer.getMissingValue()
    +    1.0
    +    >>> modelPath = temp_path + "/imputer-model"
    +    >>> model.save(modelPath)
    +    >>> loadedModel = ImputerModel.load(modelPath)
    +    >>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a
    +    True
    +
    +    .. versionadded:: 2.2.0
    +    """
    +
    +    outputCols = Param(Params._dummy(), "outputCols",
    +                       "output column names.", typeConverter=TypeConverters.toListString)
    +
    +    strategy = Param(Params._dummy(), "strategy",
    +                     "strategy for imputation. If mean, then replace missing values using the mean "
    +                     "value of the feature. If median, then replace missing values using the "
    +                     "median value of the feature.",
    +                     typeConverter=TypeConverters.toString)
    +
    +    missingValue = Param(Params._dummy(), "missingValue",
    +                         "The placeholder for the missing values. All occurrences of missingValue "
    +                         "will be imputed.", typeConverter=TypeConverters.toFloat)
    +
    +    @keyword_only
    +    def __init__(self, strategy="mean", missingValue=float("nan"), inputCols=None,
    +                 outputCols=None):
    +        """
    +        __init__(self, strategy="mean", missingValue=float("nan"), inputCols=None, \
    +                 outputCols=None):
    +        """
    +        super(Imputer, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.Imputer", self.uid)
    +        self._setDefault(strategy="mean", missingValue=float("nan"))
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.2.0")
    +    def setParams(self, strategy="mean", missingValue=float("nan"), inputCols=None,
    +                  outputCols=None):
    +        """
    +        setParams(self, strategy="mean", missingValue=float("nan"), inputCols=None, \
    +                  outputCols=None)
    +        Sets params for this Imputer.
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.2.0")
    +    def setOutputCols(self, value):
    +        """
    +        Sets the value of :py:attr:`outputCols`.
    +        """
    +        return self._set(outputCols=value)
    +
    +    @since("2.2.0")
    +    def getOutputCols(self):
    +        """
    +        Gets the value of :py:attr:`outputCols` or its default value.
    +        """
    +        return self.getOrDefault(self.outputCols)
    --- End diff --
    
    Do we really need that? The first call to `$(inputCols)` in `validateAndTransformSchema` will just throw an error with `Failed to find a default value ...`
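
    The failure mode described above (looking up a param that has neither an explicit value nor a default) can be sketched in plain Python. This is only an illustration, not the PySpark or Scala implementation: the class `TinyParams` and its method names are hypothetical, and the error text is borrowed from the comment above.

```python
class TinyParams:
    """Hypothetical stand-in for a params holder; not the Spark API."""

    def __init__(self):
        self._values = {}    # params set explicitly by the user
        self._defaults = {}  # params that ship with a default value

    def set_default(self, **kwargs):
        self._defaults.update(kwargs)
        return self

    def set(self, **kwargs):
        self._values.update(kwargs)
        return self

    def get_or_default(self, name):
        # An explicit value wins, then the default; otherwise fail loudly.
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError("Failed to find a default value for " + name)


# `strategy` has a default, `inputCols` does not (mirroring the diff above).
params = TinyParams().set_default(strategy="mean")
```

    Under these assumptions, `params.get_or_default("strategy")` returns the default, while `params.get_or_default("inputCols")` raises until `inputCols` is set explicitly.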


[GitHub] spark issue #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17316
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17316#discussion_r106488785
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -871,6 +872,164 @@ def idf(self):
     
     
     @inherit_doc
    +class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Imputation estimator for completing missing values, either using the mean or the median
    +    of the column in which the missing values are located. The input column should be of
    --- End diff --
    
    Nit: shall we change "column" to "columns" throughout, since we now support multiple columns?
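
    To make the multi-column behaviour in the docstring concrete, here is a minimal plain-Python sketch of per-column imputation. It is an assumption-laden illustration rather than the Imputer implementation: the helper names `is_missing`, `fit_surrogates` and `transform` are hypothetical, columns are plain lists, and the median is exact, whereas the docstring says the real estimator uses `approxQuantile`.

```python
import math
from statistics import mean, median


def is_missing(value, missing_value):
    # None is always treated as missing; NaN needs isnan since NaN != NaN.
    if value is None:
        return True
    if math.isnan(missing_value):
        return math.isnan(value)
    return value == missing_value


def fit_surrogates(columns, strategy="mean", missing_value=float("nan")):
    # One surrogate per column, computed after filtering out None, NaN
    # and the missing_value placeholder. Raises if nothing is left.
    stat = mean if strategy == "mean" else median
    surrogates = {}
    for name, values in columns.items():
        present = [v for v in values
                   if v is not None and not math.isnan(v) and v != missing_value]
        surrogates[name] = stat(present)
    return surrogates


def transform(columns, surrogates, missing_value=float("nan")):
    # Replace only None and occurrences of missing_value; keep the rest.
    return {name: [surrogates[name] if is_missing(v, missing_value) else v
                   for v in values]
            for name, values in columns.items()}


nan = float("nan")
# Same data as the doctest in the diff above.
columns = {"a": [1.0, 2.0, nan, 4.0, 5.0], "b": [nan, nan, 3.0, 4.0, 5.0]}
surrogates = fit_surrogates(columns)  # {"a": 3.0, "b": 4.0}
imputed = transform(columns, surrogates)
```

    With `strategy="median"` and `missing_value=1.0`, the same helpers replace the 1.0 in column `a` with 4.0 and leave the NaN entries in column `b` untouched, which agrees with the first row of the second doctest table.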


[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17316#discussion_r106614985
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -871,6 +872,164 @@ def idf(self):
     
     
     @inherit_doc
    +class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Imputation estimator for completing missing values, either using the mean or the median
    +    of the column in which the missing values are located. The input column should be of
    --- End diff --
    
    Will do, for both the Python and Scala docs.

