Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/11/21 03:30:45 UTC

[GitHub] spark pull request #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEn...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/23100

    [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

    ## What changes were proposed in this pull request?
    
    We deprecated OneHotEncoder in Spark 2.3.0 and introduced OneHotEncoderEstimator. In 3.0.0, we remove the deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder.
    
    ## How was this patch tested?
    
    Existing tests.
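
For readers migrating user code, the rename described above could look roughly like this in Scala. This is a minimal sketch, assuming Spark 3.0 with `spark-mllib` on the classpath; the object name, the toy data, and the column names are made up for illustration. The renamed encoder is an `Estimator`, so it has to be fit before it can transform.

```scala
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object; only OneHotEncoder and SparkSession come from Spark.
object OneHotEncoderRenameSketch {

  /** Fits the renamed encoder on toy data and returns the transformed frame. */
  def encodeToy(spark: SparkSession): DataFrame = {
    // Made-up data: two columns of category indices.
    val df = spark.createDataFrame(Seq(
      (0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (0.0, 2.0)
    )).toDF("categoryIndex1", "categoryIndex2")

    // Before 3.0 this class was called OneHotEncoderEstimator.
    val encoder = new OneHotEncoder()
      .setInputCols(Array("categoryIndex1", "categoryIndex2"))
      .setOutputCols(Array("categoryVec1", "categoryVec2"))

    // fit() learns the category sizes; transform() appends the encoded vectors.
    encoder.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("OneHotEncoderRenameSketch")
      .getOrCreate()
    encodeToy(spark).show()
    spark.stop()
  }
}
```

Code written against `OneHotEncoderEstimator` should mostly need only the class name changed; code that used the old `OneHotEncoder` as a plain `Transformer` additionally needs the `fit` step.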

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 remove_one_hot_encoder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23100
    
----
commit cf6da4b72d04ab109739400dcbf6d75a9d34625e
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-11-21T03:25:18Z

    Remove OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99152/testReport)** for PR 23100 at commit [`40de38f`](https://github.com/apache/spark/commit/40de38f486d40a9517f9b3060bbd3cfd23d20986).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99367/testReport)** for PR 23100 at commit [`a451756`](https://github.com/apache/spark/commit/a451756f393c107963b893ba4a74c1d6ade33dd0).


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5256/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99102 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99102/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99152/testReport)** for PR 23100 at commit [`40de38f`](https://github.com/apache/spark/commit/40de38f486d40a9517f9b3060bbd3cfd23d20986).


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99161/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Yes, this diff just reflects renaming OneHotEncoderEstimator to OneHotEncoder. Besides that, this PR also changes the related documentation and example code.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99168/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99168 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99168/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    cc @jkbradley @MLnick @dbtsai 


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5269/
    Test PASSed.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236411677
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    --- End diff --
    
    Or we can file two PRs: one for removing the old `OneHotEncoder`, and the other for the renaming.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99154/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99161/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    retest this please.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99095 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99095/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Thanks all! I will create a follow-up to add an `OneHotEncoderEstimator` alias later.
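
One way such an alias could be expressed in Scala is a deprecated type alias. The sketch below is purely illustrative: the `OneHotEncoder` class in it is a stand-in rather than Spark's class, the object name and `uid` value are made up, and nothing here is taken from the actual follow-up.

```scala
// Hypothetical sketch of keeping an old class name alive as an alias.
object AliasSketch {

  // Stand-in for the renamed class; not Spark's real OneHotEncoder.
  class OneHotEncoder(val uid: String)

  // The old name still compiles, but callers get a deprecation warning.
  @deprecated("Use OneHotEncoder instead", "3.0.0")
  type OneHotEncoderEstimator = OneHotEncoder

  def main(args: Array[String]): Unit = {
    // Old-style declaration resolves to the new class.
    val enc: OneHotEncoderEstimator = new OneHotEncoder("oneHot_1")
    println(enc.uid)
  }
}
```

A type alias only covers the type name; anything that looks the class up by its string name, such as persisted metadata, would need separate handling.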


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    This was discussed before when we added `OneHotEncoderEstimator` and there is a note at `OneHotEncoder`:
    
    https://github.com/apache/spark/blob/a5925c1631e25c2dcc3c2948cea31e993ce66a97/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L44-L45



---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Changes of this type can really piss some people off. Was there consensus on this?


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99367/
    Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99366/testReport)** for PR 23100 at commit [`9e3dab7`](https://github.com/apache/spark/commit/9e3dab7388a6de18b7cf8ddcc8cc2c73a6efea67).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99367/testReport)** for PR 23100 at commit [`a451756`](https://github.com/apache/spark/commit/a451756f393c107963b893ba4a74c1d6ade33dd0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class OneHotEncoder @Since("3.0.0") (@Since("3.0.0") override val uid: String)`


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    I went through the PR again, and it looks right to me. Merged into master. Thanks!


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236410750
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkException
     import org.apache.spark.annotation.Since
    -import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.{Estimator, Model}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.linalg.Vectors
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql.{DataFrame, Dataset}
    -import org.apache.spark.sql.functions.{col, udf}
    -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
    +import org.apache.spark.sql.expressions.UserDefinedFunction
    +import org.apache.spark.sql.functions.{col, lit, udf}
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
    +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
    +    with HasInputCols with HasOutputCols {
    +
    +  /**
    +   * Param for how to handle invalid data during transform().
    +   * Options are 'keep' (invalid data presented as an extra categorical feature) or
    +   * 'error' (throw an error).
    +   * Note that this Param is only used during transform; during fitting, invalid data
    +   * will result in an error.
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    --- End diff --
    
    As we discussed previously, it's a new class. Should we mark it as `3.0`?


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236471032
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkException
     import org.apache.spark.annotation.Since
    -import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.{Estimator, Model}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.linalg.Vectors
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql.{DataFrame, Dataset}
    -import org.apache.spark.sql.functions.{col, udf}
    -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
    +import org.apache.spark.sql.expressions.UserDefinedFunction
    +import org.apache.spark.sql.functions.{col, lit, udf}
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
    +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
    +    with HasInputCols with HasOutputCols {
    +
    +  /**
    +   * Param for how to handle invalid data during transform().
    +   * Options are 'keep' (invalid data presented as an extra categorical feature) or
    +   * 'error' (throw an error).
    +   * Note that this Param is only used during transform; during fitting, invalid data
    +   * will result in an error.
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    --- End diff --
    
    Yea, looks like we should make it `3.0`.
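
The doc comment quoted in the diff above describes the two `handleInvalid` modes. As a plain-Scala illustration of that description (not Spark's implementation; the object and function names are hypothetical, and options such as dropping the last category are ignored), the difference could be sketched as:

```scala
// Illustrative only: mirrors the doc comment's wording, not Spark's code.
object HandleInvalidSketch {

  /** One-hot encodes `index`, given `numCategories` categories seen during fitting. */
  def encode(index: Int, numCategories: Int, handleInvalid: String): Vector[Double] = {
    val valid = index >= 0 && index < numCategories
    handleInvalid match {
      case "error" =>
        // 'error': an unseen category aborts the transform.
        if (!valid) throw new IllegalArgumentException(s"Unseen category index: $index")
        Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
      case "keep" =>
        // 'keep': one extra slot at the end collects every invalid index.
        val slot = if (valid) index else numCategories
        Vector.tabulate(numCategories + 1)(i => if (i == slot) 1.0 else 0.0)
      case other =>
        throw new IllegalArgumentException(s"Unsupported handleInvalid: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    println(encode(1, 3, "error")) // Vector(0.0, 1.0, 0.0)
    println(encode(5, 3, "keep"))  // Vector(0.0, 0.0, 0.0, 1.0)
  }
}
```

Note that under `keep` the output vector is one element longer than under `error`, which is why the mode matters for any downstream stage that depends on the vector size.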


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Apart from changing the `Since` tag, are there any other comments?


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5205/
    Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99102 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99102/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5216/
    Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99168/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99095/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    It's hard to track the huge diffs caused by the renaming. I didn't go through it line by line. But if the changes are just renaming, the rest LGTM.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99092/testReport)** for PR 23100 at commit [`cf6da4b`](https://github.com/apache/spark/commit/cf6da4b72d04ab109739400dcbf6d75a9d34625e).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5262/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r237010801
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkException
     import org.apache.spark.annotation.Since
    -import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.{Estimator, Model}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.linalg.Vectors
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql.{DataFrame, Dataset}
    -import org.apache.spark.sql.functions.{col, udf}
    -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
    +import org.apache.spark.sql.expressions.UserDefinedFunction
    +import org.apache.spark.sql.functions.{col, lit, udf}
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
    +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
    +    with HasInputCols with HasOutputCols {
    +
    +  /**
    +   * Param for how to handle invalid data during transform().
    +   * Options are 'keep' (invalid data presented as an extra categorical feature) or
    +   * 'error' (throw an error).
    +   * Note that this Param is only used during transform; during fitting, invalid data
    +   * will result in an error.
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    --- End diff --
    
    I changed the `Since` tag of the renamed `OneHotEncoder` to `3.0.0`.
    
    Because `OneHotEncoderBase` is not renamed, I didn't change its `Since` tag.
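
The `handleInvalid` Scaladoc quoted in the diff above can be illustrated with a small plain-Scala sketch. This is not the Spark implementation: `HandleInvalidSketch` and `resolveIndex` are hypothetical names, and the only assumption is the documented behaviour that 'keep' routes invalid values to one extra category while 'error' throws.

```scala
object HandleInvalidSketch {
  // Hypothetical helper: returns the slot a category index would occupy.
  // numCategories stands for the number of categories learned during fitting.
  def resolveIndex(value: Double, numCategories: Int, handleInvalid: String): Int = {
    val valid = value >= 0 && value < numCategories
    if (valid) {
      value.toInt
    } else {
      handleInvalid match {
        // 'keep': invalid data is presented as one extra category after the real ones.
        case "keep" => numCategories
        // 'error': invalid data aborts the transform.
        case "error" =>
          throw new IllegalArgumentException(s"Invalid category index: $value")
        case other =>
          throw new IllegalArgumentException(s"Unsupported handleInvalid: $other")
      }
    }
  }
}
```

With 5 categories, `resolveIndex(7.0, 5, "keep")` returns 5, the extra slot, whereas the same call with "error" throws.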


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99092/testReport)** for PR 23100 at commit [`cf6da4b`](https://github.com/apache/spark/commit/cf6da4b72d04ab109739400dcbf6d75a9d34625e).


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99152/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5257/
    Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5268/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/23100


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99366/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236471495
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    --- End diff --
    
    If it's the same diff, leave it as is


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99366/testReport)** for PR 23100 at commit [`9e3dab7`](https://github.com/apache/spark/commit/9e3dab7388a6de18b7cf8ddcc8cc2c73a6efea67).


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5208/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99095/
    Test FAILed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236410306
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    --- End diff --
    
    I guess once the commits are squashed into one, it will still look like this, without a better history.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    @srowen Thanks for the review. I think there is no R counterpart for `OneHotEncoder`.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5443/
    Test PASSed.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r235886910
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkException
     import org.apache.spark.annotation.Since
    -import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.{Estimator, Model}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.linalg.Vectors
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql.{DataFrame, Dataset}
    -import org.apache.spark.sql.functions.{col, udf}
    -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
    +import org.apache.spark.sql.expressions.UserDefinedFunction
    +import org.apache.spark.sql.functions.{col, lit, udf}
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
    +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
    +    with HasInputCols with HasOutputCols {
    +
    +  /**
    +   * Param for how to handle invalid data during transform().
    +   * Options are 'keep' (invalid data presented as an extra categorical feature) or
    +   * 'error' (throw an error).
    +   * Note that this Param is only used during transform; during fitting, invalid data
    +   * will result in an error.
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
    +    "How to handle invalid data during transform(). " +
    +    "Options are 'keep' (invalid data presented as an extra categorical feature) " +
    +    "or error (throw an error). Note that this Param is only used during transform; " +
    +    "during fitting, invalid data will result in an error.",
    +    ParamValidators.inArray(OneHotEncoder.supportedHandleInvalids))
    +
    +  setDefault(handleInvalid, OneHotEncoder.ERROR_INVALID)
    +
    +  /**
    +   * Whether to drop the last category in the encoded vector (default: true)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val dropLast: BooleanParam =
    +    new BooleanParam(this, "dropLast", "whether to drop the last category")
    +  setDefault(dropLast -> true)
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getDropLast: Boolean = $(dropLast)
    +
    +  protected def validateAndTransformSchema(
    +      schema: StructType,
    +      dropLast: Boolean,
    +      keepInvalid: Boolean): StructType = {
    +    val inputColNames = $(inputCols)
    +    val outputColNames = $(outputCols)
    +
    +    require(inputColNames.length == outputColNames.length,
    +      s"The number of input columns ${inputColNames.length} must be the same as the number of " +
    +        s"output columns ${outputColNames.length}.")
    +
    +    // Input columns must be NumericType.
    +    inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
    +
    +    // Prepares output columns with proper attributes by examining input columns.
    +    val inputFields = $(inputCols).map(schema(_))
    +
    +    val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
    +      OneHotEncoderCommon.transformOutputColumnSchema(
    +        inputField, outputColName, dropLast, keepInvalid)
    +    }
    +    outputFields.foldLeft(schema) { case (newSchema, outputField) =>
    +      SchemaUtils.appendColumn(newSchema, outputField)
    +    }
    +  }
    +}
     
     /**
      * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
      * at most a single one-value per row that indicates the input category index.
      * For example with 5 categories, an input value of 2.0 would map to an output vector of
      * `[0.0, 0.0, 1.0, 0.0]`.
    - * The last category is not included by default (configurable via `OneHotEncoder!.dropLast`
    + * The last category is not included by default (configurable via `dropLast`),
      * because it makes the vector entries sum up to one, and hence linearly dependent.
      * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
      *
      * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
      * The output vectors are sparse.
      *
    + * When `handleInvalid` is configured to 'keep', an extra "category" indicating invalid values is
    + * added as last category. So when `dropLast` is true, invalid values are encoded as all-zeros
    + * vector.
    + *
    + * @note When encoding multi-column by using `inputCols` and `outputCols` params, input/output cols
    + * come in pairs, specified by the order in the arrays, and each pair is treated independently.
    + *
      * @see `StringIndexer` for converting categorical values into category indices
    - * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
    - * will be removed in 3.0.0.
      */
    -@Since("1.4.0")
    -@deprecated("`OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`" +
    -  " will be removed in 3.0.0.", "2.3.0")
    -class OneHotEncoder @Since("1.4.0") (@Since("1.4.0") override val uid: String) extends Transformer
    -  with HasInputCol with HasOutputCol with DefaultParamsWritable {
    +@Since("2.3.0")
    +class OneHotEncoder @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    --- End diff --
    
    Or maybe since 3.0.0? It is a new class. 
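
For readers following the class Scaladoc quoted in the diff above, the `dropLast` behaviour can be sketched in plain Scala. This is only an illustration under stated assumptions: `DropLastSketch` and `encode` are hypothetical names, and the sketch returns a dense `Vector[Double]` for readability, whereas the Scaladoc says the real output vectors are sparse.

```scala
object DropLastSketch {
  // Hypothetical helper: one-hot encodes a valid category index.
  def encode(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    // Dropping the last category shortens the vector by one entry.
    val size = if (dropLast) numCategories - 1 else numCategories
    // If the index falls outside the shortened vector, no entry is set: all zeros.
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }
}
```

With 5 categories and the default `dropLast = true`, `encode(2, 5)` gives `Vector(0.0, 0.0, 1.0, 0.0)` and `encode(4, 5)` gives four zeros, matching the two examples in the Scaladoc.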


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    retest this please...


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    **[Test build #99154 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99154/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r235760329
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkException
     import org.apache.spark.annotation.Since
    -import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.{Estimator, Model}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.linalg.Vectors
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql.{DataFrame, Dataset}
    -import org.apache.spark.sql.functions.{col, udf}
    -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
    +import org.apache.spark.sql.expressions.UserDefinedFunction
    +import org.apache.spark.sql.functions.{col, lit, udf}
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
    +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
    +    with HasInputCols with HasOutputCols {
    +
    +  /**
    +   * Param for how to handle invalid data during transform().
    +   * Options are 'keep' (invalid data presented as an extra categorical feature) or
    +   * 'error' (throw an error).
    +   * Note that this Param is only used during transform; during fitting, invalid data
    +   * will result in an error.
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
    +    "How to handle invalid data during transform(). " +
    +    "Options are 'keep' (invalid data presented as an extra categorical feature) " +
    +    "or error (throw an error). Note that this Param is only used during transform; " +
    +    "during fitting, invalid data will result in an error.",
    +    ParamValidators.inArray(OneHotEncoder.supportedHandleInvalids))
    +
    +  setDefault(handleInvalid, OneHotEncoder.ERROR_INVALID)
    +
    +  /**
    +   * Whether to drop the last category in the encoded vector (default: true)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val dropLast: BooleanParam =
    +    new BooleanParam(this, "dropLast", "whether to drop the last category")
    +  setDefault(dropLast -> true)
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getDropLast: Boolean = $(dropLast)
    +
    +  protected def validateAndTransformSchema(
    +      schema: StructType,
    +      dropLast: Boolean,
    +      keepInvalid: Boolean): StructType = {
    +    val inputColNames = $(inputCols)
    +    val outputColNames = $(outputCols)
    +
    +    require(inputColNames.length == outputColNames.length,
    +      s"The number of input columns ${inputColNames.length} must be the same as the number of " +
    +        s"output columns ${outputColNames.length}.")
    +
    +    // Input columns must be NumericType.
    +    inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
    +
    +    // Prepares output columns with proper attributes by examining input columns.
    +    val inputFields = $(inputCols).map(schema(_))
    +
    +    val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
    +      OneHotEncoderCommon.transformOutputColumnSchema(
    +        inputField, outputColName, dropLast, keepInvalid)
    +    }
    +    outputFields.foldLeft(schema) { case (newSchema, outputField) =>
    +      SchemaUtils.appendColumn(newSchema, outputField)
    +    }
    +  }
    +}
     
     /**
      * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
      * at most a single one-value per row that indicates the input category index.
      * For example with 5 categories, an input value of 2.0 would map to an output vector of
      * `[0.0, 0.0, 1.0, 0.0]`.
    - * The last category is not included by default (configurable via `OneHotEncoder!.dropLast`
    + * The last category is not included by default (configurable via `dropLast`),
      * because it makes the vector entries sum up to one, and hence linearly dependent.
      * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
      *
      * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
      * The output vectors are sparse.
      *
    + * When `handleInvalid` is configured to 'keep', an extra "category" indicating invalid values is
    + * added as last category. So when `dropLast` is true, invalid values are encoded as all-zeros
    + * vector.
    + *
    + * @note When encoding multi-column by using `inputCols` and `outputCols` params, input/output cols
    + * come in pairs, specified by the order in the arrays, and each pair is treated independently.
    + *
      * @see `StringIndexer` for converting categorical values into category indices
    - * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
    - * will be removed in 3.0.0.
      */
    -@Since("1.4.0")
    -@deprecated("`OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`" +
    -  " will be removed in 3.0.0.", "2.3.0")
    -class OneHotEncoder @Since("1.4.0") (@Since("1.4.0") override val uid: String) extends Transformer
    -  with HasInputCol with HasOutputCol with DefaultParamsWritable {
    +@Since("2.3.0")
    +class OneHotEncoder @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    --- End diff --
    
    In this renaming case, I'm not sure whether we should use the `Since` from the old `OneHotEncoder` (`1.4.0`) or from `OneHotEncoderEstimator` (`2.3.0`). For now I use `OneHotEncoderEstimator`'s.
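
The Scaladoc in the diff above also describes how `handleInvalid = 'keep'` interacts with `dropLast`: the invalid-value category is appended last, so `dropLast` removes that slot and invalid values become all-zeros vectors. A rough way to see this is to compute the output vector length for each combination. The names `OutputSizeSketch` and `outputSize` are hypothetical, and the arithmetic is an assumption derived from the quoted documentation rather than from the Spark source.

```scala
object OutputSizeSketch {
  // Hypothetical helper: length of the encoded vector for one column.
  def outputSize(numCategories: Int, dropLast: Boolean, keepInvalid: Boolean): Int = {
    val extraForInvalid = if (keepInvalid) 1 else 0 // 'keep' appends one category
    val dropped = if (dropLast) 1 else 0            // dropLast removes the final category
    numCategories + extraForInvalid - dropped
  }
}
```

For 5 categories this gives 4 with the defaults (`dropLast = true`, 'error'), 5 with `dropLast = true` and 'keep', and 6 with `dropLast = false` and 'keep'. In the second case all five real categories keep a slot and only invalid values map to the zero vector.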


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236097273
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.SparkException
     import org.apache.spark.annotation.Since
    -import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.{Estimator, Model}
     import org.apache.spark.ml.attribute._
     import org.apache.spark.ml.linalg.Vectors
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql.{DataFrame, Dataset}
    -import org.apache.spark.sql.functions.{col, udf}
    -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
    +import org.apache.spark.sql.expressions.UserDefinedFunction
    +import org.apache.spark.sql.functions.{col, lit, udf}
    +import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    +
    +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
    +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
    +    with HasInputCols with HasOutputCols {
    +
    +  /**
    +   * Param for how to handle invalid data during transform().
    +   * Options are 'keep' (invalid data presented as an extra categorical feature) or
    +   * 'error' (throw an error).
    +   * Note that this Param is only used during transform; during fitting, invalid data
    +   * will result in an error.
    +   * Default: "error"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
    +    "How to handle invalid data during transform(). " +
    +    "Options are 'keep' (invalid data presented as an extra categorical feature) " +
    +    "or error (throw an error). Note that this Param is only used during transform; " +
    +    "during fitting, invalid data will result in an error.",
    +    ParamValidators.inArray(OneHotEncoder.supportedHandleInvalids))
    +
    +  setDefault(handleInvalid, OneHotEncoder.ERROR_INVALID)
    +
    +  /**
    +   * Whether to drop the last category in the encoded vector (default: true)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val dropLast: BooleanParam =
    +    new BooleanParam(this, "dropLast", "whether to drop the last category")
    +  setDefault(dropLast -> true)
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getDropLast: Boolean = $(dropLast)
    +
    +  protected def validateAndTransformSchema(
    +      schema: StructType,
    +      dropLast: Boolean,
    +      keepInvalid: Boolean): StructType = {
    +    val inputColNames = $(inputCols)
    +    val outputColNames = $(outputCols)
    +
    +    require(inputColNames.length == outputColNames.length,
    +      s"The number of input columns ${inputColNames.length} must be the same as the number of " +
    +        s"output columns ${outputColNames.length}.")
    +
    +    // Input columns must be NumericType.
    +    inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
    +
    +    // Prepares output columns with proper attributes by examining input columns.
    +    val inputFields = $(inputCols).map(schema(_))
    +
    +    val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
    +      OneHotEncoderCommon.transformOutputColumnSchema(
    +        inputField, outputColName, dropLast, keepInvalid)
    +    }
    +    outputFields.foldLeft(schema) { case (newSchema, outputField) =>
    +      SchemaUtils.appendColumn(newSchema, outputField)
    +    }
    +  }
    +}
     
     /**
      * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
      * at most a single one-value per row that indicates the input category index.
      * For example with 5 categories, an input value of 2.0 would map to an output vector of
      * `[0.0, 0.0, 1.0, 0.0]`.
    - * The last category is not included by default (configurable via `OneHotEncoder!.dropLast`
    + * The last category is not included by default (configurable via `dropLast`),
      * because it makes the vector entries sum up to one, and hence linearly dependent.
      * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
      *
      * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
      * The output vectors are sparse.
      *
    + * When `handleInvalid` is configured to 'keep', an extra "category" indicating invalid values is
    + * added as last category. So when `dropLast` is true, invalid values are encoded as all-zeros
    + * vector.
    + *
    + * @note When encoding multi-column by using `inputCols` and `outputCols` params, input/output cols
    + * come in pairs, specified by the order in the arrays, and each pair is treated independently.
    + *
      * @see `StringIndexer` for converting categorical values into category indices
    - * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
    - * will be removed in 3.0.0.
      */
    -@Since("1.4.0")
    -@deprecated("`OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`" +
    -  " will be removed in 3.0.0.", "2.3.0")
    -class OneHotEncoder @Since("1.4.0") (@Since("1.4.0") override val uid: String) extends Transformer
    -  with HasInputCol with HasOutputCol with DefaultParamsWritable {
    +@Since("2.3.0")
    +class OneHotEncoder @Since("2.3.0") (@Since("2.3.0") override val uid: String)
    --- End diff --
    
    I'd mark this as Since 3.0.0: the class is essentially new under this name, and any similarity to the old class is coincidental.
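
For reference, the encoding rules stated in the Scaladoc above (`dropLast`, and `handleInvalid` set to 'keep' or 'error') can be sketched in plain Scala. This is only an illustration under stated assumptions, not Spark's implementation: `OneHotSketch`, `encode`, `numCategories`, and `keepInvalid` are hypothetical names, and a dense `Vector[Double]` stands in for the sparse vectors Spark actually produces.

```scala
// A minimal sketch of the documented semantics; all names here are hypothetical.
object OneHotSketch {

  /**
   * Encodes a category index as a dense one-hot vector.
   *
   * @param index         category index, as produced by something like StringIndexer
   * @param numCategories number of categories seen during fitting
   * @param dropLast      if true, the last slot is dropped, so the last category is all zeros
   * @param keepInvalid   if true ('keep'), invalid indices get an extra trailing category;
   *                      if false ('error'), they raise an exception
   */
  def encode(
      index: Int,
      numCategories: Int,
      dropLast: Boolean = true,
      keepInvalid: Boolean = false): Vector[Double] = {
    val isInvalid = index < 0 || index >= numCategories
    if (isInvalid && !keepInvalid) {
      throw new IllegalArgumentException(s"Unseen category index: $index")
    }
    // With 'keep', one extra category for invalid values is appended as the last one.
    val totalCategories = if (keepInvalid) numCategories + 1 else numCategories
    val slot = if (isInvalid) numCategories else index
    // Dropping the last slot turns whichever category is last into the all-zeros vector.
    val size = if (dropLast) totalCategories - 1 else totalCategories
    Vector.tabulate(size)(i => if (i == slot) 1.0 else 0.0)
  }
}
```

With 5 categories and the defaults, `OneHotSketch.encode(2, 5)` yields `Vector(0.0, 0.0, 1.0, 0.0)` and `OneHotSketch.encode(4, 5)` yields four zeros, matching the examples in the Scaladoc. With `keepInvalid = true` and `dropLast = true`, an unseen index such as 7 maps to five zeros, because the extra "invalid" category becomes the dropped last one.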


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99102/
    Test PASSed.


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99092/
    Test FAILed.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236469693
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    --- End diff --
    
    Yeah, I tried removing the old `OneHotEncoder` first and then doing the rename. The git diff still looks like this:
    https://github.com/apache/spark/compare/master...viirya:remove_one_hot_encoder_test?expand=1
    
    I'm OK with splitting this into two PRs if you prefer. WDYT, @srowen?


---



[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    retest this please.


---



[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23100#discussion_r236097295
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
    @@ -17,126 +17,512 @@
     
     package org.apache.spark.ml.feature
     
    +import org.apache.hadoop.fs.Path
    --- End diff --
    
    Are the changes here basically a copy-paste from `OneHotEncoderEstimator`? I wonder if we could structure this in git as a delete followed by a move, for a cleaner history. It doesn't matter much, though, and git may treat both the same way.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99161/
    Test FAILed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5444/
    Test PASSed.


---



[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/23100
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99154/
    Test FAILed.


---
