Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/11/21 03:30:45 UTC
[GitHub] spark pull request #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEn...
GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/23100
[WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
## What changes were proposed in this pull request?
We deprecated OneHotEncoder in Spark 2.3.0 and introduced OneHotEncoderEstimator. In 3.0.0, we remove the deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder.
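The imports in the diff quoted later in this thread replace `Transformer` with `Estimator` and `Model`, and `HasInputCol`/`HasOutputCol` with their multi-column counterparts. The following is a minimal, Spark-free sketch of that fit-then-transform pattern. `ToyOneHotEncoder` and `ToyOneHotEncoderModel` are hypothetical names, and the logic is an illustration under simplifying assumptions (dense vectors, integer category indices, no invalid-data handling), not the code in this PR.

```scala
// Toy model of the estimator pattern; NOT Spark's implementation.
// ToyOneHotEncoder and ToyOneHotEncoderModel are hypothetical names.

/** Fitted state: the number of categories learned for each input column. */
final case class ToyOneHotEncoderModel(categorySizes: Vector[Int]) {

  /** Encodes one row (one category index per column) into dense 0/1 vectors. */
  def transform(row: Vector[Int]): Vector[Vector[Double]] = {
    require(row.length == categorySizes.length, "row width must match the fitted columns")
    row.zip(categorySizes).map { case (index, size) =>
      require(index >= 0 && index < size, s"category index out of range: $index")
      Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
    }
  }
}

/** Estimator: scans the data once to learn how many categories each column has. */
final class ToyOneHotEncoder {
  def fit(rows: Seq[Vector[Int]]): ToyOneHotEncoderModel = {
    require(rows.nonEmpty, "cannot fit on empty data")
    val numCols = rows.head.length
    // Category indices are assumed to start at 0, so size = largest index + 1.
    val sizes = Vector.tabulate(numCols)(c => rows.map(_(c)).max + 1)
    ToyOneHotEncoderModel(sizes)
  }
}
```

In this sketch the category sizes are fixed at fit time, so every later call to `transform` produces vectors of the same width regardless of which values appear in the new data.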
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 remove_one_hot_encoder
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23100.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23100
----
commit cf6da4b72d04ab109739400dcbf6d75a9d34625e
Author: Liang-Chi Hsieh <vi...@...>
Date: 2018-11-21T03:25:18Z
Remove OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99152/testReport)** for PR 23100 at commit [`40de38f`](https://github.com/apache/spark/commit/40de38f486d40a9517f9b3060bbd3cfd23d20986).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99367/testReport)** for PR 23100 at commit [`a451756`](https://github.com/apache/spark/commit/a451756f393c107963b893ba4a74c1d6ade33dd0).
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5256/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99102 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99102/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99152/testReport)** for PR 23100 at commit [`40de38f`](https://github.com/apache/spark/commit/40de38f486d40a9517f9b3060bbd3cfd23d20986).
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99161/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
Yes, this diff just reflects renaming OneHotEncoderEstimator to OneHotEncoder. Besides that, it also changes the related documentation and example code.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99168/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test FAILed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99168 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99168/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test FAILed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
cc @jkbradley @MLnick @dbtsai
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5269/
Test PASSed.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236411677
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
--- End diff --
Or we can file two PRs: one for removing the old `OneHotEncoder`, and the other for the renaming.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99154/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99161/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
retest this please.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99095 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99095/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
Thanks all! I will create a follow-up later to add an `OneHotEncoderEstimator` alias.
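For readers wondering what such an alias could look like on the Scala side, here is one possible shape, sketched without Spark. The `OneHotEncoder` class below is a hypothetical stand-in, `AliasSketch` is an invented container, and nothing here is taken from the actual follow-up; it only shows the general deprecated-type-alias idiom.

```scala
// Hypothetical stand-in for the renamed class; not Spark code.
class OneHotEncoder(val uid: String)

object AliasSketch {
  // The old name is kept as a deprecated type alias, so existing Scala source
  // that refers to OneHotEncoderEstimator still compiles (with a warning).
  @deprecated("Use OneHotEncoder instead", "3.0.0")
  type OneHotEncoderEstimator = OneHotEncoder
}
```

A Scala type alias only helps at the source level: it does not create a separate class, so Java or Python callers and previously compiled binaries would need a different mechanism.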
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
This was discussed before when we added `OneHotEncoderEstimator`, and there is a note in `OneHotEncoder`:
https://github.com/apache/spark/blob/a5925c1631e25c2dcc3c2948cea31e993ce66a97/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L44-L45
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/23100
Changes of this type can really piss some people off. Was there consensus on this?
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99367/
Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99366/testReport)** for PR 23100 at commit [`9e3dab7`](https://github.com/apache/spark/commit/9e3dab7388a6de18b7cf8ddcc8cc2c73a6efea67).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99367/testReport)** for PR 23100 at commit [`a451756`](https://github.com/apache/spark/commit/a451756f393c107963b893ba4a74c1d6ade33dd0).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class OneHotEncoder @Since("3.0.0") (@Since("3.0.0") override val uid: String)`
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/23100
I went through the PR again, and it looks right to me. Merged into master. Thanks!
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236410750
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
import org.apache.spark.ml.util._
import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+ with HasInputCols with HasOutputCols {
+
+ /**
+ * Param for how to handle invalid data during transform().
+ * Options are 'keep' (invalid data presented as an extra categorical feature) or
+ * 'error' (throw an error).
+ * Note that this Param is only used during transform; during fitting, invalid data
+ * will result in an error.
+ * Default: "error"
+ * @group param
+ */
+ @Since("2.3.0")
--- End diff --
As we discussed previously, it's a new class. Should we make it `3.0`?
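As background on the doc comment in the quoted diff, the difference between the 'keep' and 'error' options can be sketched in plain Scala. This is a simplified model under stated assumptions: the names `InvalidPolicy`, `KeepInvalid`, `ErrorOnInvalid`, and `HandleInvalidSketch` are hypothetical, indices are assumed non-negative, vectors are dense, and `IllegalArgumentException` stands in for whatever exception the real code throws.

```scala
// Simplified model of the 'keep' / 'error' options; NOT Spark's implementation.
sealed trait InvalidPolicy
case object KeepInvalid extends InvalidPolicy    // models handleInvalid = "keep"
case object ErrorOnInvalid extends InvalidPolicy // models handleInvalid = "error"

object HandleInvalidSketch {
  /** Encodes `index`, given `numCategories` learned during fitting. */
  def encode(index: Int, numCategories: Int, policy: InvalidPolicy): Vector[Double] = {
    require(index >= 0, "this sketch assumes non-negative category indices")
    val isValid = index < numCategories
    policy match {
      case KeepInvalid =>
        // One extra slot at the end acts as the additional "invalid" category.
        val slot = if (isValid) index else numCategories
        Vector.tabulate(numCategories + 1)(i => if (i == slot) 1.0 else 0.0)
      case ErrorOnInvalid =>
        if (!isValid) throw new IllegalArgumentException(s"Unseen value: $index")
        Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
    }
  }
}
```

With three fitted categories, an unseen index under the keep policy lights up the fourth slot, while the error policy rejects the same input.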
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236471032
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
import org.apache.spark.ml.util._
import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+ with HasInputCols with HasOutputCols {
+
+ /**
+ * Param for how to handle invalid data during transform().
+ * Options are 'keep' (invalid data presented as an extra categorical feature) or
+ * 'error' (throw an error).
+ * Note that this Param is only used during transform; during fitting, invalid data
+ * will result in an error.
+ * Default: "error"
+ * @group param
+ */
+ @Since("2.3.0")
--- End diff --
Yea, looks like we should make it `3.0`.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
Apart from changing the `Since` tag, are there any other comments?
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5205/
Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99102 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99102/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5216/
Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99168/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99095/testReport)** for PR 23100 at commit [`ee3de58`](https://github.com/apache/spark/commit/ee3de5862e975f9659af474c43133294ec5ce369).
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test FAILed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/23100
It's hard to track the huge diff produced by the renaming. I didn't go through it line by line, but if it's just renaming, the rest LGTM.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99092/testReport)** for PR 23100 at commit [`cf6da4b`](https://github.com/apache/spark/commit/cf6da4b72d04ab109739400dcbf6d75a9d34625e).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5262/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r237010801
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
import org.apache.spark.ml.util._
import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+ with HasInputCols with HasOutputCols {
+
+ /**
+ * Param for how to handle invalid data during transform().
+ * Options are 'keep' (invalid data presented as an extra categorical feature) or
+ * 'error' (throw an error).
+ * Note that this Param is only used during transform; during fitting, invalid data
+ * will result in an error.
+ * Default: "error"
+ * @group param
+ */
+ @Since("2.3.0")
--- End diff --
I changed the `Since` tag of the renamed `OneHotEncoder` to `3.0.0`.
Because `OneHotEncoderBase` was not renamed, I left its `Since` tag unchanged.
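For illustration only, the resulting annotation layout might look like the sketch below. It assumes a local stand-in `Since` annotation so that it compiles without Spark (the real one is `org.apache.spark.annotation.Since`), and the trait and class bodies are reduced to placeholders. It is not the code in this PR.

```scala
import scala.annotation.StaticAnnotation

// Local stand-in (assumption) for org.apache.spark.annotation.Since,
// so that this sketch compiles without Spark on the classpath.
class Since(version: String) extends StaticAnnotation

// Placeholder for the shared trait: it is not renamed by the PR,
// so its members keep the version in which they first appeared.
trait OneHotEncoderBase {
  @Since("2.3.0")
  def getDropLast: Boolean = true
}

// Placeholder for the renamed estimator: the class is new under this name,
// so the class, its constructor, and `uid` are all tagged 3.0.0.
@Since("3.0.0")
class OneHotEncoder @Since("3.0.0") (@Since("3.0.0") val uid: String)
  extends OneHotEncoderBase
```

The point of the split is that a `Since` tag records when a public name became available, not when the underlying logic was written.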
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99092/testReport)** for PR 23100 at commit [`cf6da4b`](https://github.com/apache/spark/commit/cf6da4b72d04ab109739400dcbf6d75a9d34625e).
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99152/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5257/
Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5268/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/23100
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99366/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236471495
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
--- End diff --
If the diff comes out the same either way, leave it as is.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99366/testReport)** for PR 23100 at commit [`9e3dab7`](https://github.com/apache/spark/commit/9e3dab7388a6de18b7cf8ddcc8cc2c73a6efea67).
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5208/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99095/
Test FAILed.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236410306
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
--- End diff --
I guess that once the commits are squashed into one, the diff will still look like this, with no better history.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
@srowen Thanks for the review. I think there is no R counterpart for `OneHotEncoder`.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5443/
Test PASSed.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by zhengruifeng <gi...@git.apache.org>.
Github user zhengruifeng commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r235886910
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
import org.apache.spark.ml.util._
import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+ with HasInputCols with HasOutputCols {
+
+ /**
+ * Param for how to handle invalid data during transform().
+ * Options are 'keep' (invalid data presented as an extra categorical feature) or
+ * 'error' (throw an error).
+ * Note that this Param is only used during transform; during fitting, invalid data
+ * will result in an error.
+ * Default: "error"
+ * @group param
+ */
+ @Since("2.3.0")
+ override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+ "How to handle invalid data during transform(). " +
+ "Options are 'keep' (invalid data presented as an extra categorical feature) " +
+ "or error (throw an error). Note that this Param is only used during transform; " +
+ "during fitting, invalid data will result in an error.",
+ ParamValidators.inArray(OneHotEncoder.supportedHandleInvalids))
+
+ setDefault(handleInvalid, OneHotEncoder.ERROR_INVALID)
+
+ /**
+ * Whether to drop the last category in the encoded vector (default: true)
+ * @group param
+ */
+ @Since("2.3.0")
+ final val dropLast: BooleanParam =
+ new BooleanParam(this, "dropLast", "whether to drop the last category")
+ setDefault(dropLast -> true)
+
+ /** @group getParam */
+ @Since("2.3.0")
+ def getDropLast: Boolean = $(dropLast)
+
+ protected def validateAndTransformSchema(
+ schema: StructType,
+ dropLast: Boolean,
+ keepInvalid: Boolean): StructType = {
+ val inputColNames = $(inputCols)
+ val outputColNames = $(outputCols)
+
+ require(inputColNames.length == outputColNames.length,
+ s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+ s"output columns ${outputColNames.length}.")
+
+ // Input columns must be NumericType.
+ inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
+
+ // Prepares output columns with proper attributes by examining input columns.
+ val inputFields = $(inputCols).map(schema(_))
+
+ val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+ OneHotEncoderCommon.transformOutputColumnSchema(
+ inputField, outputColName, dropLast, keepInvalid)
+ }
+ outputFields.foldLeft(schema) { case (newSchema, outputField) =>
+ SchemaUtils.appendColumn(newSchema, outputField)
+ }
+ }
+}
/**
* A one-hot encoder that maps a column of category indices to a column of binary vectors, with
* at most a single one-value per row that indicates the input category index.
* For example with 5 categories, an input value of 2.0 would map to an output vector of
* `[0.0, 0.0, 1.0, 0.0]`.
- * The last category is not included by default (configurable via `OneHotEncoder!.dropLast`
+ * The last category is not included by default (configurable via `dropLast`),
* because it makes the vector entries sum up to one, and hence linearly dependent.
* So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
*
* @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
* The output vectors are sparse.
*
+ * When `handleInvalid` is configured to 'keep', an extra "category" indicating invalid values is
+ * added as last category. So when `dropLast` is true, invalid values are encoded as all-zeros
+ * vector.
+ *
+ * @note When encoding multi-column by using `inputCols` and `outputCols` params, input/output cols
+ * come in pairs, specified by the order in the arrays, and each pair is treated independently.
+ *
* @see `StringIndexer` for converting categorical values into category indices
- * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
- * will be removed in 3.0.0.
*/
-@Since("1.4.0")
-@deprecated("`OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`" +
- " will be removed in 3.0.0.", "2.3.0")
-class OneHotEncoder @Since("1.4.0") (@Since("1.4.0") override val uid: String) extends Transformer
- with HasInputCol with HasOutputCol with DefaultParamsWritable {
+@Since("2.3.0")
+class OneHotEncoder @Since("2.3.0") (@Since("2.3.0") override val uid: String)
--- End diff --
Or maybe since 3.0.0? It is a new class.
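As an aside on the Scaladoc quoted in the diff above, the encoding rule it describes can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's implementation: `OneHotSketch`, `encode`, and `numCategories` are hypothetical names, category indices are taken as `Int`, and the result is a dense `Vector[Double]` for readability, whereas the Scaladoc says the real output vectors are sparse.

```scala
// Simplified model of the encoding described in the quoted Scaladoc.
// Not Spark code: the object, method, and parameter names are hypothetical.
object OneHotSketch {
  def encode(
      index: Int,
      numCategories: Int,
      dropLast: Boolean = true,
      keepInvalid: Boolean = false): Vector[Double] = {
    val valid = index >= 0 && index < numCategories
    // Mirrors handleInvalid = "error": an out-of-range index is rejected.
    if (!valid && !keepInvalid) {
      throw new IllegalArgumentException(s"Invalid category index: $index")
    }
    // Mirrors handleInvalid = "keep": one extra slot for invalid values
    // is appended as the last category.
    val slots = if (keepInvalid) numCategories + 1 else numCategories
    val position = if (valid) index else numCategories
    // Dropping the last category shortens the vector by one, so the last
    // category (or the invalid slot under "keep") becomes all zeros.
    val size = if (dropLast) slots - 1 else slots
    Vector.tabulate(size)(i => if (i == position) 1.0 else 0.0)
  }
}
```

With 5 categories and the defaults, `OneHotSketch.encode(2, 5)` gives `Vector(0.0, 0.0, 1.0, 0.0)` and `OneHotSketch.encode(4, 5)` gives four zeros, matching the examples in the Scaladoc. With `keepInvalid = true`, an out-of-range index such as 7 gives five zeros.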
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
retest this please...
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23100
**[Test build #99154 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99154/testReport)** for PR 23100 at commit [`64364b9`](https://github.com/apache/spark/commit/64364b9934d191842a2a68bf1c795d14f746f8bc).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r235760329
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
import org.apache.spark.ml.util._
import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+ with HasInputCols with HasOutputCols {
+
+ /**
+ * Param for how to handle invalid data during transform().
+ * Options are 'keep' (invalid data presented as an extra categorical feature) or
+ * 'error' (throw an error).
+ * Note that this Param is only used during transform; during fitting, invalid data
+ * will result in an error.
+ * Default: "error"
+ * @group param
+ */
+ @Since("2.3.0")
+ override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+ "How to handle invalid data during transform(). " +
+ "Options are 'keep' (invalid data presented as an extra categorical feature) " +
+ "or error (throw an error). Note that this Param is only used during transform; " +
+ "during fitting, invalid data will result in an error.",
+ ParamValidators.inArray(OneHotEncoder.supportedHandleInvalids))
+
+ setDefault(handleInvalid, OneHotEncoder.ERROR_INVALID)
+
+ /**
+ * Whether to drop the last category in the encoded vector (default: true)
+ * @group param
+ */
+ @Since("2.3.0")
+ final val dropLast: BooleanParam =
+ new BooleanParam(this, "dropLast", "whether to drop the last category")
+ setDefault(dropLast -> true)
+
+ /** @group getParam */
+ @Since("2.3.0")
+ def getDropLast: Boolean = $(dropLast)
+
+ protected def validateAndTransformSchema(
+ schema: StructType,
+ dropLast: Boolean,
+ keepInvalid: Boolean): StructType = {
+ val inputColNames = $(inputCols)
+ val outputColNames = $(outputCols)
+
+ require(inputColNames.length == outputColNames.length,
+ s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+ s"output columns ${outputColNames.length}.")
+
+ // Input columns must be NumericType.
+ inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
+
+ // Prepares output columns with proper attributes by examining input columns.
+ val inputFields = $(inputCols).map(schema(_))
+
+ val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+ OneHotEncoderCommon.transformOutputColumnSchema(
+ inputField, outputColName, dropLast, keepInvalid)
+ }
+ outputFields.foldLeft(schema) { case (newSchema, outputField) =>
+ SchemaUtils.appendColumn(newSchema, outputField)
+ }
+ }
+}
/**
* A one-hot encoder that maps a column of category indices to a column of binary vectors, with
* at most a single one-value per row that indicates the input category index.
* For example with 5 categories, an input value of 2.0 would map to an output vector of
* `[0.0, 0.0, 1.0, 0.0]`.
- * The last category is not included by default (configurable via `OneHotEncoder!.dropLast`
+ * The last category is not included by default (configurable via `dropLast`),
* because it makes the vector entries sum up to one, and hence linearly dependent.
* So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
*
* @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
* The output vectors are sparse.
*
+ * When `handleInvalid` is configured to 'keep', an extra "category" indicating invalid values is
+ * added as last category. So when `dropLast` is true, invalid values are encoded as all-zeros
+ * vector.
+ *
+ * @note When encoding multi-column by using `inputCols` and `outputCols` params, input/output cols
+ * come in pairs, specified by the order in the arrays, and each pair is treated independently.
+ *
* @see `StringIndexer` for converting categorical values into category indices
- * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
- * will be removed in 3.0.0.
*/
-@Since("1.4.0")
-@deprecated("`OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`" +
- " will be removed in 3.0.0.", "2.3.0")
-class OneHotEncoder @Since("1.4.0") (@Since("1.4.0") override val uid: String) extends Transformer
- with HasInputCol with HasOutputCol with DefaultParamsWritable {
+@Since("2.3.0")
+class OneHotEncoder @Since("2.3.0") (@Since("2.3.0") override val uid: String)
--- End diff --
In this renaming case, I'm not sure whether we should use the `Since` from the old `OneHotEncoder` (`1.4.0`) or the one from `OneHotEncoderEstimator` (`2.3.0`). For now I use `OneHotEncoderEstimator`'s.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236097273
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
import org.apache.spark.ml.util._
import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+ with HasInputCols with HasOutputCols {
+
+ /**
+ * Param for how to handle invalid data during transform().
+ * Options are 'keep' (invalid data presented as an extra categorical feature) or
+ * 'error' (throw an error).
+ * Note that this Param is only used during transform; during fitting, invalid data
+ * will result in an error.
+ * Default: "error"
+ * @group param
+ */
+ @Since("2.3.0")
+ override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+ "How to handle invalid data during transform(). " +
+ "Options are 'keep' (invalid data presented as an extra categorical feature) " +
+ "or error (throw an error). Note that this Param is only used during transform; " +
+ "during fitting, invalid data will result in an error.",
+ ParamValidators.inArray(OneHotEncoder.supportedHandleInvalids))
+
+ setDefault(handleInvalid, OneHotEncoder.ERROR_INVALID)
+
+ /**
+ * Whether to drop the last category in the encoded vector (default: true)
+ * @group param
+ */
+ @Since("2.3.0")
+ final val dropLast: BooleanParam =
+ new BooleanParam(this, "dropLast", "whether to drop the last category")
+ setDefault(dropLast -> true)
+
+ /** @group getParam */
+ @Since("2.3.0")
+ def getDropLast: Boolean = $(dropLast)
+
+ protected def validateAndTransformSchema(
+ schema: StructType,
+ dropLast: Boolean,
+ keepInvalid: Boolean): StructType = {
+ val inputColNames = $(inputCols)
+ val outputColNames = $(outputCols)
+
+ require(inputColNames.length == outputColNames.length,
+ s"The number of input columns ${inputColNames.length} must be the same as the number of " +
+ s"output columns ${outputColNames.length}.")
+
+ // Input columns must be NumericType.
+ inputColNames.foreach(SchemaUtils.checkNumericType(schema, _))
+
+ // Prepares output columns with proper attributes by examining input columns.
+ val inputFields = $(inputCols).map(schema(_))
+
+ val outputFields = inputFields.zip(outputColNames).map { case (inputField, outputColName) =>
+ OneHotEncoderCommon.transformOutputColumnSchema(
+ inputField, outputColName, dropLast, keepInvalid)
+ }
+ outputFields.foldLeft(schema) { case (newSchema, outputField) =>
+ SchemaUtils.appendColumn(newSchema, outputField)
+ }
+ }
+}
/**
* A one-hot encoder that maps a column of category indices to a column of binary vectors, with
* at most a single one-value per row that indicates the input category index.
* For example with 5 categories, an input value of 2.0 would map to an output vector of
* `[0.0, 0.0, 1.0, 0.0]`.
- * The last category is not included by default (configurable via `OneHotEncoder!.dropLast`
+ * The last category is not included by default (configurable via `dropLast`),
* because it makes the vector entries sum up to one, and hence linearly dependent.
* So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
*
* @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
* The output vectors are sparse.
*
+ * When `handleInvalid` is configured to 'keep', an extra "category" indicating invalid values is
+ * added as last category. So when `dropLast` is true, invalid values are encoded as all-zeros
+ * vector.
+ *
+ * @note When encoding multi-column by using `inputCols` and `outputCols` params, input/output cols
+ * come in pairs, specified by the order in the arrays, and each pair is treated independently.
+ *
* @see `StringIndexer` for converting categorical values into category indices
- * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
- * will be removed in 3.0.0.
*/
-@Since("1.4.0")
-@deprecated("`OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`" +
- " will be removed in 3.0.0.", "2.3.0")
-class OneHotEncoder @Since("1.4.0") (@Since("1.4.0") override val uid: String) extends Transformer
- with HasInputCol with HasOutputCol with DefaultParamsWritable {
+@Since("2.3.0")
+class OneHotEncoder @Since("2.3.0") (@Since("2.3.0") override val uid: String)
--- End diff --
I'd say `Since` 3.0.0, as it's essentially new under this name, and any similarity to the old class is coincidental.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99102/
Test PASSed.
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99092/
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236469693
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
--- End diff --
Yeah, I tried removing the old `OneHotEncoder` first and then doing the rename, but the git diff still looks like this:
https://github.com/apache/spark/compare/master...viirya:remove_one_hot_encoder_test?expand=1
I'm OK with splitting this into two PRs if you prefer. WDYT? @srowen
---
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/23100
retest this please.
---
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23100#discussion_r236097295
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
--- End diff --
Are the changes here basically a copy-paste from OneHotEncoderEstimator? I wonder if we could structure this in git as a delete followed by a move, for a better history. It doesn't matter much, though, and git may treat it the same way anyway.
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99161/
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5444/
---
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23100
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99154/
---