You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/01/11 07:06:45 UTC
[GitHub] spark pull request #20229: [SPARK-23037][ML] Update RFormula to use VectorSi...
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/20229#discussion_r160877856
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -228,22 +229,33 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
// Then we handle one-hot encoding and interactions between terms.
var keepReferenceCategory = false
val encodedTerms = resolvedFormula.terms.map {
- case Seq(term) if dataset.schema(term).dataType == StringType =>
- val encodedCol = tmpColumn("onehot")
- var encoder = new OneHotEncoder()
- .setInputCol(indexed(term))
- .setOutputCol(encodedCol)
- // Formula w/o intercept, one of the categories in the first category feature is
- // being used as reference category, we will not drop any category for that feature.
- if (!hasIntercept && !keepReferenceCategory) {
- encoder = encoder.setDropLast(false)
- keepReferenceCategory = true
- }
- encoderStages += encoder
- prefixesToRewrite(encodedCol + "_") = term + "_"
- encodedCol
case Seq(term) =>
- term
+ dataset.schema(term).dataType match {
+ case _: StringType =>
+ val encodedCol = tmpColumn("onehot")
+ // Formula w/o intercept, one of the categories in the first category feature is
+ // being used as reference category, we will not drop any category for that feature.
+ if (!hasIntercept && !keepReferenceCategory) {
+ keepReferenceCategory = true
+ encoderStages += new OneHotEncoderEstimator(uid)
--- End diff --
As `OneHotEncoderEstimator` can handle multi-column now, can't we just have one `OneHotEncoderEstimator` to fit for all terms at once?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org