Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/09/27 15:27:00 UTC

[jira] [Resolved] (SPARK-32973) FeatureHasher does not check categoricalCols in inputCols

     [ https://issues.apache.org/jira/browse/SPARK-32973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-32973.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29868
[https://github.com/apache/spark/pull/29868]

> FeatureHasher does not check categoricalCols in inputCols
> ---------------------------------------------------------
>
>                 Key: SPARK-32973
>                 URL: https://issues.apache.org/jira/browse/SPARK-32973
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML
>    Affects Versions: 2.3.0, 2.4.0, 3.0.0, 3.1.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Trivial
>             Fix For: 3.1.0
>
>
> The documentation for {{categoricalCols}} states:
> {code:java}
> Numeric columns to treat as categorical features. By default only string and boolean columns are treated as categorical, so this param can be used to explicitly specify the numerical columns to treat as categorical. Note, the relevant columns must also be set in inputCols. {code}
>  
> However, the check that every column in {{categoricalCols}} also appears in {{inputCols}} was never implemented.
> For example, in 2.4.7 and on current master (3.1.0):
> {code:java}
> scala> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.feature._
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> val df = Seq((2.0, 1, "foo"),(3.0, 2, "bar")).toDF("real", "int", "string")
> df: org.apache.spark.sql.DataFrame = [real: double, int: int ... 1 more field]
> scala> val n = 100
> n: Int = 100
> scala> val hasher = new FeatureHasher().setInputCols("int", "string").setCategoricalCols(Array("real")).setOutputCol("features").setNumFeatures(n) 
> hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_fbe05968b33f
> scala> hasher.transform(df).show
> +----+---+------+--------------------+
> |real|int|string|            features|
> +----+---+------+--------------------+
> | 2.0|  1|   foo|(100,[2,39],[1.0,...|
> | 3.0|  2|   bar|(100,[2,42],[2.0,...|
> +----+---+------+--------------------+
> {code}
>  
> The categorical column "real" is not in inputCols ("int", "string"), yet the transform runs without any error.
>  
> I think there are two options:
> 1. Remove the comment "Note, the relevant columns must also be set in inputCols.", since this requirement seems unnecessary;
> 2. Add a check to make sure all categoricalCols are in inputCols.
>  
>  
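
For illustration only, option 2 from the quoted description could be sketched as a plain subset check. This is a minimal, Spark-free sketch and not the change made in pull request 29868: the object name CategoricalColsCheck, the method validate, and the error message wording are all hypothetical.

```scala
// Hypothetical helper, NOT Spark's actual implementation: it only illustrates
// the kind of check described in option 2 of the quoted issue.
object CategoricalColsCheck {

  // Throws IllegalArgumentException (via require) if any categorical column
  // is not also listed among the input columns.
  def validate(inputCols: Seq[String], categoricalCols: Seq[String]): Unit = {
    val inputSet = inputCols.toSet
    val missing = categoricalCols.filterNot(c => inputSet.contains(c))
    require(
      missing.isEmpty,
      s"categoricalCols must be a subset of inputCols, but these are missing: ${missing.mkString(", ")}"
    )
  }

  def main(args: Array[String]): Unit = {
    // "real" is listed in both parameters, so the check passes silently.
    validate(Seq("real", "int", "string"), Seq("real"))

    // Same parameters as the quoted session: "real" is categorical but not an input column.
    try {
      validate(Seq("int", "string"), Seq("real"))
    } catch {
      case e: IllegalArgumentException => println(e.getMessage)
    }
  }
}
```

With a check of this shape, the configuration from the quoted session would be rejected up front instead of silently ignoring the "real" column.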
