Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/07/03 09:53:58 UTC

[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/18513

    [SPARK-13969][ML] Add FeatureHasher transformer

    This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).
    
    The transformer operates on multiple input columns in one pass. Current behavior is:
    * for numerical columns, the values are assumed to be real values and the feature index is `hash(column_name)`, while the feature value is `feature_value`
    * for string columns, the values are assumed to be categorical and the feature index is `hash(column_name=feature_value)`, while the feature value is `1.0`
    * for hash collisions, the feature values are summed
    * `null` (missing) values are ignored
    
    The following dataframe illustrates the basic semantics:
    ```
    +---+------+-----+---------+------+-----------------------------------------+
    |int|double|float|stringNum|string|features                                 |
    +---+------+-----+---------+------+-----------------------------------------+
    |3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
    |6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
    +---+------+-----+---------+------+-----------------------------------------+
    ```
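    
    As a rough illustration of the semantics above, here is a minimal plain-Scala sketch (not the code in this PR). It uses `scala.util.hashing.MurmurHash3` from the standard library as a stand-in hash, whereas the PR uses `OldHashingTF.murmur3Hash`, so the indices it produces will not match the table above. The names `FeatureHashingSketch`, `hashRow` and `nonNegativeMod` are made up for the example.
    
    ```scala
    import scala.util.hashing.MurmurHash3

    object FeatureHashingSketch {
      // Hypothetical helper: non-negative modulo, so negative hashes still map into [0, n).
      def nonNegativeMod(x: Int, n: Int): Int = {
        val raw = x % n
        if (raw < 0) raw + n else raw
      }

      // Hash one row, given as (columnName, value) pairs, into a sparse map of index -> value.
      def hashRow(row: Seq[(String, Any)], numFeatures: Int): Map[Int, Double] =
        row.foldLeft(Map.empty[Int, Double]) {
          case (acc, (_, null)) => acc // null (missing) values are ignored
          case (acc, (name, value)) =>
            val (key, v) = value match {
              case num: java.lang.Number => (name, num.doubleValue()) // real feature: hash(column_name)
              case other => (s"$name=$other", 1.0) // categorical: hash(column_name=value), indicator 1.0
            }
            val idx = nonNegativeMod(MurmurHash3.stringHash(key), numFeatures)
            acc.updated(idx, acc.getOrElse(idx, 0.0) + v) // colliding features are summed
        }

      def main(args: Array[String]): Unit = {
        val row = Seq[(String, Any)]("int" -> 3, "double" -> 4.0, "string" -> "foo", "missing" -> null)
        println(hashRow(row, 16))
      }
    }
    ```
    
    Whatever indices the stand-in hash picks, the values in the resulting map sum to 3 + 4 + 1 = 8, since collisions add values rather than overwrite them.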
    
    ## How was this patch tested?
    
    New unit tests and manual experiments.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark FeatureHasher

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18513.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18513
    
----
commit 6ab19a963f35de29af0a6b7b1598d5add78f200a
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-08-23T10:29:06Z

    initial WIP

commit ebd2cbf3467f26121c602f7c77c2018253cbdf18
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-02-01T10:43:07Z

    Further work

commit ba255bfda792d58aaded892e49c6cf48f0391159
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T10:52:12Z

    Clean up

commit 0be1e6572110d7d550f69fd86d3dd4e96660fde6
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T10:52:37Z

    Add tests

commit 2f3ea21e2e1835d7218e8c7bd096cc0787ed595c
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T13:08:26Z

    Copy, save/load, clean up

commit 7d678fbf5f88d377b79153212a3e0a2596039b17
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-26T12:38:02Z

    Move numFeatures to HasNumFeatures shared trait

commit 60572776de80ebcf1782c3d7def749557c8bec61
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-07-03T07:18:25Z

    Update shared params from codegen run

commit 9edb3bda8cbc4e00f05b91718249edf2750fc028
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-07-03T09:32:32Z

    Update tests. Null values ignored in feature hashing.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127557871
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala ---
    @@ -0,0 +1,193 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types._
    +
    +class FeatureHasherSuite extends SparkFunSuite
    +  with MLlibTestSparkContext
    +  with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  import HashingTFSuite.murmur3FeatureIdx
    +
    +  implicit val vectorEncoder = ExpressionEncoder[Vector]()
    --- End diff --
    
    private




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    @sethah thanks for reviewing. 
    
    _For the first question:_
    
    Yes, currently categorical columns that are numerical would need to be explicitly encoded as strings. I mentioned it as a follow-up improvement. It's easy to handle; it's just the API for this that I'm not certain of yet. Here are the two options I see:
    
    1. User can specify param `categoricalCols` to explicitly set categorical cols. But, do we then assume that all other columns not in that list, that are strings, are categorical? i.e. this param is effectively only for numeric columns that must be treated as categorical? Or do we ignore all other non-numerical columns? etc
    2. User can specify param `realCols` to explicitly set the numeric columns. All other columns are treated as categorical.
    
    We could potentially offer both formats, though I lean towards (2) above, since the default use case will be encoding many (usually high cardinality) categorical columns, with maybe a few real columns in there.
    
    _For the second issue:_
    
    There is no way (at least that I know of) to provide a `dropLast` feature, since we don't know how many features there are. The whole point of hashing is to avoid keeping the `feature <-> index` mapping, for speed and memory efficiency.




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79934 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79934/testReport)** for PR 18513 at commit [`a91b53f`](https://github.com/apache/spark/commit/a91b53f7482b8a05734e77f42491a70f1e3e77f1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79092/
    Test FAILed.




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79092/testReport)** for PR 18513 at commit [`9edb3bd`](https://github.com/apache/spark/commit/9edb3bda8cbc4e00f05b91718249edf2750fc028).




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79558/testReport)** for PR 18513 at commit [`b580a5c`](https://github.com/apache/spark/commit/b580a5c80421256e8d82f4e7cda7879ecc59bbbd).




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r128199981
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    --- End diff --
    
    Hmm, this is a method not a function - so I don't think it will be faster to do `val` in this case?
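    
    To make the method-versus-function distinction concrete, here is a small standalone sketch (the names `MethodVsFunction`, `getDoubleMethod` and `getDoubleFunction` are made up for the example). It only shows the two forms side by side; it is not a benchmark and makes no claim about which is faster inside the UDF.
    
    ```scala
    object MethodVsFunction {
      // A method: compiled to an ordinary JVM method.
      def getDoubleMethod(x: Any): Double = x match {
        case n: java.lang.Number => n.doubleValue()
        case other => other.asInstanceOf[Double]
      }

      // A function value: an instance of Function1[Any, Double] stored in a field.
      val getDoubleFunction: Any => Double = {
        case n: java.lang.Number => n.doubleValue()
        case other => other.asInstanceOf[Double]
      }

      def main(args: Array[String]): Unit = {
        println(getDoubleMethod(3))      // 3.0
        println(getDoubleFunction(4.5f)) // 4.5
        // Passing the method where a function is expected triggers eta-expansion,
        // which creates a function object at that point.
        println(Seq[Any](1, 2L, 3.0).map(getDoubleMethod).sum) // 6.0
      }
    }
    ```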




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **Note 1**: this is distinct from `HashingTF` which handles vectorizing text to term frequencies (analogous to [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)). This feature hasher _could_ be extended to also handle `Seq[String]` input columns. But I feel it conflates concerns - e.g. `HashingTF` handles min term frequencies, binarization etc.
    
    However we could later add basic support for `Seq[String]` columns - this would handle raw text in a similar way to Vowpal Wabbit, i.e. it all gets hashed into one feature vector (can be combined with namespaces later).
    
    **Note 2**: some potential follow ups:
    * support specifying categorical columns explicitly. This would be to allow forcing some columns that are in numerical format to be treated as categorical. Strings would still be treated as categorical.
    * support using the sign of hashed value as sign of feature value, and then support `non_negative` param (see [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html))
    * support feature namespaces and feature interactions similar to [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-interactions) (see [here](https://gist.github.com/luoq/b4c374b5cbabe3ae76ffacdac22750af) for an outline of the code used). This could provide an efficient and scalable form of `PolynomialExpansion`.
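    
    For the signed-hashing follow-up, a minimal sketch of the idea could look like the following. This is not part of this PR; it again uses the standard library's `scala.util.hashing.MurmurHash3` as a stand-in hash, and `SignedHashingSketch` and `signedFeature` are hypothetical names.
    
    ```scala
    import scala.util.hashing.MurmurHash3

    object SignedHashingSketch {
      // Maps a feature key to (vector index, signed value). The sign of the raw hash
      // decides the sign of the value, so colliding features can partly cancel out
      // instead of always accumulating.
      def signedFeature(key: String, value: Double, numFeatures: Int): (Int, Double) = {
        val h = MurmurHash3.stringHash(key)
        val sign = if (h >= 0) 1.0 else -1.0
        val idx = ((h % numFeatures) + numFeatures) % numFeatures // non-negative modulo
        (idx, sign * value)
      }

      def main(args: Array[String]): Unit = {
        Seq("string=foo", "string=bar", "real").foreach { key =>
          println(key + " -> " + signedFeature(key, 1.0, 16))
        }
      }
    }
    ```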
    
    cc @srowen @jkbradley @sethah @hhbyyh @yanboliang @BryanCutler @holdenk 




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #80724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80724/testReport)** for PR 18513 at commit [`d6a3117`](https://github.com/apache/spark/commit/d6a311748486490215264fbdc0a6f8cb4cf7e6e1).




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79558/testReport)** for PR 18513 at commit [`b580a5c`](https://github.com/apache/spark/commit/b580a5c80421256e8d82f4e7cda7879ecc59bbbd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    @hhbyyh can you elaborate on your concerns in comment https://github.com/apache/spark/pull/18513#pullrequestreview-50194532?
    
    I tend to agree that the hasher is perhaps best used for categorical features, while known real features could be "assembled" onto the resulting hashed feature vector. However, one nice thing about hashing is it can handle everything at once in one pass. In practice even with very high cardinality categorical features and some real features, for the "normal" settings of hash bits, hash collision rate is relatively low, and has very little impact on performance (at least from my experiments). Of course it assumes highly sparse data - if the data is not sparse then it's usually best to use other mechanisms.
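    
    A back-of-the-envelope way to gauge the collision rate, assuming the hash spreads distinct features uniformly at random over the vector indices (an idealisation, not a measurement from my experiments): with `m` distinct features and `n` slots, the expected number of occupied slots is `n * (1 - (1 - 1/n)^m)`, so roughly `m` minus that many features land on an already-used index. The object and method names below are made up for the sketch.
    
    ```scala
    object CollisionEstimate {
      // Expected number of features that land in an already-occupied slot when
      // m distinct features are hashed uniformly at random into n slots.
      def expectedCollisions(m: Long, n: Long): Double = {
        val occupied = n * (1.0 - math.pow(1.0 - 1.0 / n, m.toDouble))
        m - occupied
      }

      def main(args: Array[String]): Unit = {
        val n = 1L << 18 // the default numFeatures in this PR
        for (m <- Seq(1000L, 10000L, 100000L)) {
          val c = expectedCollisions(m, n)
          println(f"m=$m%d n=$n%d expected collisions=$c%.1f (${100 * c / m}%.2f%% of features)")
        }
      }
    }
    ```
    
    For `m` much smaller than `n` this is close to `m^2 / (2n)`, which is why doubling the number of hash bits roughly halves the expected collisions.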




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Merged to master. Thanks all for reviews.




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127679107
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            // numeric values are kept as is, with vector index based on hash of "column_name"
    +            val value = getDouble(row.get(fieldIndex))
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            // string and boolean values are treated as categorical, with an indicator value of 1.0
    +            // and vector index based on hash of "column_name=value"
    +            val value = row.get(fieldIndex).toString
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = outputSchema($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
    +  }
    +
    +  override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    val fields = schema($(inputCols).toSet)
    +    fields.foreach { case fieldSchema =>
    --- End diff --
    
    Again, I think this was left over from a previous version; I will update it.
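
A side note on the hunk above, as a minimal sketch: the later revision quoted in this thread writes the same loop as `fields.foreach { fieldSchema =>`, so the leftover appears to be the `case` keyword. For a lambda that binds a single variable, `case` introduces a pattern match that always succeeds, so both forms behave the same. The object and method names below are hypothetical and exist only for illustration.

```scala
object ForeachCaseSketch {
  // Pattern-matching lambda: `case colName` matches every element,
  // so the `case` keyword adds nothing here.
  def withCase(cols: Seq[String]): Seq[String] = {
    val seen = Seq.newBuilder[String]
    cols.foreach { case colName => seen += colName }
    seen.result()
  }

  // Plain lambda: same traversal, same result, less noise.
  def withoutCase(cols: Seq[String]): Seq[String] = {
    val seen = Seq.newBuilder[String]
    cols.foreach { colName => seen += colName }
    seen.result()
  }
}
```

`case` is only needed when the lambda destructures its argument, for example `{ case (name, idx) => ... }` over a sequence of pairs.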


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Just to clarify:
    
    * If I want to treat an integer-valued column as categorical, I'd have to map those integers to strings first, right? I believe that's one of your bullets above.
    * This effectively one-hot encodes categorical columns, which creates linearly dependent columns since there is no parameter to drop the last category. Maybe there's a good solution, but I don't think we have to address it here. Just wanted to check.
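
To make the first bullet concrete, here is a minimal sketch of how the two column kinds are keyed, following the semantics in the PR description. It assumes `scala.util.hashing.MurmurHash3.stringHash` as a stand-in for Spark's `HashingTF.murmur3Hash`, so the indices will not match what Spark produces. The object, its methods, and the `userId` column are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object CategoricalVsNumericSketch {
  // Stand-in hash (assumption): maps a feature key into [0, numFeatures).
  def index(key: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(key) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Numeric column: the index depends only on the column name,
  // and the feature value is the number itself.
  def numeric(col: String, value: Double, numFeatures: Int): (Int, Double) =
    (index(col, numFeatures), value)

  // String column: the index depends on "column=value",
  // and the feature value is the indicator 1.0.
  def categorical(col: String, value: String, numFeatures: Int): (Int, Double) =
    (index(s"$col=$value", numFeatures), 1.0)
}
```

With this sketch, `numeric("userId", 3.0, 16)` and `numeric("userId", 7.0, 16)` share one index and differ only in value, whereas `categorical("userId", "3", 16)` and `categorical("userId", "7", 16)` both carry the value 1.0 and will usually (barring a collision) land on different indices. Casting the integer column to a string first is what moves it from the first behavior to the second.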


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79961/testReport)** for PR 18513 at commit [`d6a3117`](https://github.com/apache/spark/commit/d6a311748486490215264fbdc0a6f8cb4cf7e6e1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    I've moved the `HashingTF` `numFeatures` param to `sharedParams`, which results in the MiMa failure since the param would now be marked `final`. I can't quite recall what we've done previously in this case: either we accept that it breaks user code (on the grounds that in most cases users should not really have been extending or overriding these params), or we leave it as is.
    
    I'm ok with the latter - `numFeatures` doesn't really need to be a shared param.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r129748089
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,196 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may contain either
    + * numeric or categorical features. Behavior and handling of column data types is as follows:
    + *  -Numeric columns: For numeric features, the hash value of the column name is used to map the
    + *                    feature value to its index in the feature vector. Numeric features are never
    + *                    treated as categorical, even when they are integers. You must explicitly
    + *                    convert numeric columns containing categorical features to strings first.
    + *  -String columns: For categorical features, the hash value of the string "column_name=value"
    + *                   is used to map to the vector index, with an indicator value of `1.0`.
    + *                   Thus, categorical features are "one-hot" encoded
    + *                   (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *  -Boolean columns: Boolean values are treated in the same way as string columns. That is,
    + *                    boolean features are represented as "column_name=true" or "column_name=false",
    + *                    with an indicator value of `1.0`.
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Experimental
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  @Since("2.3.0")
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +    val localInputCols = $(inputCols)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      localInputCols.foreach { colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            // numeric values are kept as is, with vector index based on hash of "column_name"
    +            val value = getDouble(row.get(fieldIndex))
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            // string and boolean values are treated as categorical, with an indicator value of 1.0
    +            // and vector index based on hash of "column_name=value"
    +            val value = row.get(fieldIndex).toString
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = outputSchema($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct($(inputCols).map(col): _*)).as($(outputCol), metadata))
    +  }
    +
    +  @Since("2.3.0")
    +  override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
    +
    +  @Since("2.3.0")
    +  override def transformSchema(schema: StructType): StructType = {
    +    val fields = schema($(inputCols).toSet)
    +    fields.foreach { fieldSchema =>
    +      val dataType = fieldSchema.dataType
    +      val fieldName = fieldSchema.name
    +      require(dataType.isInstanceOf[NumericType] ||
    +        dataType.isInstanceOf[StringType] ||
    +        dataType.isInstanceOf[BooleanType],
    +        s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " +
    +          s"Column $fieldName was $dataType")
    +    }
    +    val attrGroup = new AttributeGroup($(outputCol), $(numFeatures))
    --- End diff --
    
    It seems that we don't store ```Attributes``` in the ```AttributeGroup``` here, whereas ```VectorAssembler``` does, and both ```FeatureHasher``` and ```VectorAssembler``` can be followed directly by ML algorithms. I'd like to confirm: is this intentional? I understand it may be due to performance considerations, and users may not be interested in the attributes of hashed features. We can leave it as is until we find that it affects some scenario.
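
As background for the `transform` body quoted in this diff, the per-row logic can be sketched without Spark. This is a simplified illustration, not the PR's code: it assumes `scala.util.hashing.MurmurHash3.stringHash` in place of `HashingTF.murmur3Hash`, models a null as `None`, and returns a plain `Map` rather than a sparse vector. The object and method names are hypothetical.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object FeatureHashingSketch {
  // Maps any Int (hash codes can be negative) into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hashes one row, given as (columnName, optional value) pairs,
  // into a sparse index -> value map of dimension numFeatures.
  def hashRow(row: Seq[(String, Option[Any])], numFeatures: Int): Map[Int, Double] = {
    val acc = mutable.Map.empty[Int, Double]
    row.foreach {
      case (_, None) => // null (missing) values are skipped
      case (col, Some(v)) =>
        val (key, value) = v match {
          // numeric: key is the column name, value is kept as-is
          case num: java.lang.Number => (col, num.doubleValue())
          // string or boolean: key is "column=value", indicator value 1.0
          case other => (s"$col=$other", 1.0)
        }
        val idx = nonNegativeMod(MurmurHash3.stringHash(key), numFeatures)
        // colliding features are summed into the same slot
        acc(idx) = acc.getOrElse(idx, 0.0) + value
    }
    acc.toMap
  }
}
```

Setting `numFeatures = 1` forces every feature into index 0, which makes the collision rule easy to see: a row with `real = 2.0`, `bool = true`, `string = "foo"` and one missing column yields `Map(0 -> 4.0)`, that is 2.0 + 1.0 + 1.0 with the missing value ignored.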


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127558429
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            // numeric values are kept as is, with vector index based on hash of "column_name"
    +            val value = getDouble(row.get(fieldIndex))
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            // string and boolean values are treated as categorical, with an indicator value of 1.0
    +            // and vector index based on hash of "column_name=value"
    +            val value = row.get(fieldIndex).toString
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = outputSchema($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
    +  }
    +
    +  override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
    --- End diff --
    
    Please add `@Since` tags to all public methods (`copy`, `transformSchema`, `transform`).


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127642851
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    --- End diff --
    
    Ah thanks - this was left over from a previous code version


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127491688
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala ---
    @@ -0,0 +1,193 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types._
    +
    +class FeatureHasherSuite extends SparkFunSuite
    +  with MLlibTestSparkContext
    +  with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +
    +  import HashingTFSuite.murmur3FeatureIdx
    +
    +  implicit val vectorEncoder = ExpressionEncoder[Vector]()
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new FeatureHasher)
    +  }
    +
    +  test("specify input cols using varargs or array") {
    +    val featureHasher1 = new FeatureHasher()
    +      .setInputCols("int", "double", "float", "stringNum", "string")
    +    val featureHasher2 = new FeatureHasher()
    +      .setInputCols(Array("int", "double", "float", "stringNum", "string"))
    +    assert(featureHasher1.getInputCols === featureHasher2.getInputCols)
    +  }
    +
    +  test("feature hashing") {
    +    val df = Seq(
    +      (2.0, true, "1", "foo"),
    +      (3.0, false, "2", "bar")
    +    ).toDF("real", "bool", "stringNum", "string")
    +
    +    val n = 100
    +    val hasher = new FeatureHasher()
    +      .setInputCols("real", "bool", "stringNum", "string")
    +      .setOutputCol("features")
    +      .setNumFeatures(n)
    +    val output = hasher.transform(df)
    +    val attrGroup = AttributeGroup.fromStructField(output.schema("features"))
    +    require(attrGroup.numAttributes === Some(n))
    --- End diff --
    
    make this an `assert`
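
The suggestion above presumably rests on the different failure modes of the two checks. A minimal sketch using only Scala's `Predef` versions (the object and helper names are hypothetical, and inside a ScalaTest suite `assert` resolves to ScalaTest's own version, which reports a test failure rather than a bare `AssertionError`):

```scala
object RequireVsAssert {
  // Hypothetical helper: returns the simple class name of whatever the check
  // throws, or "ok" if the check passes.
  def thrownBy(check: => Unit): String =
    try { check; "ok" }
    catch { case t: Throwable => t.getClass.getSimpleName }

  val numAttributes: Option[Int] = Some(100)

  // `require` is intended for validating a caller's input and throws
  // IllegalArgumentException when the condition is false.
  val viaRequire: String =
    thrownBy(require(numAttributes == Some(16), "unexpected size"))

  // Predef's `assert` states something the code itself expects to hold and
  // throws AssertionError when the condition is false.
  val viaAssert: String =
    thrownBy(assert(numAttributes == Some(16), "unexpected size"))

  // A condition that holds throws nothing.
  val passing: String = thrownBy(assert(numAttributes == Some(100)))
}
```

In a test, a failed `require` reads as "the test was called with bad arguments", whereas a failed `assert` reads as "the expectation under test did not hold", which is closer to what the check on `attrGroup.numAttributes` is meant to express.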




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    @hhbyyh thanks for the comments. Have updated accordingly.
    
    Thought about it and while `numFeatures` could be shared, it's only 2 transformers, so to avoid any binary compat issues I backed out the shared param version.




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127498459
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    --- End diff --
    
    case does nothing here
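
To illustrate the remark above: a `case` with a single variable pattern matches every element, so the lambda behaves the same without it. A small sketch in plain Scala (the object and value names are illustrative only, not taken from the PR):

```scala
object CaseInLambda {
  val cols: Array[String] = Array("real", "bool", "string")

  // A lone variable pattern matches every element, so `case` adds nothing here.
  val withCase: Array[String] = cols.map { case colName => colName.toUpperCase }
  val withoutCase: Array[String] = cols.map { colName => colName.toUpperCase }

  // `case` is useful when the argument is destructured, e.g. a tuple.
  val pairs: Seq[(String, Double)] = Seq(("real", 2.0), ("bool", 1.0))
  val described: Seq[String] = pairs.map { case (name, value) => s"$name=$value" }
}
```

Dropping the keyword where nothing is destructured keeps the closure syntax consistent with the rest of the file.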




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79699/testReport)** for PR 18513 at commit [`990b816`](https://github.com/apache/spark/commit/990b816428f8e5b94c08749650be05a3f52d07db).




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    LGTM!




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127554746
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            // numeric values are kept as is, with vector index based on hash of "column_name"
    +            val value = getDouble(row.get(fieldIndex))
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            // string and boolean values are treated as categorical, with an indicator value of 1.0
    +            // and vector index based on hash of "column_name=value"
    +            val value = row.get(fieldIndex).toString
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = outputSchema($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
    +  }
    +
    +  override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    val fields = schema($(inputCols).toSet)
    +    fields.foreach { case fieldSchema =>
    --- End diff --
    
    case does nothing
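
For readers following the `transform` body quoted in this diff, the per-row logic can be sketched outside Spark roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `scala.util.hashing.MurmurHash3.stringHash` stands in for `OldHashingTF.murmur3Hash` (so the indices will not match Spark's), `nonNegativeMod` is a local stand-in for `Utils.nonNegativeMod`, a row is modelled as name/value pairs with `None` for null, and numeric columns are detected by runtime type rather than by schema as the PR does.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object FeatureHashingSketch {
  // Local stand-in for Utils.nonNegativeMod: maps any Int into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Stand-in hash: not the hash function Spark uses, so indices will differ.
  def index(key: String, numFeatures: Int): Int =
    nonNegativeMod(MurmurHash3.stringHash(key), numFeatures)

  // A "row" is a sequence of (columnName, optionalValue); None models a null cell.
  def hashRow(row: Seq[(String, Option[Any])], numFeatures: Int): Map[Int, Double] = {
    val acc = mutable.Map.empty[Int, Double]
    // Colliding indices have their values summed.
    def add(idx: Int, v: Double): Unit = acc(idx) = acc.getOrElse(idx, 0.0) + v
    row.foreach {
      // Null (missing) values are skipped.
      case (_, None) => ()
      // Numeric: index from hash of the column name, value kept as is.
      case (name, Some(n: java.lang.Number)) =>
        add(index(name, numFeatures), n.doubleValue())
      // Everything else is categorical: index from hash of "name=value", value 1.0.
      case (name, Some(other)) =>
        add(index(s"$name=$other", numFeatures), 1.0)
    }
    acc.toMap
  }
}
```

With `numFeatures = 1` every feature collides at index 0, which makes the summing behaviour easy to see: a row holding one numeric value `2.0` and two categorical values yields a single entry with value `4.0`.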


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79934/
    Test PASSed.




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Created https://issues.apache.org/jira/browse/SPARK-21468 and https://issues.apache.org/jira/browse/SPARK-21469 for docs and Python API.




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126505794
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasNumFeatures, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with HasNumFeatures with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val os = transformSchema(dataset.schema)
    +
    +    val featureCols = $(inputCols).map { colName =>
    +      val field = dataset.schema(colName)
    +      field.dataType match {
    +        case DoubleType | StringType => dataset(field.name)
    +        case _: NumericType | BooleanType => dataset(field.name).cast(DoubleType).alias(field.name)
    +      }
    +    }
    +
    +    val realFields = os.fields.filter(f => f.dataType.isInstanceOf[NumericType]).map(_.name).toSet
    +
    +    def hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            val value = row.getDouble(fieldIndex)
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            val value = row.getString(fieldIndex)
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = os($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct(featureCols: _*)).as($(outputCol), metadata))
    +  }
    +
    +  override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    val fields = schema($(inputCols).toSet)
    +    require(fields.map(_.dataType).forall { case dt =>
    +      dt.isInstanceOf[NumericType] || dt.isInstanceOf[StringType]
    --- End diff --
    
    Please add an error message to this `require`.
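
As a rough illustration of the request above, `require` accepts a second argument that is included in the exception text. The sketch below uses a hypothetical, simplified stand-in for the schema check (column names paired with type names as plain strings); the message wording is invented for the example and is not the text used in the PR.

```scala
object RequireWithMessage {
  // Hypothetical stand-in for the type check in transformSchema:
  // each column is given as (columnName, typeName).
  def validate(columns: Seq[(String, String)]): Unit = {
    val supported = Set("numeric", "string")
    columns.foreach { case (name, typeName) =>
      // The second argument is evaluated only on failure and is appended
      // to the "requirement failed: " prefix of the exception message.
      require(supported.contains(typeName),
        s"Column $name has unsupported type $typeName; expected numeric or string.")
    }
  }

  // Returns the failure message, or None when all columns are accepted.
  def errorFor(columns: Seq[(String, String)]): Option[String] =
    try { validate(columns); None }
    catch { case e: IllegalArgumentException => Some(e.getMessage) }
}
```

Without the message, a user passing an unsupported column would see only `requirement failed`, with no hint as to which column caused it.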




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79092/testReport)** for PR 18513 at commit [`9edb3bd`](https://github.com/apache/spark/commit/9edb3bda8cbc4e00f05b91718249edf2750fc028).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127589970
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    --- End diff --
    
    maybe `val getDouble...`
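
A minimal sketch of the two forms being compared here, for readers unfamiliar with the distinction. The object name `DefVsVal` and the two member names are hypothetical; the body mirrors the `getDouble` helper from the quoted diff. The thread itself does not state a rationale for preferring one form, so this only shows that both behave the same for callers.

```scala
object DefVsVal {
  // Method form, as written in the quoted diff.
  def getDoubleDef(x: Any): Double = x match {
    case n: java.lang.Number => n.doubleValue()
    // Throws ClassCastException if the value is not a Double.
    case other => other.asInstanceOf[Double]
  }

  // Function-value form, as the review comment suggests.
  val getDoubleVal: Any => Double = {
    case n: java.lang.Number => n.doubleValue()
    case other => other.asInstanceOf[Double]
  }
}
```

Both `DefVsVal.getDoubleDef(3)` and `DefVsVal.getDoubleVal(3)` convert the boxed integer to a `Double`; the `val` form is already a function object, while the `def` form is converted to one when passed where a function is expected.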




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127590053
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            // numeric values are kept as is, with vector index based on hash of "column_name"
    +            val value = getDouble(row.get(fieldIndex))
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            // string and boolean values are treated as categorical, with an indicator value of 1.0
    +            // and vector index based on hash of "column_name=value"
    +            val value = row.get(fieldIndex).toString
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = outputSchema($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct($(inputCols).map(col(_)): _*)).as($(outputCol), metadata))
    --- End diff --
    
    .map(col)
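
A short sketch of what this suggestion amounts to: passing the method name directly instead of wrapping it in a placeholder lambda. To keep the example runnable without Spark, `Column` and `col` below are hypothetical stand-ins for `org.apache.spark.sql.Column` and `org.apache.spark.sql.functions.col`; the object name `EtaExpansion` is also an assumption.

```scala
object EtaExpansion {
  // Hypothetical stand-ins so the sketch runs without a Spark dependency.
  final case class Column(name: String)
  def col(name: String): Column = Column(name)

  val inputCols: Array[String] = Array("real", "bool", "stringNum", "string")

  // Form used in the quoted diff: an explicit placeholder lambda.
  val explicit: Array[Column] = inputCols.map(col(_))

  // Form suggested in the comment: the compiler eta-expands the method.
  val etaExpanded: Array[Column] = inputCols.map(col)
}
```

The two arrays hold the same elements; the second form is only shorter to read.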




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79961/testReport)** for PR 18513 at commit [`d6a3117`](https://github.com/apache/spark/commit/d6a311748486490215264fbdc0a6f8cb4cf7e6e1).




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Thanks @sethah @hhbyyh for the review. I updated the behavior doc string as suggested. 
    
    Any other comments? cc @srowen @jkbradley @yanboliang
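
For readers following the thread, the behavior described in the doc string quoted in the diffs can be sketched in plain Scala as below. This is an illustration under stated assumptions, not the PR's implementation: `scala.util.hashing.MurmurHash3.stringHash` stands in for the hash function used by the patch (so indices will differ from Spark's), a row is modeled as a sequence of `(columnName, Option[value])` pairs with `None` playing the role of null, and the names `FeatureHashingSketch` and `hashRow` are hypothetical.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object FeatureHashingSketch {
  // Maps a possibly negative hash to an index in [0, n).
  def nonNegativeMod(x: Int, n: Int): Int = {
    val raw = x % n
    if (raw < 0) raw + n else raw
  }

  // Returns a sparse representation: vector index -> accumulated value.
  def hashRow(row: Seq[(String, Option[Any])], numFeatures: Int): Map[Int, Double] = {
    val acc = mutable.Map.empty[Int, Double]
    row.foreach {
      case (name, Some(v: java.lang.Number)) =>
        // Numeric column: index from hash of the column name, value kept as is.
        val idx = nonNegativeMod(MurmurHash3.stringHash(name), numFeatures)
        acc(idx) = acc.getOrElse(idx, 0.0) + v.doubleValue()
      case (name, Some(other)) =>
        // String or boolean column: index from hash of "column_name=value", indicator 1.0.
        val idx = nonNegativeMod(MurmurHash3.stringHash(s"$name=$other"), numFeatures)
        acc(idx) = acc.getOrElse(idx, 0.0) + 1.0
      case (_, None) =>
        // Null (missing) values are ignored.
    }
    acc.toMap
  }
}
```

Values that land on the same index are summed, which matches the collision rule stated in the PR description.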



[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18513




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127642922
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    --- End diff --
    
    why?




[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #80724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80724/testReport)** for PR 18513 at commit [`d6a3117`](https://github.com/apache/spark/commit/d6a311748486490215264fbdc0a6f8cb4cf7e6e1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r127555147
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      $(inputCols).foreach { case colName =>
    --- End diff --
    
    Also, I think using `$(inputCols)` inside the udf will serialize the entire transformer object along with the closure. Maybe you can assign it to a local val before the udf.
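
A rough sketch of the concern raised here, using plain Java serialization rather than Spark. `Holder`, its `unrelated` field, and `ClosureSize.serializedSize` are hypothetical names introduced only for illustration; `Holder` stands in for a transformer whose closure refers to one of its members.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a transformer that also holds a large, unrelated field.
class Holder(val inputCols: Array[String]) extends Serializable {
  val unrelated: Array[Byte] = new Array[Byte](1 << 20)

  // Refers to a member, so the closure captures `this` (including `unrelated`).
  def capturesThis: String => Boolean = name => inputCols.contains(name)

  // Copies the member to a local val first, so the closure captures only that array.
  def capturesLocal: String => Boolean = {
    val cols = inputCols
    name => cols.contains(name)
  }
}

object ClosureSize {
  // Number of bytes produced by Java serialization of the given object.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }
}
```

Comparing `ClosureSize.serializedSize(h.capturesThis)` with `ClosureSize.serializedSize(h.capturesLocal)` for some `h = new Holder(...)` shows the difference: the first drags the whole object along, the second only the column names.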




[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r129801204
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,196 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may contain either
    + * numeric or categorical features. Behavior and handling of column data types is as follows:
    + *  -Numeric columns: For numeric features, the hash value of the column name is used to map the
    + *                    feature value to its index in the feature vector. Numeric features are never
    + *                    treated as categorical, even when they are integers. You must explicitly
    + *                    convert numeric columns containing categorical features to strings first.
    + *  -String columns: For categorical features, the hash value of the string "column_name=value"
    + *                   is used to map to the vector index, with an indicator value of `1.0`.
    + *                   Thus, categorical features are "one-hot" encoded
    + *                   (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *  -Boolean columns: Boolean values are treated in the same way as string columns. That is,
    + *                    boolean features are represented as "column_name=true" or "column_name=false",
    + *                    with an indicator value of `1.0`.
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Experimental
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  @Since("2.3.0")
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +    val localInputCols = $(inputCols)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      localInputCols.foreach { colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            // numeric values are kept as is, with vector index based on hash of "column_name"
    +            val value = getDouble(row.get(fieldIndex))
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            // string and boolean values are treated as categorical, with an indicator value of 1.0
    +            // and vector index based on hash of "column_name=value"
    +            val value = row.get(fieldIndex).toString
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = outputSchema($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct($(inputCols).map(col): _*)).as($(outputCol), metadata))
    +  }
    +
    +  @Since("2.3.0")
    +  override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
    +
    +  @Since("2.3.0")
    +  override def transformSchema(schema: StructType): StructType = {
    +    val fields = schema($(inputCols).toSet)
    +    fields.foreach { fieldSchema =>
    +      val dataType = fieldSchema.dataType
    +      val fieldName = fieldSchema.name
    +      require(dataType.isInstanceOf[NumericType] ||
    +        dataType.isInstanceOf[StringType] ||
    +        dataType.isInstanceOf[BooleanType],
    +        s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " +
    +          s"Column $fieldName was $dataType")
    +    }
    +    val attrGroup = new AttributeGroup($(outputCol), $(numFeatures))
    --- End diff --
    
    Feature hashing doesn't keep the feature -> index mapping, for memory efficiency, so by extension it won't keep attribute info. This is by design: the tradeoff is speed and efficiency versus being unable to do the reverse mapping (or to know the cardinality of each feature, for example).
    
    If users want to keep the mapping and attribute info, they can of course use one-hot encoding and `VectorAssembler` instead.
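
A minimal plain-Scala sketch of why no mapping survives: the index is computed directly from the hash, and colliding features are simply summed. `HashingTrickSketch` and its helpers are hypothetical names, and `scala.util.hashing.MurmurHash3.stringHash` is only a stand-in for the PR's `OldHashingTF.murmur3Hash`, so the indices will not match Spark's.

```scala
import scala.util.hashing.MurmurHash3

object HashingTrickSketch {
  // Non-negative remainder, so negative hash codes still map into [0, n).
  def nonNegativeMod(x: Int, n: Int): Int = {
    val r = x % n
    if (r < 0) r + n else r
  }

  // Maps a feature name straight to a vector index; nothing is remembered.
  def index(feature: String, numFeatures: Int): Int =
    nonNegativeMod(MurmurHash3.stringHash(feature), numFeatures)

  // Hashes (name, value) pairs into a sparse index -> value map,
  // summing values whose names land on the same index.
  def hashFeatures(features: Seq[(String, Double)], numFeatures: Int): Map[Int, Double] =
    features.foldLeft(Map.empty[Int, Double]) { case (acc, (name, value)) =>
      val idx = index(name, numFeatures)
      acc.updated(idx, acc.getOrElse(idx, 0.0) + value)
    }
}
```

With `numFeatures = 1` every feature collides at index 0, so `hashFeatures(Seq("a" -> 1.0, "b" -> 2.0), 1)` yields a single entry with value 3.0. Nothing in the result says which names contributed to it.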


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    jenkins retest this please


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79934 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79934/testReport)** for PR 18513 at commit [`a91b53f`](https://github.com/apache/spark/commit/a91b53f7482b8a05734e77f42491a70f1e3e77f1).


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80724/
    Test PASSed.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r128786226
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,189 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    --- End diff --
    
    It might be good to make the behavior for each type of column clearer here, specifically for numeric columns that are meant to be categories. Something like:
    
    ````scala
    /**
     * Behavior
     *  -Numeric columns: For numeric features, the hash value of the column name is used to map the
     *                    feature value to its index in the feature vector. Numeric features are never
     *                    treated as categorical, even when they are integers. You must convert
     *                    categorical columns to strings first.
     *  -String columns: ...
     *  -Boolean columns: ...
     */
    ````
    
    Anyway, this is a very minor suggestion and I think it's also ok to leave as is.
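
The distinction in the suggested doc comment can be illustrated with a small plain-Scala sketch. `ColumnEncodingSketch.encode` and the `zip` column are hypothetical. The helper only mirrors the rule described above (numeric values are kept under the column name, everything else becomes a `column=value` indicator) and stops short of the hashing step.

```scala
object ColumnEncodingSketch {
  // Hypothetical helper mirroring the rule in the suggested doc comment:
  // numeric values keep their magnitude under the column name, while
  // strings and booleans become a "column=value" indicator with value 1.0.
  def encode(column: String, value: Any): (String, Double) = value match {
    case n: java.lang.Number => (column, n.doubleValue())
    case other => (s"$column=$other", 1.0)
  }
}
```

Under these assumptions an integer zip code left as a number is encoded as one real-valued feature, while the same value converted to a string becomes its own category, which is why the conversion has to be explicit.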


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126508854
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala ---
    @@ -0,0 +1,140 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.{Vector, Vectors}
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.ml.util.TestingUtils._
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +
    +class FeatureHasherSuite extends SparkFunSuite
    +  with MLlibTestSparkContext
    +  with DefaultReadWriteTest {
    +
    +  import testImplicits._
    +  import HashingTFSuite.murmur3FeatureIdx
    +
    +  implicit val vectorEncoder = ExpressionEncoder[Vector]()
    +
    +  test("params") {
    --- End diff --
    
    Maybe add a test for a Unicode column name (e.g. Chinese, "中文").
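
A rough idea of what such a check could assert, written in plain Scala rather than against the real suite. `UnicodeNameSketch.index` is a hypothetical helper, and `MurmurHash3.stringHash` stands in for the hash function used by the PR, so this only shows the shape of the test, not the indices Spark would produce.

```scala
import scala.util.hashing.MurmurHash3

object UnicodeNameSketch {
  // Same index rule as a hashing transformer: hash the name, then take a
  // non-negative remainder so the result lies in [0, numFeatures).
  def index(feature: String, numFeatures: Int): Int = {
    val r = MurmurHash3.stringHash(feature) % numFeatures
    if (r < 0) r + numFeatures else r
  }
}
```

A test would then check that a feature such as `"中文=foo"` maps to a valid index and that the mapping is stable across calls.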


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79558/
    Test PASSed.


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79961/
    Test PASSed.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126503993
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasNumFeatures, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with HasNumFeatures with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    --- End diff --
    
    Users need a way to find out the default value of `numFeatures`.
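
One way to read this request, sketched without Spark's `Param` machinery: document the default and expose a getter. `NumFeaturesDefaultSketch` and `HasherParams` are hypothetical. The value `1 << 18` is the default that appears in a later revision of this PR.

```scala
object NumFeaturesDefaultSketch {
  // Hypothetical minimal holder; Spark's real Param machinery is not modeled.
  class HasherParams {
    private var numFeatures: Int = HasherParams.DefaultNumFeatures

    // Getter, so callers can inspect the value currently in effect.
    def getNumFeatures: Int = numFeatures

    def setNumFeatures(value: Int): this.type = {
      require(value > 0, "number of features must be > 0")
      numFeatures = value
      this
    }
  }

  object HasherParams {
    // 2^18, the default used in a later revision of this PR.
    val DefaultNumFeatures: Int = 1 << 18
  }
}
```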


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79699/
    Test PASSed.


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **[Test build #79699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79699/testReport)** for PR 18513 at commit [`990b816`](https://github.com/apache/spark/commit/990b816428f8e5b94c08749650be05a3f52d07db).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126503728
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasNumFeatures, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +
    --- End diff --
    
    Please add a class-level doc comment here.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r129839492
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,196 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may contain either
    + * numeric or categorical features. Behavior and handling of column data types is as follows:
    + *  -Numeric columns: For numeric features, the hash value of the column name is used to map the
    + *                    feature value to its index in the feature vector. Numeric features are never
    + *                    treated as categorical, even when they are integers. You must explicitly
    + *                    convert numeric columns containing categorical features to strings first.
    + *  -String columns: For categorical features, the hash value of the string "column_name=value"
    + *                   is used to map to the vector index, with an indicator value of `1.0`.
    + *                   Thus, categorical features are "one-hot" encoded
    + *                   (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *  -Boolean columns: Boolean values are treated in the same way as string columns. That is,
    + *                    boolean features are represented as "column_name=true" or "column_name=false",
    + *                    with an indicator value of `1.0`.
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Experimental
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  @Since("2.3.0")
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +    val localInputCols = $(inputCols)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    +      x match {
    +        case n: java.lang.Number =>
    +          n.doubleValue()
    +        case other =>
    +          // will throw ClassCastException if it cannot be cast, as would row.getDouble
    +          other.asInstanceOf[Double]
    +      }
    +    }
    +
    +    val hashFeatures = udf { row: Row =>
    +      val map = new OpenHashMap[Int, Double]()
    +      localInputCols.foreach { colName =>
    +        val fieldIndex = row.fieldIndex(colName)
    +        if (!row.isNullAt(fieldIndex)) {
    +          val (rawIdx, value) = if (realFields(colName)) {
    +            // numeric values are kept as is, with vector index based on hash of "column_name"
    +            val value = getDouble(row.get(fieldIndex))
    +            val hash = hashFunc(colName)
    +            (hash, value)
    +          } else {
    +            // string and boolean values are treated as categorical, with an indicator value of 1.0
    +            // and vector index based on hash of "column_name=value"
    +            val value = row.get(fieldIndex).toString
    +            val fieldName = s"$colName=$value"
    +            val hash = hashFunc(fieldName)
    +            (hash, 1.0)
    +          }
    +          val idx = Utils.nonNegativeMod(rawIdx, n)
    +          map.changeValue(idx, value, v => v + value)
    +        }
    +      }
    +      Vectors.sparse(n, map.toSeq)
    +    }
    +
    +    val metadata = outputSchema($(outputCol)).metadata
    +    dataset.select(
    +      col("*"),
    +      hashFeatures(struct($(inputCols).map(col): _*)).as($(outputCol), metadata))
    +  }
    +
    +  @Since("2.3.0")
    +  override def copy(extra: ParamMap): FeatureHasher = defaultCopy(extra)
    +
    +  @Since("2.3.0")
    +  override def transformSchema(schema: StructType): StructType = {
    +    val fields = schema($(inputCols).toSet)
    +    fields.foreach { fieldSchema =>
    +      val dataType = fieldSchema.dataType
    +      val fieldName = fieldSchema.name
    +      require(dataType.isInstanceOf[NumericType] ||
    +        dataType.isInstanceOf[StringType] ||
    +        dataType.isInstanceOf[BooleanType],
    +        s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " +
    +          s"Column $fieldName was $dataType")
    +    }
    +    val attrGroup = new AttributeGroup($(outputCol), $(numFeatures))
    --- End diff --
    
    @MLnick Thanks for clarifying. 
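
The Scaladoc in the diff above describes the hashing rules: numeric columns are indexed by the hash of the column name and keep their value, string and boolean columns are indexed by the hash of "column_name=value" with an indicator of 1.0, nulls are skipped, and colliding indices are summed. A minimal sketch of those rules in plain Scala, without Spark, might look like the following. It assumes `scala.util.hashing.MurmurHash3.stringHash` as a stand-in for the `murmur3Hash` used in the PR, so the indices will not match Spark's output; `FeatureHashingSketch` and `hashRow` are hypothetical names used only for illustration.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object FeatureHashingSketch {
  // Maps any Int hash into [0, mod), the role Utils.nonNegativeMod plays in the diff.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical helper: a row is a sequence of (column name, optional value) pairs.
  def hashRow(row: Seq[(String, Option[Any])], numFeatures: Int): Map[Int, Double] = {
    val acc = mutable.Map.empty[Int, Double]
    row.foreach {
      case (_, None) => // null (missing) values are ignored
      case (name, Some(v)) =>
        val (key, value) = v match {
          // numeric: index from the column name, value kept as is
          case n: java.lang.Number => (name, n.doubleValue())
          // string or boolean: index from "column_name=value", indicator 1.0
          case other => (s"$name=$other", 1.0)
        }
        // Assumption: the stdlib hash stands in for Spark's murmur3Hash.
        val idx = nonNegativeMod(MurmurHash3.stringHash(key), numFeatures)
        // colliding indices are summed
        acc(idx) = acc.getOrElse(idx, 0.0) + value
    }
    acc.toMap
  }
}
```

Whether or not two features collide, the values in the resulting map sum to the numeric values plus one per non-null categorical column, which is an easy property to check in a test.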


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Let's make sure to create doc and Python JIRAs before this gets merged, by the way.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126947021
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasNumFeatures, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with HasNumFeatures with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val os = transformSchema(dataset.schema)
    +
    +    val featureCols = $(inputCols).map { colName =>
    +      val field = dataset.schema(colName)
    +      field.dataType match {
    +        case DoubleType | StringType => dataset(field.name)
    +        case _: NumericType | BooleanType => dataset(field.name).cast(DoubleType).alias(field.name)
    --- End diff --
    
    Fair point, have updated to handle this.


[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126947075
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasNumFeatures, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +
    --- End diff --
    
    Yup, forgot that!


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126507840
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasNumFeatures, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with HasNumFeatures with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val os = transformSchema(dataset.schema)
    +
    +    val featureCols = $(inputCols).map { colName =>
    +      val field = dataset.schema(colName)
    +      field.dataType match {
    +        case DoubleType | StringType => dataset(field.name)
    +        case _: NumericType | BooleanType => dataset(field.name).cast(DoubleType).alias(field.name)
    --- End diff --
    
    Is it possible to avoid casting to Double? One key goal of feature hashing is to reduce memory usage.
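
One way to avoid a column-level `cast(DoubleType)` is to convert each value at the point of use, which is roughly what the `getDouble` helper in the later revision of this diff does. The sketch below is plain Scala with hypothetical names (`NumericValueSketch`, `toDouble`). It only illustrates the per-value conversion and makes no claim about how much memory this saves in practice.

```scala
object NumericValueSketch {
  // Converts any boxed numeric value (Int, Long, Float, Double, ...) to Double.
  // Non-numeric input is rejected explicitly; the helper in the diff instead
  // falls back to asInstanceOf[Double], which throws ClassCastException.
  def toDouble(x: Any): Double = x match {
    case n: java.lang.Number => n.doubleValue()
    case other =>
      throw new IllegalArgumentException(s"Expected a numeric value but got: $other")
  }
}
```

With this approach the input columns keep their original types, and only the single value being written into the feature vector is widened.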


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r128043915
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,185 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +/**
    + * Feature hashing projects a set of categorical or numerical features into a feature vector of
    + * specified dimension (typically substantially smaller than that of the original feature
    + * space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    + * to map features to indices in the feature vector.
    + *
    + * The [[FeatureHasher]] transformer operates on multiple columns. Each column may be numeric
    + * (representing a real feature) or string (representing a categorical feature). Boolean columns
    + * are also supported, and treated as categorical features. For numeric features, the hash value of
    + * the column name is used to map the feature value to its index in the feature vector.
    + * For categorical features, the hash value of the string "column_name=value" is used to map to the
    + * vector index, with an indicator value of `1.0`. Thus, categorical features are "one-hot" encoded
    + * (similarly to using [[OneHotEncoder]] with `dropLast=false`).
    + *
    + * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    + *
    + * Since a simple modulo is used to transform the hash function to a vector index,
    + * it is advisable to use a power of two as the numFeatures parameter;
    + * otherwise the features will not be mapped evenly to the vector indices.
    + *
    + * {{{
    + *   val df = Seq(
    + *    (2.0, true, "1", "foo"),
    + *    (3.0, false, "2", "bar")
    + *   ).toDF("real", "bool", "stringNum", "string")
    + *
    + *   val hasher = new FeatureHasher()
    + *    .setInputCols("real", "bool", "stringNum", "string")
    + *    .setOutputCol("features")
    + *
    + *   hasher.transform(df).show()
    + *
    + *   +----+-----+---------+------+--------------------+
    + *   |real| bool|stringNum|string|            features|
    + *   +----+-----+---------+------+--------------------+
    + *   | 2.0| true|        1|   foo|(262144,[51871,63...|
    + *   | 3.0|false|        2|   bar|(262144,[6031,806...|
    + *   +----+-----+---------+------+--------------------+
    + * }}}
    + */
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /**
    +   * Number of features. Should be greater than 0.
    +   * (default = 2^18^)
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val numFeatures = new IntParam(this, "numFeatures", "number of features (> 0)",
    +    ParamValidators.gt(0))
    +
    +  setDefault(numFeatures -> (1 << 18))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNumFeatures: Int = $(numFeatures)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(values: String*): this.type = setInputCols(values.toArray)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  override def transform(dataset: Dataset[_]): DataFrame = {
    +    val hashFunc: Any => Int = OldHashingTF.murmur3Hash
    +    val n = $(numFeatures)
    +
    +    val outputSchema = transformSchema(dataset.schema)
    +    val realFields = outputSchema.fields.filter { f =>
    +      f.dataType.isInstanceOf[NumericType]
    +    }.map(_.name).toSet
    +
    +    def getDouble(x: Any): Double = {
    --- End diff --
    
    I read about it here, but never tested it:
    https://stackoverflow.com/questions/18887264/what-is-the-difference-between-def-and-val-to-define-a-function
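
For readers unfamiliar with the distinction discussed at that link, here is a small sketch in plain Scala. A `def` that returns a function re-evaluates its body on every reference, while a `val` evaluates its right-hand side once. The counters and the names `DefVsValSketch`, `viaDef` and `viaVal` are hypothetical and exist only to make the difference observable. This says nothing about whether the difference matters for the local `getDouble` method in the diff, which is an ordinary method rather than a function-returning `def`.

```scala
object DefVsValSketch {
  var defBodies = 0
  var valBodies = 0

  // `def`: the body runs on every reference, building a new function object each time.
  def viaDef: Int => Boolean = {
    defBodies += 1
    (i: Int) => i % 2 == 0
  }

  // `val`: the right-hand side runs once, when the enclosing object is initialized.
  val viaVal: Int => Boolean = {
    valBodies += 1
    (i: Int) => i % 2 == 0
  }
}
```

Calling `viaDef` twice increments `defBodies` twice, whereas any number of calls to `viaVal` leaves `valBodies` at one.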


[GitHub] spark pull request #18513: [SPARK-13969][ML] Add FeatureHasher transformer

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18513#discussion_r126947047
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala ---
    @@ -0,0 +1,119 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.Since
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.attribute.AttributeGroup
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
    +import org.apache.spark.ml.param.shared.{HasInputCols, HasNumFeatures, HasOutputCol}
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils}
    +import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +import org.apache.spark.util.collection.OpenHashMap
    +
    +
    +@Since("2.3.0")
    +class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transformer
    +  with HasInputCols with HasOutputCol with HasNumFeatures with DefaultParamsWritable {
    +
    +  @Since("2.3.0")
    +  def this() = this(Identifiable.randomUID("featureHasher"))
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
    --- End diff --
    
    Not sure what you mean exactly.

