Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/08/17 10:25:26 UTC

[GitHub] spark pull request #18970: [SPARK-21468][PYSPARK][ML] Python API for Feature...

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/18970

    [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

    Add Python API for `FeatureHasher` transformer.
    
    ## How was this patch tested?
    
    New doc test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark SPARK-21468-pyspark-hasher

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18970.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18970
    
----
commit 4ebd41e361f5afa753e73df0eac2cba3a92c1960
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-08-16T14:37:36Z

    Fix Scaladoc example code

commit 3ead289af15409d5ba55dd18322fd56cf7faef17
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-08-17T10:23:08Z

    Add Python API for FeatureHasher

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18970: [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18970
  
    **[Test build #80784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80784/testReport)** for PR 18970 at commit [`3ead289`](https://github.com/apache/spark/commit/3ead289af15409d5ba55dd18322fd56cf7faef17).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class FeatureHasher(JavaTransformer, HasInputCols, HasOutputCol, HasNumFeatures, JavaMLReadable,`


[GitHub] spark pull request #18970: [SPARK-21468][PYSPARK][ML] Python API for Feature...

Posted by BryanCutler <gi...@git.apache.org>.

GitHub user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18970#discussion_r133791775
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -697,6 +698,82 @@ def getScalingVec(self):
     
     
     @inherit_doc
    +class FeatureHasher(JavaTransformer, HasInputCols, HasOutputCol, HasNumFeatures, JavaMLReadable,
    +                    JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Feature hashing projects a set of categorical or numerical features into a feature vector of
    +    specified dimension (typically substantially smaller than that of the original feature
    +    space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing)
    +    to map features to indices in the feature vector.
    +
    +    The FeatureHasher transformer operates on multiple columns. Each column may contain either
    +    numeric or categorical features. Behavior and handling of column data types are as follows:
    +
    +    * Numeric columns:
    +        For numeric features, the hash value of the column name is used to map the
    +        feature value to its index in the feature vector. Numeric features are never
    +        treated as categorical, even when they are integers. You must explicitly
    +        convert numeric columns containing categorical features to strings first.
    +
    +    * String columns:
    +        For categorical features, the hash value of the string "column_name=value"
    +        is used to map to the vector index, with an indicator value of `1.0`.
    +        Thus, categorical features are "one-hot" encoded
    +        (similarly to using :py:class:`OneHotEncoder` with `dropLast=False`).
    +
    +    * Boolean columns:
    +        Boolean values are treated in the same way as string columns. That is,
    +        boolean features are represented as "column_name=true" or "column_name=false",
    +        with an indicator value of `1.0`.
    +
    +    Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    +
    +    Since a simple modulo is used to transform the hash value to a vector index,
    +    it is advisable to use a power of two as the `numFeatures` parameter;
    +    otherwise the features will not be mapped evenly to the vector indices.
    +
    +    >>> data = [(2.0, True, "1", "foo"), (3.0, False, "2", "bar")]
    +    >>> cols = ["real", "bool", "stringNum", "string"]
    +    >>> df = spark.createDataFrame(data, cols)
    +    >>> hasher = FeatureHasher(inputCols=cols, outputCol="features")
    +    >>> hasher.transform(df).head().features
    +    SparseVector(262144, {51871: 1.0, 63643: 1.0, 174475: 2.0, 253195: 1.0})
    +    >>> hasherPath = temp_path + "/hasher"
    +    >>> hasher.save(hasherPath)
    +    >>> loadedHasher = FeatureHasher.load(hasherPath)
    +    >>> loadedHasher.getNumFeatures() == hasher.getNumFeatures()
    +    True
    +    >>> loadedHasher.transform(df).head().features == hasher.transform(df).head().features
    +    True
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    @keyword_only
    +    def __init__(self, numFeatures=1 << 18, inputCols=None, outputCol=None):
    +        """
    +        __init__(self, numFeatures=1 << 18, inputCols=None, outputCol=None)
    +        """
    +        super(FeatureHasher, self).__init__()
    +        self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.FeatureHasher", self.uid)
    +        self._setDefault(numFeatures=1 << 18)
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.3.0")
    +    def setParams(self, numFeatures=1 << 18, inputCols=None, outputCol=None):
    +        """
    +        setParams(self, numFeatures=1 << 18, inputCols=None, outputCol=None)
    +        Sets params for this FeatureHasher.
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +
    --- End diff --
    
    Never mind, I forgot it's in the shared param `HasNumFeatures`.


[GitHub] spark issue #18970: [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

Posted by holdenk <gi...@git.apache.org>.

GitHub user holdenk commented on the issue:

    https://github.com/apache/spark/pull/18970
  
    LGTM


[GitHub] spark pull request #18970: [SPARK-21468][PYSPARK][ML] Python API for Feature...

Posted by BryanCutler <gi...@git.apache.org>.

GitHub user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18970#discussion_r133790849
  
    
    Should there be a `getNumFeatures()` method to return the param?
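
As the follow-up comment in this thread notes, the getter comes from the shared param `HasNumFeatures`, so the class does not need to define its own. A minimal sketch of that mixin pattern follows. The classes are simplified stand-ins written for illustration in plain Python (no Spark); `Params` here and `SketchHasher` are hypothetical and are not the actual `pyspark.ml` implementation.

```python
# Simplified stand-ins for illustration only; not the real pyspark classes.


class Params(object):
    """Minimal param store: explicitly set values override defaults."""

    def __init__(self):
        self._defaults = {}
        self._values = {}

    def _setDefault(self, **kwargs):
        self._defaults.update(kwargs)
        return self

    def _set(self, **kwargs):
        # Skip params left as None, i.e. not supplied by the caller.
        for name, value in kwargs.items():
            if value is not None:
                self._values[name] = value
        return self

    def getOrDefault(self, name):
        if name in self._values:
            return self._values[name]
        return self._defaults[name]


class HasNumFeatures(Params):
    """Mixin that owns the numFeatures getter."""

    def getNumFeatures(self):
        return self.getOrDefault("numFeatures")


class SketchHasher(HasNumFeatures):
    """Hypothetical transformer; inherits getNumFeatures() from the mixin."""

    def __init__(self, numFeatures=None):
        super(SketchHasher, self).__init__()
        self._setDefault(numFeatures=1 << 18)
        self._set(numFeatures=numFeatures)
```

With this layout, `SketchHasher().getNumFeatures()` falls back to the default of `1 << 18`, while `SketchHasher(numFeatures=1024).getNumFeatures()` returns the explicitly set value.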


[GitHub] spark issue #18970: [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

Posted by MLnick <gi...@git.apache.org>.

GitHub user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18970
  
    Merged to master. Thanks for the review!


[GitHub] spark issue #18970: [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

Posted by AmplabJenkins <gi...@git.apache.org>.

GitHub user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18970
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #18970: [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18970
  
    **[Test build #80784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80784/testReport)** for PR 18970 at commit [`3ead289`](https://github.com/apache/spark/commit/3ead289af15409d5ba55dd18322fd56cf7faef17).


[GitHub] spark issue #18970: [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher

Posted by AmplabJenkins <gi...@git.apache.org>.

GitHub user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18970
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80784/


[GitHub] spark pull request #18970: [SPARK-21468][PYSPARK][ML] Python API for Feature...

Posted by asfgit <gi...@git.apache.org>.

GitHub user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18970

