You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ajaysaini725 <gi...@git.apache.org> on 2017/07/27 05:06:11 UTC

[GitHub] spark pull request #18746: Implemented UnaryTransformer in Python

GitHub user ajaysaini725 opened a pull request:

    https://github.com/apache/spark/pull/18746

    Implemented UnaryTransformer in Python

    ## What changes were proposed in this pull request?
    
    Implemented UnaryTransformer in Python
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    UnaryTransformer is an abstract class with only two pre-implemented functions provided. It wasn't clear how to test this without specifically instantiating a UnaryTransformer so no unit tests were provided. In Scala, UnaryTransformer is not tested so as long as the Python implementation follows the method used in Scala, it should be fine. It would be good to include examples in the docs of specific instances of UnaryTransformer. When combined with the change in: https://github.com/apache/spark/pull/18742, this patch makes it possible to implement custom Python-only UnaryTransformers and persist them in Python (i.e. without any Java dependence).


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajaysaini725/spark AddPythonUnaryTransformer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18746.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18746
    
----
commit 10ab7bc336c00fd6e010cd4301d03eed31b0c95c
Author: Ajay Saini <aj...@gmail.com>
Date:   2017-07-10T21:06:06Z

    Started

commit 4fe94202819024ad82661004f59922f88cc3b7c8
Author: Ajay Saini <aj...@gmail.com>
Date:   2017-07-10T21:06:06Z

    Started

commit 7d25c70d4882daea0a7d6dcc2318c06682c00eca
Author: Ajay Saini <aj...@gmail.com>
Date:   2017-07-27T05:01:39Z

    Implemented Unary Transforme in Python.

commit 960de95330f31c46bdfde8f3f7d634d0edac18d4
Author: Ajay Saini <aj...@gmail.com>
Date:   2017-07-27T05:02:50Z

    Fixed merge conflict.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131222222
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,53 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +    """
    +    Abstract class for transformers that tae one input column, apply a transoformation to it,
    --- End diff --
    
    typo: tae


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [SPARK-21633][ML][Python] UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #80228 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80228/testReport)** for PR 18746 at commit [`a30ae39`](https://github.com/apache/spark/commit/a30ae399923f2d4cef2cc00d88cae85fa5a2dde7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [SPARK-21633][ML][Python] UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #80228 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80228/testReport)** for PR 18746 at commit [`a30ae39`](https://github.com/apache/spark/commit/a30ae399923f2d4cef2cc00d88cae85fa5a2dde7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [SPARK-21633][ML][Python] UnaryTransformer in Pyt...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18746


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80183/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131258476
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1957,6 +1987,24 @@ def test_chisquaretest(self):
             self.assertTrue(all(field in fieldNames for field in expectedFields))
     
     
    +class UnaryTransformerTests(SparkSessionTestCase):
    +
    +    def test_unary_transformer_transform(self):
    --- End diff --
    
    Could you also please test validateInputType?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [SPARK-21633][ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    LGTM
    Merging with master
    Thanks @ajaysaini725 !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131258120
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1957,6 +1987,24 @@ def test_chisquaretest(self):
             self.assertTrue(all(field in fieldNames for field in expectedFields))
     
     
    +class UnaryTransformerTests(SparkSessionTestCase):
    +
    +    def test_unary_transformer_transform(self):
    +        shiftVal = 3
    +        transformer = MockUnaryTransformer(shiftVal=shiftVal)\
    +            .setInputCol("input").setOutputCol("output")
    +
    +        df = self.spark.range(0, 10).toDF('input')
    +        df = df.withColumn("input", df.input.cast(dataType="double"))
    +
    +        transformed_df = transformer.transform(df)
    +        output = transformed_df.select("output").collect()
    --- End diff --
    
    It's better practice to select both input & output and collect both for comparison, rather than relying on DataFrame rows maintaining their order.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80225/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #80225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80225/testReport)** for PR 18746 at commit [`527bc88`](https://github.com/apache/spark/commit/527bc884a29df0d0e105912557a0f9fb205f1ba9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #80225 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80225/testReport)** for PR 18746 at commit [`527bc88`](https://github.com/apache/spark/commit/527bc884a29df0d0e105912557a0f9fb205f1ba9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    @ajaysaini725 Is there a JIRA for this PR?  Please tag this PR in the title.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #80183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80183/testReport)** for PR 18746 at commit [`692aa5d`](https://github.com/apache/spark/commit/692aa5d89d0446cba09db8ce70c4587cce043704).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131223795
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,53 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +    """
    +    Abstract class for transformers that tae one input column, apply a transoformation to it,
    +    and output the result as a new column.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    @abstractmethod
    +    def createTransformFunc(self):
    +        """
    +        Creates the transoform function using the given param map.
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def outputDataType(self):
    +        """
    +        Returns the data type of the output column as a sql type
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def validateInputType(self, inputType):
    +        """
    +        Validates the input type. Throws an exception if it is invalid.
    +        """
    +        raise NotImplementedError()
    +
    +    def transformSchema(self, schema):
    +        inputType = schema[self.getInputCol()].dataType
    +        self.validateInputType(inputType)
    +        if self.getOutputCol() in schema.names:
    +            raise ValueError("Output column %s already exists." % self.getOutputCol())
    +        outputFields = copy.copy(schema.fields)
    +        outputFields.append(StructField(self.getOutputCol(),
    +                                        self.outputDataType(),
    +                                        nullable=False))
    +        return StructType(outputFields)
    +
    +    def _transform(self, dataset):
    +        self.transformSchema(dataset.schema)
    +        transformFunc = self.createTransformFunc()
    --- End diff --
    
    remove


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Also, you can remove "implemented" from the title.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #80183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80183/testReport)** for PR 18746 at commit [`692aa5d`](https://github.com/apache/spark/commit/692aa5d89d0446cba09db8ce70c4587cce043704).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131222290
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,53 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +    """
    +    Abstract class for transformers that tae one input column, apply a transoformation to it,
    --- End diff --
    
    Actually multiple typos.  Why not just copy the text from Scala?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: Implemented UnaryTransformer in Python

Posted by ajaysaini725 <gi...@git.apache.org>.

Github user ajaysaini725 commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    @jkbradley @thunterdb @MrBago Could you please review this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79991/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [SPARK-21633][ML][Python] UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80228/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r130940500
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,44 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +
    +    @abstractmethod
    +    def createTransformFunc(self):
    +        """
    +        Creates the transoform function using the given param map.
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def outputDataType(self):
    +        """
    +        Returns the data type of the output column as a sql type
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def validateInputType(self, inputType):
    +        """
    +        Validates the input type. Throws an exception if it is invalid.
    +        """
    +        raise NotImplementedError()
    +
    +    def transformSchema(self, schema):
    +        inputType = schema[self.getInputCol()].dataType
    +        self.validateInputType(inputType)
    +        if self.getOutputCol() in schema.names:
    +            raise ValueError("Output column %s already exists." % self.getOutputCol())
    +        outputFields = copy.copy(schema.fields)
    +        outputFields.append(StructField(self.getOutputCol(),
    +                                        self.outputDataType(),
    +                                        nullable=False))
    +        return StructType(outputFields)
    +
    +    def transform(self, dataset, paramMap=None):
    +        transformSchema(dataset.schema())
    +        transformUDF = udf(self.createTransformFunc(), self.outputDataType())
    +        dataset.withColumn(self.getOutputCol(), transformUDF(self.getInputCol()))
    --- End diff --
    
    It seems that we should use
    `transformUDF = udf(self.createTransformFunc, self.outputDataType())`
    instead of
    `transformUDF = udf(self.createTransformFunc(), self.outputDataType())`
    ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131222861
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,53 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +    """
    +    Abstract class for transformers that tae one input column, apply a transoformation to it,
    +    and output the result as a new column.
    +
    +    .. versionadded:: 2.3.0
    +    """
    +
    +    @abstractmethod
    +    def createTransformFunc(self):
    +        """
    +        Creates the transoform function using the given param map.
    --- End diff --
    
    Please use the IntelliJ spellcheck feature


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131293810
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1957,6 +1988,40 @@ def test_chisquaretest(self):
             self.assertTrue(all(field in fieldNames for field in expectedFields))
     
     
    +class UnaryTransformerTests(SparkSessionTestCase):
    +
    +    def test_unary_transformer_validate_input_type(self):
    +        shiftVal = 3
    +        transformer = MockUnaryTransformer(shiftVal=shiftVal)\
    +            .setInputCol("input").setOutputCol("output")
    +
    +        # should not raise any errors
    +        transformer.validateInputType(DoubleType())
    +
    +        with self.assertRaises(TypeError):
    +            # passing the wrong input type should raise an error
    +            transformer.validateInputType(IntegerType())
    +
    +    def test_unary_transformer_transform(self):
    +        shiftVal = 3
    +        transformer = MockUnaryTransformer(shiftVal=shiftVal)\
    +            .setInputCol("input").setOutputCol("output")
    +
    +        df = self.spark.range(0, 10).toDF('input')
    +        df = df.withColumn("input", df.input.cast(dataType="double"))
    +
    +        transformed_df = transformer.transform(df)
    +        inputCol = transformed_df.select("input").collect()
    --- End diff --
    
    Do this instead:
    ```
    results = transformed_df.select("input", "output").collect()
    for res in results:
       self.assertEqual(res.input + shiftVal, res.output)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79988/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r131257864
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1957,6 +1987,24 @@ def test_chisquaretest(self):
             self.assertTrue(all(field in fieldNames for field in expectedFields))
     
     
    +class UnaryTransformerTests(SparkSessionTestCase):
    +
    +    def test_unary_transformer_transform(self):
    +        shiftVal = 3
    +        transformer = MockUnaryTransformer(shiftVal=shiftVal)\
    +            .setInputCol("input").setOutputCol("output")
    --- End diff --
    
    style: indent 2 spaces, not 4


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [SPARK-21633][ML][Python] UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] UnaryTransformer in Python

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    @ajaysaini725 Is there a JIRA for this PR? Please tag this PR in the title.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #79988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79988/testReport)** for PR 18746 at commit [`960de95`](https://github.com/apache/spark/commit/960de95330f31c46bdfde8f3f7d634d0edac18d4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #79988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79988/testReport)** for PR 18746 at commit [`960de95`](https://github.com/apache/spark/commit/960de95330f31c46bdfde8f3f7d634d0edac18d4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...

Posted by ajaysaini725 <gi...@git.apache.org>.

Github user ajaysaini725 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r130775557
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,44 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +
    +    @abstractmethod
    +    def createTransformFunc(self):
    +        """
    +        Creates the transoform function using the given param map.
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def outputDataType(self):
    +        """
    +        Returns the data type of the output column as a sql type
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def validateInputType(self, inputType):
    +        """
    +        Validates the input type. Throws an exception if it is invalid.
    +        """
    +        raise NotImplementedError()
    +
    +    def transformSchema(self, schema):
    +        inputType = schema[self.getInputCol()].dataType
    +        self.validateInputType(inputType)
    +        if self.getOutputCol() in schema.names:
    +            raise ValueError("Output column %s already exists." % self.getOutputCol())
    +        outputFields = copy.copy(schema.fields)
    +        outputFields.append(StructField(self.getOutputCol(),
    +                                        self.outputDataType(),
    +                                        nullable=False))
    +        return StructType(outputFields)
    +
    +    def transform(self, dataset, paramMap=None):
    +        transformSchema(dataset.schema())
    --- End diff --
    
    Right, I accidentally overrode transform instead of _transform. Fixed!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...

Posted by ajaysaini725 <gi...@git.apache.org>.

Github user ajaysaini725 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r130775517
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,44 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +
    +    @abstractmethod
    +    def createTransformFunc(self):
    +        """
    +        Creates the transoform function using the given param map.
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def outputDataType(self):
    +        """
    +        Returns the data type of the output column as a sql type
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def validateInputType(self, inputType):
    +        """
    +        Validates the input type. Throws an exception if it is invalid.
    +        """
    +        raise NotImplementedError()
    +
    +    def transformSchema(self, schema):
    +        inputType = schema[self.getInputCol()].dataType
    +        self.validateInputType(inputType)
    +        if self.getOutputCol() in schema.names:
    +            raise ValueError("Output column %s already exists." % self.getOutputCol())
    +        outputFields = copy.copy(schema.fields)
    +        outputFields.append(StructField(self.getOutputCol(),
    +                                        self.outputDataType(),
    +                                        nullable=False))
    +        return StructType(outputFields)
    +
    +    def transform(self, dataset, paramMap=None):
    +        transformSchema(dataset.schema())
    +        transformUDF = udf(self.createTransformFunc(), self.outputDataType())
    +        dataset.withColumn(self.getOutputCol(), transformUDF(self.getInputCol()))
    --- End diff --
    
    self.createTransformFunc returns a function which is passed to the udf so in this case I think it is okay


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #79991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79991/testReport)** for PR 18746 at commit [`11f8f29`](https://github.com/apache/spark/commit/11f8f29efdfc9557f379fbb1bb48724d16ec75fb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r130518147
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,44 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +
    +    @abstractmethod
    +    def createTransformFunc(self):
    +        """
    +        Creates the transoform function using the given param map.
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def outputDataType(self):
    +        """
    +        Returns the data type of the output column as a sql type
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def validateInputType(self, inputType):
    +        """
    +        Validates the input type. Throws an exception if it is invalid.
    +        """
    +        raise NotImplementedError()
    +
    +    def transformSchema(self, schema):
    +        inputType = schema[self.getInputCol()].dataType
    +        self.validateInputType(inputType)
    +        if self.getOutputCol() in schema.names:
    +            raise ValueError("Output column %s already exists." % self.getOutputCol())
    +        outputFields = copy.copy(schema.fields)
    +        outputFields.append(StructField(self.getOutputCol(),
    +                                        self.outputDataType(),
    +                                        nullable=False))
    +        return StructType(outputFields)
    +
    +    def transform(self, dataset, paramMap=None):
    +        transformSchema(dataset.schema())
    --- End diff --
    
    Here seems exist some problem.
    The transform provide `paramMap`, but `createTransformFunc` has no way to get the passed in `paramMap`, here lost something I think.
    Because custom UnaryTransformer will only need to override the `createTransformFunc`, the base class need to handle the passed in `paramMap` properly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    **[Test build #79991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79991/testReport)** for PR 18746 at commit [`11f8f29`](https://github.com/apache/spark/commit/11f8f29efdfc9557f379fbb1bb48724d16ec75fb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18746: [ML][Python] Implemented UnaryTransformer in Python

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18746
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18746#discussion_r130518773
  
    --- Diff: python/pyspark/ml/base.py ---
    @@ -116,3 +121,44 @@ class Model(Transformer):
         """
     
         __metaclass__ = ABCMeta
    +
    +
    +@inherit_doc
    +class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
    +
    +    @abstractmethod
    +    def createTransformFunc(self):
    +        """
    +        Creates the transoform function using the given param map.
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def outputDataType(self):
    +        """
    +        Returns the data type of the output column as a sql type
    +        """
    +        raise NotImplementedError()
    +
    +    @abstractmethod
    +    def validateInputType(self, inputType):
    +        """
    +        Validates the input type. Throws an exception if it is invalid.
    +        """
    +        raise NotImplementedError()
    +
    +    def transformSchema(self, schema):
    +        inputType = schema[self.getInputCol()].dataType
    +        self.validateInputType(inputType)
    +        if self.getOutputCol() in schema.names:
    +            raise ValueError("Output column %s already exists." % self.getOutputCol())
    +        outputFields = copy.copy(schema.fields)
    +        outputFields.append(StructField(self.getOutputCol(),
    +                                        self.outputDataType(),
    +                                        nullable=False))
    +        return StructType(outputFields)
    +
    +    def transform(self, dataset, paramMap=None):
    +        transformSchema(dataset.schema())
    +        transformUDF = udf(self.createTransformFunc(), self.outputDataType())
    +        dataset.withColumn(self.getOutputCol(), transformUDF(self.getInputCol()))
    --- End diff --
    
    The `udf` need first parameter to be a function, but here why you pass in the return value of `self.createTransformFunc` ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org